WO2022052782A1 - Image processing method and related device

Image processing method and related device

Info

Publication number
WO2022052782A1
Authority
WO
WIPO (PCT)
Prior art keywords
current image
map
depth map
feature
pose
Prior art date
Application number
PCT/CN2021/113635
Other languages
English (en)
Chinese (zh)
Inventor
曾柏伟
柳跃天
宋晗
吴昊
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202180062229.6A (published as CN116097307A)
Publication of WO2022052782A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/006Mixed reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • G06T2207/10021Stereoscopic video; Stereoscopic image sequence

Definitions

  • the present application relates to the field of augmented reality (AR), and in particular, to an image processing method and related equipment for realizing realistic virtual and real occlusion effects.
  • AR is the use of computer-generated virtual information to supplement the real world, making the virtual information and the real world appear to coexist in the same space.
  • most AR applications simply superimpose virtual objects in front of the real scene and do not properly handle the occlusion relationship between virtual objects and the real world, which easily confuses the user's perception of spatial position and cannot deliver a sensory experience beyond reality.
  • the effect of no virtual and real occlusion is shown in Figure 1a.
  • in AR applications it is necessary to handle the true spatial relationship between virtual objects and real scenes, that is, virtual-real occlusion.
  • the effect of virtual and real occlusion is shown in Figure 1b.
  • the correct occlusion relationship allows users to have a natural and correct spatial perception in AR applications; the wrong occlusion relationship will reduce the realism of AR applications.
  • the iPad Pro 2020 released by Apple uses RGB images and the depth map collected by a direct time of flight (dToF) camera, together with machine learning methods, to obtain a depth map for enhancing the virtual-real occlusion effect, and realizes a realistic virtual-real occlusion effect based on that depth map.
  • the present application provides an image processing method and related equipment based on device-cloud joint depth estimation, which solves the problem of scale ambiguity and inconsistency in virtual and real occlusion effects.
  • an embodiment of the present application provides an image processing method based on device-cloud joint depth estimation, including:
  • obtaining the current image; calculating the first depth map corresponding to the current image according to the current image; obtaining the second depth map corresponding to the current image from the server according to the current image, where the depth information of the feature points in the current image represented by the second depth map is richer than that represented by the first depth map, and the first depth map contains depth information of feature points that the second depth map does not have;
  • obtaining the target depth map of the current image according to the current image, the first depth map and the second depth map, where the depth information of the feature points in the current image represented by the target depth map is richer than that represented by the second depth map; obtaining a virtual object image and a depth map of the virtual object;
  • obtaining the target image, that is, the image with the virtual-real occlusion effect, according to the target depth map, the depth map of the virtual object, the current image and the virtual object image.
  • the feature point in this application refers to a point where the gray value of the image changes drastically or a point with a large curvature on the edge of the image.
  • the statement that the depth information of the feature points in the current image represented by the second depth map is richer than that represented by the first depth map is specifically reflected in the following three aspects:
  • the number of pixels in the second depth map is greater than the number of pixels in the first depth map.
  • the first depth map is obtained by the terminal device based on the feature points extracted from the image; some of the pixels of the second depth map are obtained based on the offline map pre-stored in the server, and the other pixels are obtained from feature points extracted from the image.
  • the number of 2D feature points extracted by the terminal device is lower than the number extracted by the server: the terminal device generally extracts several hundred 2D feature points, while the server generally extracts tens of thousands. Since 3D points correspond one-to-one with 2D feature points, the number of 3D points obtained by the terminal device from its extracted 2D feature points is smaller than the number obtained by the server from its extracted 2D feature points.
  • consequently, the number of pixels in the depth map obtained by projecting the 3D points determined by the server is greater than the number of pixels in the depth map obtained by projecting the 3D points determined by the terminal device; in addition, the second depth map also includes pixels obtained from the offline map pre-stored in the server, so the number of pixels in the second depth map is greater than the number of pixels in the first depth map.
  • the accuracy of the pixels in the second depth map is higher than that of the pixels in the first depth map.
  • the second depth map is obtained from the offline map pre-stored in the server.
  • the 3D points in the offline map are determined using calibrated laser equipment, while the first depth map is obtained based on feature point matching between the current image and historical images, so the accuracy of the pixels in the second depth map is higher than that of the pixels in the first depth map.
  • the distribution of pixels in the second depth map is more uniform than the distribution of pixels in the first depth map.
  • the pixels in the target depth map can be regarded as the union of the pixels in the first depth map and the pixels in the second depth map, so the number of pixels in the target depth map is greater than or equal to the number of pixels in the second depth map. Therefore, the depth information of the feature points in the current image represented by the target depth map is richer than the depth information of the feature points in the current image represented by the second depth map.
  • the pixel value at pixel position S in the target depth map is the maximum of the pixel value at pixel position S in the first depth map and the pixel value at pixel position S in the second depth map; where only one of the first depth map and the second depth map has a value at pixel position S, the target depth map takes that value at pixel position S.
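  • the merge rule above can be illustrated with a short sketch (a minimal illustration, not the patent's implementation): it assumes the sparse depth maps are stored as dense arrays of the same size in which 0 marks a pixel without a depth value.

```python
import numpy as np

def merge_depth_maps(first_depth, second_depth):
    """Merge two sparse depth maps into a target depth map.

    Pixels present in both maps take the maximum of the two values;
    pixels present in only one map keep that map's value, so the result
    covers the union of the two sets of valid pixels (0 = no depth).
    """
    return np.maximum(first_depth, second_depth)

# Tiny example: two 2x3 sparse depth maps (0 = missing).
first = np.array([[1.2, 0.0, 0.0],
                  [0.0, 3.5, 0.0]])
second = np.array([[1.4, 2.0, 0.0],
                   [0.0, 0.0, 5.1]])
print(merge_depth_maps(first, second))
```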
  • the target depth map is obtained based on the first depth map and the second depth map, whereas in the prior art the depth map used to enhance the virtual-real occlusion effect is obtained from the RGB image and the depth map collected by the dToF camera. It can be seen that the depth information represented by the target depth map is richer than the depth information represented by the prior-art depth map. Therefore, the target image obtained based on the target depth map, the depth map of the virtual object, the current image and the virtual object image has a virtual-real occlusion effect without inter-frame flickering and instability.
  • the second depth map is obtained from the offline map stored in the server according to the current image.
  • the target depth map obtained from the current image and the second depth map can be regarded as being based on multi-view depth estimation; compared with the monocular depth estimation of the prior art, the virtual-real occlusion effect of the target image obtained based on the target depth map, the depth map of the virtual object, the current image and the virtual object image has no scale discrepancies or inconsistencies.
  • the target depth map and the depth map of the virtual object are used to determine the representation and distribution of the pixels of the virtual object image and the pixels of the current image in the target image.
  • for a pixel point P in the target image, if the first depth value is greater than the second depth value, the pixel value of pixel point P is the pixel value of the pixel point corresponding to P in the virtual object image; if the first depth value is not greater than the second depth value, the pixel value of pixel point P is the pixel value of the pixel point corresponding to P in the current image; where the first depth value and the second depth value are, respectively, the depth value corresponding to pixel point P in the target depth map and the depth value corresponding to pixel point P in the depth map of the virtual object.
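  • the per-pixel selection rule can be sketched as follows; this is a minimal illustration (not the patent's implementation), assuming dense depth arrays with 0 marking pixels that have no depth and images stored as HxWx3 arrays. The variable names are illustrative.

```python
import numpy as np

def compose_target_image(current_image, virtual_image,
                         target_depth, virtual_depth):
    """Per-pixel virtual-real occlusion compositing.

    Where the first depth value (target depth map) is greater than the
    second depth value (virtual object's depth map), the virtual object
    is closer to the camera, so the virtual pixel is used; otherwise the
    pixel of the current image is kept.
    """
    virtual_valid = virtual_depth > 0          # virtual object covers this pixel
    virtual_in_front = (target_depth > virtual_depth) & virtual_valid
    return np.where(virtual_in_front[..., None], virtual_image, current_image)
```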
  • the second depth map corresponding to the current image is obtained from the server according to the current image, including:
  • obtaining a local map from the server according to the current image, where the local map includes scene data corresponding to the current image; obtaining the second pose of the current image, where the second pose of the current image is the pose of the terminal device when shooting the current image; and projecting according to the second pose of the current image and the 3D point cloud corresponding to the local map to obtain the second depth map corresponding to the current image.
  • the local map is the area indicated by the first geographic location information in the offline map stored by the server; the first geographic location information is the geographic location information in the first pose of the current image; the first pose of the current image is the pose of the current image in the coordinate system where the offline map is located, and is obtained by processing the current image based on VPS technology.
  • projecting according to the second pose of the current image and the 3D point cloud corresponding to the local map to obtain the second depth map corresponding to the current image specifically refers to: projecting the 3D point cloud corresponding to the local map onto the imaging plane of the current image according to the second pose of the current image to obtain the second depth map.
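  • a minimal sketch of this projection step, assuming a pinhole camera with intrinsic matrix K and a pose expressed as a world-to-camera rotation R and translation t (these symbols are illustrative assumptions, not notation from the patent):

```python
import numpy as np

def project_points_to_depth(points_world, R, t, K, height, width):
    """Project 3D points (N x 3, in map coordinates) onto the current
    image plane and build a sparse depth map (0 = no depth)."""
    points_cam = points_world @ R.T + t              # world -> camera
    points_cam = points_cam[points_cam[:, 2] > 0]    # keep points in front
    uvw = points_cam @ K.T                           # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = points_cam[:, 2]
    depth = np.zeros((height, width), dtype=np.float32)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        # keep the nearest point when several project to the same pixel
        if depth[vi, ui] == 0 or zi < depth[vi, ui]:
            depth[vi, ui] = zi
    return depth
```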
  • the offline map includes a panoramic map, 2D feature points and 3D points corresponding to the 2D feature points.
  • the 2D feature points are 2D feature points obtained by feature extraction on the panoramic map.
  • the panoramic map is obtained based on a multi-frame base map, such as stitching based on a multi-frame base map.
  • the base map is an RGB image, of course, it can also be a map of other forms, which is not limited here.
  • the offline map is collected using the calibrated laser panoramic camera equipment.
  • the pose in this application refers to a pose with 6 degrees of freedom, and the pose includes geographic location information and orientation information.
  • obtaining the second depth map corresponding to the current image from the server according to the current image further comprises:
  • performing feature extraction on the current image to obtain first 2D feature points, and matching them with the eighth 2D feature points in historical images to obtain second 2D feature points of the current image, where a second 2D feature point is a first 2D feature point of the current image that matches an eighth 2D feature point in a historical image; obtaining the 3D point corresponding to each second 2D feature point of the current image according to the second pose of the current image, the second 2D feature point of the current image, the 2D feature point among the eighth 2D feature points in the historical image that matches the second 2D feature point, and the second pose of the historical image; the second pose of the current image is the pose of the terminal device when the current image is captured;
  • the current image is projected according to the second pose of the current image, the 3D point cloud corresponding to the local map, and the 3D point corresponding to the second 2D feature point of the current image to obtain a second depth map.
  • projecting according to the second pose of the current image, the 3D point cloud corresponding to the local map, and the 3D points corresponding to the second 2D feature points of the current image to obtain the second depth map specifically refers to: projecting, according to the second pose of the current image, the 3D point cloud corresponding to the local map and the 3D points corresponding to the second 2D feature points of the current image onto the imaging plane of the current image to obtain the second depth map.
  • the first 2D feature points of the current image and the eighth 2D feature points in the historical images may be Oriented FAST and Rotated BRIEF (ORB) feature points, where FAST stands for features from accelerated segment test and BRIEF stands for binary robust independent elementary features; they may also be Accelerated-KAZE (AKAZE) feature points, deep-learning-based ASLFeat or SuperPoint feature points obtained using knowledge distillation or network search, Difference of Gaussians (DoG) feature points, histogram of oriented gradients (HOG) feature points, BRIEF feature points, BRISK feature points, or FREAK feature points.
  • the ORB method, or an improved method based on ORB, is used to extract features from texture-rich areas of the image; the features extracted by this method are ORB feature points, and the number of feature points can be preset, generally several hundred.
  • the AKAZE feature points can be obtained by extracting the features of the image using the AKAZE feature extraction method.
  • Other feature points in this embodiment can be obtained by extracting features from an image by using a corresponding method.
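  • for the on-device extraction of a few hundred ORB feature points described above, an OpenCV sketch might look like this (the feature budget of 500 is an illustrative preset, not a value taken from the patent):

```python
import cv2

orb = cv2.ORB_create(nfeatures=500)   # preset budget of a few hundred keypoints
image = cv2.imread("current_image.jpg", cv2.IMREAD_GRAYSCALE)
keypoints, descriptors = orb.detectAndCompute(image, None)
print(f"extracted {len(keypoints)} ORB feature points")
```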
  • in this way, the obtained second depth map represents richer depth information of the feature points in the current image, which in turn enriches the depth information of the feature points in the current image represented by the target depth map.
  • obtaining the second depth map corresponding to the current image according to the current image includes:
  • sending a second acquisition request to the server, where the second acquisition request carries the current image and instructs the server to acquire the second depth map according to the current image and the offline map stored by the server;
  • receiving a second response message sent by the server in response to the second acquisition request, where the second response message carries the second depth map.
  • the second depth map is acquired from the server. Since the second depth map is calculated by the server, the terminal device does not need to consume computing resources for calculation, thus reducing resource consumption and power consumption of the terminal device.
  • calculating the first depth map corresponding to the current image according to the current image includes:
  • acquiring the first pose of the current image, where the first pose is the pose of the current image in the first coordinate system and the first coordinate system is the coordinate system where the offline map stored by the server is located; processing the current image to obtain a first image, where the first image is the image obtained by transforming the second pose of the current image into the first pose, and the second pose of the current image is the pose when the terminal device shoots the current image; performing feature extraction on the first image to obtain third 2D feature points of the first image; matching the third 2D feature points with the eighth 2D feature points of multiple historical images to obtain fourth 2D feature points of the first image, where a fourth 2D feature point is a third 2D feature point that matches an eighth 2D feature point of the multiple historical images; obtaining the 3D point corresponding to each fourth 2D feature point according to the fourth 2D feature point, the first pose of the current image, the eighth 2D feature point that matches the fourth 2D feature point, and the first pose of the historical image to which that 2D feature point belongs; and projecting according to the first pose of the current image and the 3D points corresponding to the fourth 2D feature points to obtain the first depth map.
  • projecting according to the first pose of the current image and the 3D points corresponding to the fourth 2D feature points to obtain the first depth map specifically refers to: projecting, according to the first pose of the current image, the 3D points corresponding to the fourth 2D feature points onto the imaging plane of the current image to obtain the first depth map.
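  • a simplified sketch of how matched 2D feature points between the current image and a historical image could be triangulated into 3D points and projected back to form the sparse first depth map; it assumes known intrinsics K and 3x4 world-to-camera [R|t] matrices for both views, and is an illustration rather than the patent's exact algorithm.

```python
import cv2
import numpy as np

def triangulate_and_project(K, pose_cur, pose_hist, pts_cur, pts_hist,
                            height, width):
    """pts_cur / pts_hist: 2xN arrays of matched pixel coordinates.
    pose_cur / pose_hist: 3x4 [R|t] world-to-camera matrices."""
    P_cur, P_hist = K @ pose_cur, K @ pose_hist
    points_h = cv2.triangulatePoints(P_cur, P_hist, pts_cur, pts_hist)
    points_3d = (points_h[:3] / points_h[3]).T        # N x 3 world points

    # Project the triangulated points back into the current view.
    points_cam = points_3d @ pose_cur[:, :3].T + pose_cur[:, 3]
    depth = np.zeros((height, width), dtype=np.float32)
    for X, Y, Z in points_cam:
        if Z <= 0:
            continue
        u = int(round(K[0, 0] * X / Z + K[0, 2]))
        v = int(round(K[1, 1] * Y / Z + K[1, 2]))
        if 0 <= u < width and 0 <= v < height:
            depth[v, u] = Z
    return depth
```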
  • the third 2D feature points may be ORB feature points, AKAZE feature points, deep-learning-based ASLFeat or SuperPoint feature points obtained using knowledge distillation or network search, or DoG feature points.
  • acquiring the first pose of the current image includes:
  • the terminal device obtains the first pose or pose transformation information of the current image, so that the coordinate system of the current image is the same as the coordinate system of the offline map, which facilitates subsequent calculations.
  • obtaining the target depth map of the current image according to the current image, the first depth map and the second depth map specifically includes:
  • sending a fifth acquisition request to the server, where the fifth acquisition request carries the geographic location information of the terminal; receiving a fifth response message sent by the server in response to the fifth acquisition request, where the fifth response message carries the depth estimation model;
  • the depth estimation model is the neural network model corresponding to the geographic location information of the terminal;
  • the first depth map and the second depth map are spliced to obtain the third depth map;
  • the current image and the third depth map are input into the depth estimation model to obtain the target depth map.
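  • the splicing and inference step can be sketched in PyTorch terms; this is a minimal illustration that assumes "splicing" means channel-wise concatenation of the two depth maps (an assumption for illustration), and the depth estimation model is whatever model was downloaded from the server.

```python
import torch

def estimate_target_depth(model, current_image, first_depth, second_depth):
    """current_image: 1x3xHxW tensor; first_depth, second_depth: 1x1xHxW."""
    # Splice the first and second depth maps into the third depth map
    # (assumed here to be channel-wise concatenation).
    third_depth = torch.cat([first_depth, second_depth], dim=1)    # 1x2xHxW
    model_input = torch.cat([current_image, third_depth], dim=1)   # 1x5xHxW
    with torch.no_grad():
        target_depth = model(model_input)                          # 1x1xHxW
    return target_depth
```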
  • a plurality of depth estimation models are stored in the server, and the plurality of depth estimation models are in one-to-one correspondence with a plurality of geographic location information.
  • obtaining the target depth map of the current image according to the current image, the first depth map and the second depth map is specifically:
  • the initial convolutional neural network is trained to obtain a depth estimation model, including:
  • the server determines a depth estimation model for each geographic location.
  • when a depth estimation model needs to be used later, the server determines the depth estimation model corresponding to the geographic location of the terminal device, and the target depth map is obtained based on that depth estimation model, which further increases the density of the pixels in the target depth map of the current image, that is, enriches the depth information of the feature points in the current image represented by the target depth map.
  • the target depth map of the current image is obtained according to the current image and the second depth map, including:
  • performing feature extraction on the current image to obtain T first feature maps and performing feature extraction on the third depth map to obtain T second feature maps, where the resolutions of the T first feature maps are all different, the resolutions of the T second feature maps are all different, and T is an integer greater than 1; superimposing the first feature map and the second feature map with the same resolution among the T first feature maps and the T second feature maps to obtain T third feature maps; and upsampling and fusing the T third feature maps to obtain the target depth map of the current image; where the third depth map is obtained by splicing the first depth map and the second depth map.
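  • a highly simplified PyTorch sketch of this multi-resolution fusion with T = 3: two small convolutional encoders produce feature maps at three resolutions for the current image and the third depth map, feature maps of the same resolution are superimposed (added), and the result is progressively upsampled and fused into a depth prediction. All layer sizes and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Produces T = 3 feature maps at 1/1, 1/2 and 1/4 resolution."""
    def __init__(self, in_ch):
        super().__init__()
        self.c1 = nn.Conv2d(in_ch, 16, 3, stride=1, padding=1)
        self.c2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.c3 = nn.Conv2d(32, 64, 3, stride=2, padding=1)

    def forward(self, x):
        f1 = F.relu(self.c1(x))
        f2 = F.relu(self.c2(f1))
        f3 = F.relu(self.c3(f2))
        return [f1, f2, f3]

class FusionDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_enc = TinyEncoder(3)   # first feature maps (current image)
        self.dep_enc = TinyEncoder(2)   # second feature maps (third depth map)
        self.up2 = nn.Conv2d(64, 32, 3, padding=1)
        self.up1 = nn.Conv2d(32, 16, 3, padding=1)
        self.head = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, image, third_depth):
        firsts = self.img_enc(image)
        seconds = self.dep_enc(third_depth)
        # Superimpose feature maps of the same resolution -> third feature maps.
        thirds = [a + b for a, b in zip(firsts, seconds)]
        # Upsample and fuse from coarse to fine.
        x = F.interpolate(thirds[2], scale_factor=2, mode="bilinear",
                          align_corners=False)
        x = F.relu(self.up2(x)) + thirds[1]
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)
        x = F.relu(self.up1(x)) + thirds[0]
        return self.head(x)             # predicted target depth map
```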
  • the target depth map of the current image is obtained according to the current image and the second depth map, including:
  • performing feature extraction on the current image, the third depth map and the reference depth map to obtain T first feature maps, T second feature maps and T fourth feature maps respectively, where the resolutions of the T first feature maps are all different, the resolutions of the T second feature maps are all different, and the resolutions of the T fourth feature maps are all different;
  • the reference depth map is obtained from the depth map collected by the time of flight (TOF) camera, and T is an integer greater than 1;
  • superimposing the first feature map, the second feature map and the fourth feature map with the same resolution among the T first feature maps, T second feature maps and T fourth feature maps to obtain T fifth feature maps;
  • upsampling and fusing the T fifth feature maps to obtain the target depth map of the current image; where the third depth map is obtained by splicing the first depth map and the second depth map.
  • a reference depth map based on the depth map collected by the TOF camera is introduced, which further enriches the depth information of the current image represented by the target depth map.
  • the reference depth map is obtained according to the image collected by the TOF camera, and specifically includes:
  • the depth map collected by the TOF camera is projected into the three-dimensional space according to the pose of the current image to obtain the fourth depth map; the fourth depth map is back projected onto the reference image according to the pose of the reference image to obtain the reference depth map; the reference image is the image adjacent to the current image in acquisition time.
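  • a minimal sketch of this reprojection, assuming pinhole intrinsics K, a 4x4 camera-to-world pose for the current view and a 4x4 world-to-camera pose for the reference view (illustrative names, not notation from the patent):

```python
import numpy as np

def reproject_depth(depth_cur, K, cam_to_world_cur, world_to_cam_ref,
                    ref_height, ref_width):
    """Lift the TOF depth map of the current view into 3D, then back-project
    it into the reference view to obtain the reference depth map (0 = missing)."""
    v, u = np.nonzero(depth_cur)                          # pixels with depth
    z = depth_cur[v, u]
    x = (u - K[0, 2]) / K[0, 0] * z                       # unproject to camera frame
    y = (v - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_world = pts_cam @ cam_to_world_cur.T              # camera -> 3D space
    pts_ref = pts_world @ world_to_cam_ref.T              # world -> reference camera
    zr = pts_ref[:, 2]
    keep = zr > 0
    ur = np.round(K[0, 0] * pts_ref[keep, 0] / zr[keep] + K[0, 2]).astype(int)
    vr = np.round(K[1, 1] * pts_ref[keep, 1] / zr[keep] + K[1, 2]).astype(int)
    zr = zr[keep]
    ref_depth = np.zeros((ref_height, ref_width), dtype=np.float32)
    inside = (ur >= 0) & (ur < ref_width) & (vr >= 0) & (vr < ref_height)
    for ui, vi, zi in zip(ur[inside], vr[inside], zr[inside]):
        if ref_depth[vi, ui] == 0 or zi < ref_depth[vi, ui]:
            ref_depth[vi, ui] = zi
    return ref_depth
```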
  • the upsampling and fusion processing includes:
  • the current image and the third depth map are input into the depth estimation model of the current image for processing to obtain the target depth map of the current image; the depth estimation model is implemented based on a convolutional neural network, and the third depth map is obtained by splicing the first depth map and the second depth map.
  • the target image is obtained according to the target depth map, the depth map of the virtual object, the current image and the virtual object image, including:
  • optimizing the target depth map to obtain an optimized depth map, where the accuracy of the optimized depth map is higher than that of the target depth map; segmenting the optimized depth map to obtain the foreground depth map and the background depth map of the current image, where the background depth map is the depth map containing the background area in the optimized depth map and the foreground depth map is the depth map containing the foreground area in the optimized depth map; fusing L background depth maps according to the L first poses corresponding to the L background depth maps to obtain a fused three-dimensional scene, where the L background depth maps include the background depth maps of pre-stored images and the background depth map of the current image, the L first poses include the first poses of the pre-stored images and of the current image, and L is an integer greater than 1; back-projecting the fused 3D scene according to the pose of the current image to obtain a fused background depth map; splicing the fused background depth map and the foreground depth map of the current image to obtain an updated depth map; and obtaining the target image according to the updated depth map, the depth map of the virtual object, the current image and the virtual object image.
  • in this way, a depth map with sharp edges is obtained; the pixel values in the target image are then determined by comparing the depth values at the same positions in the sharp-edged depth map and in the depth map of the virtual object, so that the occlusion relationship between the current image and the virtual object in the target image is correct.
  • the current image includes a target person
  • the method of the present application further includes:
  • segmenting the optimized depth map to obtain the foreground depth map and background depth map of the current image includes:
  • the optimized depth map is segmented according to the detection result to obtain a foreground depth map and a background depth map of the current image, wherein the foreground depth map of the current image includes the depth map corresponding to the target person.
  • determining the foreground depth map that includes the target person from the optimized depth map, then obtaining the above-mentioned updated depth map based on that foreground depth map, and obtaining the target image based on the updated depth map, the depth map of the virtual object, the virtual object image and the current image enhances the virtual-real occlusion effect between virtual objects and people in the target image; the overall immersion is strong and the user experience is good.
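  • the segmentation-and-splice step can be sketched as follows; a minimal illustration that assumes a binary person mask from some detector or segmenter (how the mask is produced is outside the scope of this sketch) and dense depth arrays in which 0 marks missing values.

```python
import numpy as np

def split_depth(optimized_depth, person_mask):
    """Split the optimized depth map into a foreground depth map (the target
    person) and a background depth map using a binary person mask."""
    foreground = np.where(person_mask, optimized_depth, 0.0)
    background = np.where(person_mask, 0.0, optimized_depth)
    return foreground, background

def splice_updated_depth(fused_background, foreground):
    """Splice the fused background depth map with the foreground depth map:
    foreground pixels keep the person's depth, the rest comes from the
    fused background."""
    return np.where(foreground > 0, foreground, fused_background)
```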
  • an embodiment of the present application provides an image processing method based on device-cloud joint depth estimation, including:
  • receiving a depth estimation model request sent by the terminal device, and determining the depth estimation model of the current image from multiple depth estimation models, where the depth estimation model of the current image is the depth estimation model corresponding to the geographic location information of the terminal device among the multiple depth estimation models, and the multiple depth estimation models are in one-to-one correspondence with multiple pieces of geographic location information;
  • sending to the terminal device a response message in response to the depth estimation model request, where the response message carries the depth estimation model of the current image.
  • the server determines a depth estimation model for each geographic location; when a depth estimation model needs to be used later, the server determines the corresponding depth estimation model based on the position of the terminal device.
  • obtaining the target depth map based on that depth estimation model enriches the depth information of the feature points in the current image represented by the target depth map of the current image; obtaining the target image based on the target depth map, the depth map of the virtual object, the current image and the virtual object image solves the problem of inter-frame flickering and instability in the target image.
  • the method of the present application further includes:
  • depth estimation models corresponding to the multiple pieces of geographic location information are obtained by training respectively, where, for any geographic location information S among the multiple pieces of geographic location information, the depth estimation model corresponding to geographic location information S is obtained by training according to the following steps:
  • the method of the present application further includes:
  • the offline map includes a panoramic map, 2D feature points and 3D points corresponding to the 2D feature points.
  • the 2D feature points are 2D feature points obtained by feature extraction on the panoramic map.
  • the panoramic map is obtained based on a multi-frame base map, such as stitching based on a multi-frame base map.
  • the base map is an RGB image, of course, it can also be a map of other forms, which is not limited here.
  • the offline map is collected using a calibrated laser panoramic camera.
  • obtaining the second depth map corresponding to the current image according to the current image and the offline map includes:
  • acquiring the first pose of the current image, where the first pose of the current image is the pose of the current image in the first coordinate system and the first coordinate system is the coordinate system where the offline map is located;
  • obtaining the 3D point cloud corresponding to the local map from the 3D point cloud corresponding to the offline map according to the position in the first pose, where the local map is the area indicated by the first position in the offline map stored by the server;
  • projecting according to the first pose of the current image and the 3D point cloud corresponding to the local map to obtain the second depth map.
  • projecting according to the first pose of the current image and the 3D point cloud corresponding to the local map to obtain the second depth map specifically refers to: projecting the 3D point cloud corresponding to the local map onto the imaging plane of the current image according to the first pose of the current image to obtain the second depth map.
  • obtaining the second depth map corresponding to the current image according to the current image and the offline map further comprises:
  • performing feature extraction on the current image to obtain ninth 2D feature points and matching them with the tenth 2D feature points in historical images to obtain eleventh 2D feature points of the current image, where an eleventh 2D feature point is a ninth 2D feature point of the current image that matches a tenth 2D feature point in a historical image; obtaining the 3D point corresponding to each eleventh 2D feature point of the current image according to the first pose of the current image, the eleventh 2D feature point, the 2D feature point among the tenth 2D feature points in the historical image that matches the eleventh 2D feature point, and the first pose of the historical image;
  • the current image is projected according to the first pose of the current image, the 3D point cloud corresponding to the local map, and the 3D point corresponding to the eleventh 2D feature point of the current image to obtain the second depth map.
  • projecting according to the first pose of the current image, the 3D point cloud corresponding to the local map, and the 3D points corresponding to the eleventh 2D feature points of the current image to obtain the second depth map specifically refers to: projecting, according to the first pose of the current image, the 3D point cloud corresponding to the local map and the 3D points corresponding to the eleventh 2D feature points of the current image onto the imaging plane of the current image to obtain the second depth map.
  • the ninth 2D feature points and the tenth 2D feature points may be SIFT feature points, speeded-up robust features (SURF) feature points, or deep-learning-based SuperPoint, ASLFeat, R2D2, or D2Net feature points.
  • since the server has strong computing power, when it extracts features from the image it can extract feature points from both texture-rich areas and texture-weak areas.
  • the feature extraction method can be SIFT or an improved method based on SIFT; the extracted feature points can be called SIFT feature points, and their number can be preset, for example about 10,000.
  • the extracted method may also be SURF or an improved method based on SURF, and the extracted feature points may be referred to as SURF feature points.
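  • for the server-side extraction of roughly 10,000 SIFT feature points, an OpenCV sketch might look like this (the feature budget is an illustrative preset):

```python
import cv2

sift = cv2.SIFT_create(nfeatures=10000)   # preset budget of about 10,000 keypoints
image = cv2.imread("current_image.jpg", cv2.IMREAD_GRAYSCALE)
keypoints, descriptors = sift.detectAndCompute(image, None)
print(f"extracted {len(keypoints)} SIFT feature points")
```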
  • in this way, the depth information of the current image represented by the second depth map is further enriched.
  • obtaining the second depth map corresponding to the current image according to the current image and the offline map further comprises:
  • projecting to obtain the second depth map here specifically refers to: projecting the 3D point cloud corresponding to the local map, the 3D points corresponding to the eleventh 2D feature points of the current image, and the 3D points corresponding to the sixth 2D feature points of the current image onto the imaging plane of the current image to obtain the second depth map.
  • projecting these point clouds together to obtain the second depth map further enriches the depth information of the current image represented by the second depth map.
  • the offline map includes a multi-frame base map
  • the method of the present application further includes:
  • the ninth 2D feature point of the current image is matched with the tenth 2D feature point in the multiple historical images to obtain the eleventh 2D feature point of the current image
  • where an eleventh 2D feature point is a ninth 2D feature point of the current image that matches a tenth 2D feature point in a historical image;
  • obtaining the 3D points corresponding to the eleventh 2D feature points according to the eleventh 2D feature points, the first pose of the current image, the 2D feature points among the tenth 2D feature points of the multiple historical images that match the eleventh 2D feature points, and the first poses of the historical images to which those 2D feature points belong; and processing the 3D points corresponding to the eleventh 2D feature points according to the first pose of the current image to obtain the 3D point cloud corresponding to the updated local map;
  • projecting according to the first pose of the current image and the 3D point cloud corresponding to the updated local map to obtain the second depth map specifically refers to: projecting, according to the first pose of the current image, the 3D point cloud corresponding to the updated local map onto the imaging plane of the current image to obtain the second depth map.
  • the eleventh 2D feature point may be a SIFT feature point, a SURF feature point, or a SuperPoint feature point obtained based on deep learning, an ASLFeat feature point, an R2D2 feature point, or a D2Net feature point.
  • because the offline maps in the server are collected offline, they may differ from the current actual environment. For example, a large billboard in a shopping mall may exist when the map is collected offline, but after a period of time, when the user captures the current image, the billboard has been removed; as a result, the map issued by the server contains 3D point cloud information that is inconsistent with the current environment.
  • the image received by the server may be an image after privacy processing, which will also cause the map issued by the server to have 3D point cloud information that is inconsistent with the current environment.
  • the server updates the 3D point cloud corresponding to the distributed local map to obtain the 3D point cloud corresponding to the updated local map, and then obtains the target depth map based on the 3D point cloud corresponding to the updated local map.
  • this enriches the depth information of the feature points in the current image represented by the second depth map.
  • the method of the present application further includes:
  • the fourth acquisition request carries the current image and the second pose of the current image
  • the second pose of the current image is the pose of the terminal device when the current image is captured
  • performing feature point matching according to the current image and the offline map to determine the first pose of the current image, where the first pose is the pose of the current image in the first coordinate system and the first coordinate system is the coordinate system of the offline map;
  • determining pose transformation information according to the second pose and the first pose, where the pose transformation information is used for the transformation between the second pose and the first pose of the current image; sending a fourth response message to the terminal device in response to the fourth acquisition request, where the fourth response message carries the pose transformation information.
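  • the pose transformation information can be illustrated with 4x4 homogeneous pose matrices; a minimal sketch under the assumption that both poses are expressed as camera-to-world transforms in their respective coordinate systems.

```python
import numpy as np

def compute_pose_transform(first_pose, second_pose):
    """first_pose: 4x4 camera-to-world matrix in the offline-map coordinate
    system; second_pose: 4x4 camera-to-world matrix in the device coordinate
    system. The returned transform maps device coordinates into offline-map
    coordinates."""
    return first_pose @ np.linalg.inv(second_pose)

def to_offline_map_frame(pose_transform, device_pose):
    """Apply the pose transformation information to a later device pose to
    obtain the corresponding pose in the offline-map coordinate system."""
    return pose_transform @ device_pose
```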
  • feature point matching is performed according to the current image and the offline map to determine the first pose of the current image, including:
  • after obtaining the first pose or pose transformation information of the current image, the terminal device can align the coordinate system of the current image with the coordinate system of the offline map, which facilitates subsequent calculations.
  • embodiments of the present application provide an electronic device including a memory and one or more processors, where one or more programs are stored in the memory; when the one or more processors execute the one or more programs, the electronic device is caused to implement part or all of the method described in the first aspect or the second aspect.
  • an embodiment of the present application provides a computer storage medium that includes computer instructions; when the computer instructions are executed on an electronic device, the electronic device is caused to execute part or all of the method described in the first aspect or the second aspect.
  • an embodiment of the present application provides a computer program product; when the computer program product runs on a computer, the computer is caused to execute part or all of the method described in the first aspect or the second aspect.
  • a terminal device including a module for performing the method in the first aspect.
  • a server comprising means for performing the method of the second aspect.
  • Figure 1a is a schematic diagram of the effect of no virtual and real occlusion
  • Figure 1b is a schematic diagram of the effect of virtual and real occlusion
  • FIG. 1c provides a schematic diagram of a system architecture according to an embodiment of the present application.
  • Figure 1d is a schematic structural diagram of a CNN
  • FIG. 1e is a schematic diagram of a chip hardware structure provided by an embodiment of the application.
  • FIG. 1f provides another schematic diagram of the system architecture according to the embodiment of the present application.
  • FIG. 2 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • FIG. 3A is a schematic diagram of a first depth map, a second depth map, and a target depth map according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the relationship between the base map, the partial map and the processed partial map
  • FIG. 5 is a schematic flowchart of another image processing method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the effect of virtual and real occlusion using an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another terminal device provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a system provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of another system provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of another terminal device provided by an embodiment of the application.
  • FIG. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of another terminal device provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of another server provided by an embodiment of the present application.
  • the terms "first", "second" and the like are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implying the number of the indicated technical features. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more.
  • the pose in this application refers to a pose with 6 degrees of freedom (DOF), including 3 rotation angles (pitch angle, roll angle and yaw angle) and 3 position-related degrees of freedom.
  • the three position-related degrees of freedom can be collectively referred to as geographic location information, referred to as location for short, and the location can be acquired based on GPS, or based on Beidou or other positioning systems.
  • the three rotation angles can be collectively referred to as orientation information.
  • a neural network can be composed of neural units. A neural unit can refer to an operation unit that takes inputs $x_s$ and an intercept of 1 as inputs, and the output of the operation unit can be: $h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$, where $s = 1, 2, \dots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to perform nonlinear transformation on the features obtained in the neural network, and convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
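  • the neural unit above can be written directly in code; a minimal sketch using the sigmoid activation mentioned in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    """Output of a single neural unit: f(sum_s W_s * x_s + b)."""
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x_s
W = np.array([0.8, 0.1, -0.4])   # weights W_s
b = 0.2                          # bias of the neural unit
print(neural_unit(x, W, b))
```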
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • a deep neural network (DNN) can be understood as a neural network with many hidden layers; there is no special threshold for "many" here, and the commonly mentioned multi-layer neural network and deep neural network are essentially the same thing. Dividing a DNN by the position of its layers, the layers can be classified into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each layer is not complicated.
  • the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as $W_{jk}^{L}$. Note that the input layer has no W parameter.
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • in a convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neurons in the adjacent layer.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way of extracting features independent of location.
  • the convolution kernel can be formalized in the form of a matrix of random size, and the convolution kernel can be learned to obtain reasonable weights during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • an embodiment of the present application provides a system architecture.
  • the data collection device 160 is used to collect training data.
  • the training data in this embodiment of the present application may include image samples, depth map samples, and real depth maps; after collecting the training data, the data collection device 160 stores the training data in the database 130, and the training device 120 trains based on the training data maintained in the database 130 to obtain the depth estimation model 101.
  • the training device 120 processes the image samples and the depth map samples and calculates a loss value from the output predicted depth map, the real depth map, and the loss function, repeating until the calculated loss value converges, thereby completing the training of the depth estimation model 101.
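  • the training procedure described above (compute a loss between the predicted depth map and the real depth map and iterate until the loss converges) can be sketched in PyTorch; the model, data loader and convergence threshold are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn.functional as F

def train_depth_model(model, loader, epochs=50, lr=1e-4, tol=1e-4):
    """loader yields (image_sample, depth_map_sample, real_depth) batches."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    previous_loss = float("inf")
    for epoch in range(epochs):
        running = 0.0
        for image, sparse_depth, real_depth in loader:
            pred = model(image, sparse_depth)
            mask = real_depth > 0              # loss only where ground truth exists
            loss = F.l1_loss(pred[mask], real_depth[mask])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        running /= max(len(loader), 1)
        if abs(previous_loss - running) < tol: # loss value has converged
            break
        previous_loss = running
    return model
```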
  • the depth estimation model 101 can be used to implement the image processing method provided by the embodiments of the present application, that is, the current image, the first depth map and the second depth map are input into the depth estimation model 101 after relevant preprocessing to obtain the target depth map of the current image.
  • the depth estimation model 101 in this embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • the training device 120 may not necessarily train the depth estimation model 101 entirely based on the training data maintained in the database 130, and may also obtain training data from the cloud or elsewhere for model training; the above description should not be construed as a limitation on this embodiment.
  • the depth estimation model 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1c. The execution device 110 can be a terminal, such as a mobile phone, a tablet computer, a notebook computer, an AR/virtual reality (VR) device, or a vehicle-mounted terminal, and it can also be a server or a cloud.
  • the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140, and the input data may include the current image, or the current image and the first depth map.
  • when the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call data, code, etc. in the data storage system 150 for the corresponding processing, and the data and instructions obtained from that processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the target depth map of the current image obtained above, to the client device 140, so as to be provided to the user.
  • the training device 120 can generate a corresponding depth estimation model 101 based on different training data for different goals or different tasks, and the corresponding depth estimation model 101 can be used to achieve the above-mentioned goals or complete the above-mentioned tasks, thereby providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • alternatively, the I/O interface 112 may directly store the input data of the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, in the database 130 as new sample data.
  • FIG. 1c is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110 , and in other cases, the data storage system 150 may also be placed in the execution device 110 .
  • a depth estimation model 101 is obtained by training the training device 120.
  • the depth estimation model 101 may be the neural network in the present application.
  • the neural network in the present application may include a CNN, a deep convolutional neural network (DCNN), and so on.
  • CNN is a common neural network
  • a convolutional neural network is a deep neural network with a convolutional structure; it is a deep learning architecture that learns at multiple levels of abstraction.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network may include an input layer 11, a convolutional layer/pooling layer 12 (where the pooling layer is optional), a neural network layer 13, and an output layer 14.
  • as an example, the convolutional layer/pooling layer 12 may include layers 121-126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the following will take the convolutional layer 121 as an example to introduce the inner working principle of a convolutional layer.
  • the convolution layer 121 may include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is typically moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image, where the depth dimension can be understood as being determined by the "multiple" weight matrices described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image.
  • the multiple weight matrices have the same size (row ⁇ column), and the size of the feature maps extracted from the multiple weight matrices with the same size is also the same, and then the multiple extracted feature maps with the same size are combined to form a convolution operation. output.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input image, so that the convolutional neural network 10 can make correct predictions .
  • the initial convolutional layer (for example, layer 121) often extracts more general features, which can also be called low-level features; as the depth of the convolutional neural network 10 increases, the features extracted by the later convolutional layers (for example, layer 126) become more and more complex, such as high-level semantic features.
  • a pooling layer may follow a convolutional layer: one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
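  • a short PyTorch illustration of the convolution and pooling behavior described above: a 3x3 convolution with stride 1 extracts feature maps, and 2x2 max pooling halves the spatial size (all sizes are illustrative).

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3,
                 stride=1, padding=1)         # weight matrices (convolution kernels)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # max pooling operator

x = torch.randn(1, 3, 64, 64)                 # a 64x64 RGB input
features = conv(x)                            # -> 1 x 8 x 64 x 64
downsampled = pool(features)                  # -> 1 x 8 x 32 x 32
print(features.shape, downsampled.shape)
```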
  • after processing by the convolutional layer/pooling layer 12, the convolutional neural network 10 is still not able to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 12 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (the required class information or other relevant information), the convolutional neural network 10 needs to use the neural network layer 13 to generate one output or a set of outputs of the required number of classes. Therefore, the neural network layer 13 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 1d), and the parameters contained in these hidden layers may be pre-trained based on relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • after the multiple hidden layers in the neural network layer 13, the last layer of the entire convolutional neural network 10 is the output layer 14. The output layer 14 has a loss function similar to categorical cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 10 (propagation from 11 to 14 in FIG. 1d) is completed, back propagation (propagation from 14 to 11 in FIG. 1d) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 10 and the error between the result output by the convolutional neural network 10 through the output layer and the ideal result.
  • the convolutional neural network 10 shown in FIG. 1d is only used as an example of a convolutional neural network.
  • In specific applications, the convolutional neural network can also exist in the form of other network models, for example, a model including only part of the network structure shown in FIG. 1d; for example, the convolutional neural network used in this embodiment of the present application may include only an input layer 11, a convolutional layer/pooling layer 12 and an output layer 14.
  • FIG. 1e is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor 30 .
  • the chip can be set in the execution device 110 as shown in FIG. 1 c to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 1 c to complete the training work of the training device 120 and output the depth map estimation model 101 .
  • the algorithms of each layer in the convolutional neural network shown in Figure 1d can be implemented in the chip shown in Figure 1f.
  • Both the image processing method and the training method of the depth estimation model in the embodiments of the present application can be implemented in the chip as shown in FIG. 1f.
  • the neural network processor 30 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or another processor suitable for large-scale processing.
  • the NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and the main CPU assigns tasks.
  • the core part of the NPU is the operation circuit 303, and the controller 304 controls the operation circuit 303 to extract the data in the memory (weight memory or input memory) and perform operations.
  • TPU is Google's fully customized artificial intelligence accelerator application-specific integrated circuit for machine learning.
  • the arithmetic circuit 303 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 303 is a general-purpose matrix processor.
  • the arithmetic circuit 303 fetches the weight data of the matrix B from the weight memory 302 and buffers it on each PE in the arithmetic circuit 303 .
  • the arithmetic circuit 303 fetches the input data of the matrix A from the input memory 301 , performs matrix operation according to the input data of the matrix A and the weight data of the matrix B, and stores the partial result or the final result of the matrix in the accumulator 308 .
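  • The following is a simplified sketch (Python/NumPy; names such as tiled_matmul are illustrative only) of the kind of tiled matrix multiplication described above, in which partial results are kept in an accumulator, in the spirit of accumulator 308, until the final result is complete:

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 4) -> np.ndarray:
    """Multiply A (m x k) by B (k x n) one k-tile at a time.

    Each iteration mimics the operation circuit consuming a slice of the
    input data (matrix A) and the buffered weight data (matrix B), with
    partial results accumulated until the final result is complete.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    accumulator = np.zeros((m, n), dtype=np.float32)   # plays the role of the accumulator
    for start in range(0, k, tile):
        a_tile = a[:, start:start + tile]              # input data slice
        b_tile = b[start:start + tile, :]              # buffered weight slice
        accumulator += a_tile @ b_tile                 # partial result accumulated
    return accumulator

a = np.random.rand(8, 16).astype(np.float32)
b = np.random.rand(16, 5).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```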
  • the vector calculation unit 307 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector computing unit 307 can be used for network computation of non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization, local response normalization, etc.
  • vector computation unit 307 can store the vector of processed outputs to unified memory 306 .
  • the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 307 generates normalized values, merged values, or both.
  • vector computation unit 307 stores the processed vectors to unified memory 306 .
  • the vector processed by the vector computing unit 307 can be used as the activation input of the arithmetic circuit 303, for example, for use in subsequent layers in the neural network, as shown in FIG. 1d, if the current processing layer is the hidden layer 1 (131), the vector processed by the vector calculation unit 307 can also be used for calculation in the hidden layer 2 (132).
  • Unified memory 306 is used to store input data and output data.
  • the weight data is directly stored in the weight memory 302 through a storage unit access controller (direct memory access controller, DMAC) 305.
  • Input data is also stored in unified memory 306 via the DMAC.
  • the bus interface unit (BIU) 310 is used for interaction between the DMAC and the instruction fetch buffer 309; the bus interface unit 310 is also used by the instruction fetch memory 309 to obtain instructions from the external memory, and by the storage unit access controller 305 to acquire the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC 305 is mainly used to store the input data in the external memory DDR into the unified memory 306, or store the weight data into the weight memory 302, or store the input data into the input memory 301.
  • the instruction fetch memory 309 connected with the controller 304 is used to store the instructions used by the controller 304;
  • the controller 304 is configured to call the instructions cached in the instruction fetch memory 309 to control the operation process of the operation accelerator.
  • the unified memory 306, the input memory 301, the weight memory 302 and the instruction fetch memory 309 are all on-chip memories; the external memory is memory outside the NPU, and the external memory can be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM) or other readable and writable memory.
  • each layer in the convolutional neural network shown in FIG. 1d may be performed by the operation circuit 303 or the vector calculation unit 307 .
  • both the training method of the depth estimation model and the related method of determining the target depth map in the embodiment of the present application may be executed by the operation circuit 303 or the vector calculation unit 307 .
  • the embodiment of the present application provides another system architecture.
  • the system architecture includes a local device 401, a local device 402, the execution device 110 and the data storage system 150 shown in FIG. 1c, wherein the local device 401 and the local device 402 are connected with the execution device 110 through a communication network.
  • the execution device 110 may be implemented by one or more servers.
  • the execution device 110 may be used in conjunction with other computing devices, such as data storage, routers, load balancers and other devices.
  • the execution device 110 may be arranged on one physical site, or distributed across multiple physical sites.
  • the execution device 110 may use the data in the data storage system 150 or call the program code in the data storage system 150 to implement the training method of the depth estimation model in this embodiment of the present application.
  • the execution device 110 may perform the following processes:
  • process multiple image samples in an initial convolutional neural network to obtain multiple first predicted depth maps; obtain a first loss value according to the multiple first predicted depth maps, the real depth maps corresponding to the multiple image samples, and the loss function; adjust the parameters in the initial convolutional neural network according to the first loss value to obtain a first convolutional neural network; process the multiple image samples in the first convolutional neural network to obtain multiple second predicted depth maps; obtain a second loss value according to the multiple second predicted depth maps, the real depth maps corresponding to the multiple image samples, and the loss function; determine whether the second loss value converges; if it converges, determine the first convolutional neural network as the depth estimation model of the current image; if it does not converge, adjust the parameters in the first convolutional neural network according to the second loss value to obtain a second convolutional neural network, and repeat the above process until the obtained loss value converges; the convolutional neural network obtained when the loss value converges is determined as the depth estimation model of the current image.
  • a depth estimation model can be obtained, and the depth estimation model can be used to obtain the target depth map of the current image.
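  • As a rough illustration of the training loop described above (not the exact procedure of the embodiment), the following sketch uses PyTorch with toy data, an assumed 4-channel input (image spliced with a depth channel), an L1 loss standing in for the loss function, and a simple convergence test on the loss value:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the image samples and their real depth maps.
images = torch.rand(16, 4, 64, 64)        # current image (3 ch) spliced with a depth channel
gt_depths = torch.rand(16, 1, 64, 64)

model = nn.Sequential(                    # stands in for the initial convolutional neural network
    nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()

prev_loss, tol = None, 1e-4
for step in range(1000):
    pred_depths = model(images)                   # predicted depth maps
    loss = loss_fn(pred_depths, gt_depths)        # loss between prediction and real depth maps
    optimizer.zero_grad()
    loss.backward()                               # back propagation
    optimizer.step()                              # adjust the parameters
    if prev_loss is not None and abs(prev_loss - loss.item()) < tol:
        break                                     # the loss has converged; keep this network
    prev_loss = loss.item()

depth_estimation_model = model
```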
  • a user may operate respective user devices (eg, local device 401 and local device 402 ) to interact with execution device 110 .
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, etc.
  • Each user's local device can interact with the execution device 410 through a communication network of any communication mechanism/communication standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • the local device 401 and the local device 402 obtain the depth estimation model from the execution device 110, deploy the depth estimation model on the local device 401 and the local device 402, and use the depth estimation model to perform depth estimation.
  • the depth estimation model may be directly deployed on the execution device 110.
  • In another implementation, the execution device 410 obtains the current image and the first depth map from the local device 401 and the local device 402, and uses the depth estimation model to perform depth estimation on the current image and the first depth map to obtain the target depth map of the current image.
  • The above execution device 110 may also be a cloud device; in this case, the execution device 110 may be deployed in the cloud. Alternatively, the above execution device 110 may be a terminal device; in this case, the execution device 110 may be deployed on the user terminal side. This is not limited in this embodiment of the present application.
  • the application scenario includes a terminal device 100 and a server 200 .
  • the terminal device 100 may be a smart phone, a tablet computer, AR glasses or other smart devices.
  • Server 200 may be a desktop server, a rack server, a blade server, or another type of server.
  • Generally, the terminal device 100 obtains the current image and calculates the first depth map corresponding to the current image according to the current image; it obtains the second depth map corresponding to the current image from the server according to the current image, where the depth information of the feature points in the current image represented by the second depth map is richer than that represented by the first depth map; it obtains the target depth map of the current image according to the current image, the first depth map and the second depth map, where the depth information of the feature points in the current image represented by the target depth map is richer than that represented by the second depth map; and it obtains the target image according to the current image, the target depth map, the virtual object image and the depth map of the virtual object, where the target depth map and the depth map of the virtual object are used to determine the presentation and distribution of the pixels of the current image and the pixels of the virtual object image in the target image, so as to achieve virtual and real occlusion.
  • the following describes in detail how the terminal device 100 and the server 200 implement virtual and real occlusion.
  • FIG. 3 is a schematic flowchart of an image processing method provided by an embodiment of the present application. As shown in Figure 3, the method includes:
  • The depth information of the feature points in the current image represented by the first depth map is not as rich as the depth information of the feature points in the current image represented by the second depth map, and the first depth map includes depth information of feature points that the second depth map does not have.
  • the current image may be an RGB image, a grayscale image, or an image in other forms.
  • the current image is acquired in real time from a camera of the terminal device, or acquired from an image stored in the terminal device, or acquired from other devices, which is not specifically limited herein.
  • obtaining a second depth map corresponding to the current image from the server according to the current image including:
  • Mode 1 Send a first acquisition request to the server to acquire the 3D point cloud corresponding to the local map; receive a first response message sent by the server for responding to the first acquisition request, where the first response message carries the corresponding local map 3D point cloud, the local map includes scene data corresponding to the current image; obtain the second pose of the current image, the second pose of the current image is the pose of the terminal device when the current image is captured; according to the second pose of the current image The 3D point cloud corresponding to the pose and the local map projects the current image to obtain a second depth map corresponding to the current image.
  • In one example, the local map is the area indicated by the first geographic location information in the offline map stored by the server; the first geographic location information is the geographic location in the first pose of the current image, and the first pose of the current image is the pose of the current image in the first coordinate system. The first pose of the current image is obtained by processing the current image based on the VPS technology; the first coordinate system is the coordinate system where the offline map is located.
  • the server can obtain the first pose of the current image in the coordinate system where the offline map is located; the server obtains a partial map from the offline map according to the position in the first pose.
  • the partial map is an area in the offline map centered on that location and within a certain range (for example, a radius of 50 meters); the offline map includes 2D feature points and their corresponding feature descriptors, so the local map also includes 2D feature points and their corresponding feature descriptors.
  • 3D points corresponding to the 2D feature points are also stored; the server obtains the 3D points corresponding to the 2D feature points in the local map, that is, the 3D point cloud corresponding to the local map.
  • the server sends a first response message to the terminal device, where the first response message carries the 3D point cloud corresponding to the local map; the terminal device projects the 3D point cloud corresponding to the local map to the imaging plane of the current image according to the second pose of the current image , to get the second depth map.
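  • The projection step described above can be sketched as a standard pinhole projection; the code below (Python/NumPy, with the function name and the convention that a depth value of 0 means "no observation" being assumptions for illustration) projects the 3D point cloud of the local map onto the imaging plane of the current image to form a sparse depth map:

```python
import numpy as np

def project_point_cloud(points_world, R, t, K, h, w):
    """Project a 3D point cloud onto the current image's imaging plane.

    points_world: (N, 3) points of the local map in the map coordinate system.
    R, t: rotation and translation taking map coordinates to camera coordinates
          (derived from the pose of the current image).
    K: 3x3 camera intrinsic matrix.  Returns a sparse (h, w) depth map.
    """
    depth = np.zeros((h, w), dtype=np.float32)         # 0 means "no depth observed"
    pts_cam = points_world @ R.T + t                   # transform into the camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]               # keep points in front of the camera
    uv = pts_cam @ K.T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    z = pts_cam[:, 2]
    for ui, vi, zi in zip(u, v, z):
        if 0 <= ui < w and 0 <= vi < h:
            # keep the nearest point if several 3D points fall on the same pixel
            if depth[vi, ui] == 0 or zi < depth[vi, ui]:
                depth[vi, ui] = zi
    return depth
```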
  • an offline map is stored in the server, and the offline map includes a panoramic map, the 2D feature points in the panoramic map, and the 3D points corresponding to the 2D feature points, where the panoramic map is obtained based on multi-frame basic maps.
  • the 3D point cloud in the offline map is dense.
  • the offline map is collected by a calibrated laser + panoramic camera device.
  • the 3D points in the offline map are from the laser point cloud, and the 2D feature points in the offline map are from the panoramic image.
  • the feature point in this application refers to a point where the gray value of the image changes drastically or a point with a large curvature on the edge of the image (ie, the intersection of two edges).
  • obtaining a second depth map corresponding to the current image from the server according to the current image including:
  • Mode 2: send a first acquisition request to the server to acquire the 3D point cloud corresponding to the local map; receive a first response message sent by the server for responding to the first acquisition request, where the first response message carries the 3D point cloud corresponding to the local map, the local map is the area indicated by the first position in the offline map stored by the server, the first position is the position in the first pose of the current image, the first pose of the current image is the pose of the current image in the first coordinate system and is obtained based on the current image, and the first coordinate system is the coordinate system where the offline map is located; perform feature extraction on the current image to obtain the first 2D feature points of the current image; match the first 2D feature points of the current image with the eighth 2D feature points in multiple historical images to obtain the second 2D feature points of the current image, where a second 2D feature point is a 2D feature point in the first 2D feature points of the current image that matches an eighth 2D feature point in the historical images; and, according to the second pose of the current image, the 3D point cloud corresponding to the local map and the 3D points corresponding to the second 2D feature points, project the current image to obtain the second depth map corresponding to the current image.
  • the terminal device matches the first 2D feature point of the current image with the eighth 2D feature point of multiple historical images to obtain a second 2D feature point of the current image.
  • The second 2D feature points are the 2D feature points in the first 2D feature points of the current image that match the eighth 2D feature points in the historical images. The terminal device determines the relative pose between the second pose of the current image and the second pose of the historical image; based on the relative pose, it triangulates the second 2D feature point of the current image and the eighth 2D feature point in the historical image that matches it to obtain the 3D point corresponding to the second 2D feature point of the current image; and, according to the second pose of the current image, it projects the 3D point cloud corresponding to the local map and the 3D points corresponding to the second 2D feature points onto the imaging plane of the current image to obtain the second depth map.
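  • The triangulation step mentioned above can be illustrated with a linear (DLT) triangulation of one matched pair of 2D feature points; the sketch below assumes that the two 3x4 projection matrices have already been built from the poses of the current image and the historical image (their relative pose is implied):

```python
import numpy as np

def triangulate(p1, p2, P1, P2):
    """Triangulate one pair of matched 2D feature points by linear DLT.

    p1, p2: matched pixel coordinates (u, v) in the current image and in a
            historical image.  P1, P2: the 3x4 projection matrices K[R|t]
            built from the two poses.
    Returns the corresponding 3D point in the common coordinate system.
    """
    A = np.vstack([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]      # dehomogenize to get the 3D point
```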
  • The first 2D feature points of the current image and the eighth 2D feature points in the historical images may be ORB feature points or AKAZE feature points, may be ASLFeat feature points or SuperPoint feature points based on deep learning obtained by using knowledge distillation or network search, or may be DOG feature points, HOG feature points, BRIEF feature points, BRISK feature points or FREAK feature points.
  • The ORB method, or an improved method based on ORB, is used to extract features from texture-rich areas of the image; the features extracted by this method are ORB feature points, and their number is generally several hundred.
  • the AKAZE feature points can be obtained by extracting the features of the image using the AKAZE feature extraction method.
  • Other feature points in the present application can be obtained by using corresponding methods to perform feature extraction on images, and will not be described here.
  • The above-mentioned historical images are images uploaded to the server by the terminal device before the current image. Optionally, the first response message also carries the first pose of the current image; similarly, the terminal device can also obtain the first poses of the multiple historical images. In this case, when determining the 3D point corresponding to the second 2D feature point of the current image, the terminal device determines the relative pose between the first pose of the current image and the first pose of the historical image; based on the relative pose, the second 2D feature point of the current image and the eighth 2D feature point in the historical image that matches it are triangulated to obtain the 3D point corresponding to the second 2D feature point of the current image.
  • When the second depth map is determined in this way, the 3D points corresponding to the second 2D feature points of the current image are introduced, so more 3D points are introduced than in the first mode. Therefore, the depth information of the current image represented by the second depth map obtained in the second mode is richer than the depth information of the current image represented by the second depth map obtained in the first mode.
  • The matching of 2D feature points in this application specifically means that the similarity of the two matched 2D feature points is higher than a preset similarity, or that the Euclidean distance between the two matched 2D feature points is lower than a preset distance.
  • The 3D point cloud corresponding to the local map obtained by the terminal device may be obtained by the server according to the first mode, or may be the updated 3D point cloud corresponding to the local map obtained after the server updates the 3D point cloud obtained according to the first mode; for the specific update process, please refer to the relevant description of the server-side embodiment.
  • obtaining the second depth map from the server according to the current image including:
  • Mode 3: send a second acquisition request to the server, where the second acquisition request carries the current image and instructs the server to obtain the second depth map according to the current image and the offline map stored by the server; receive a second response message sent by the server for responding to the second acquisition request, where the second response message carries the second depth map.
  • the second depth map is obtained by calculation by the server, which does not require the terminal device to consume computing resources for calculation, thereby reducing resource consumption and power consumption of the terminal device.
  • In a possible implementation, calculating the first depth map corresponding to the current image according to the current image includes: obtaining the first pose of the current image (the first pose is the pose of the current image in the first coordinate system, and the second pose of the current image is the pose of the terminal device when the current image is captured), and processing the current image according to the first pose to obtain the first image; performing feature extraction on the first image to obtain the third 2D feature points; matching the third 2D feature points with pre-stored 2D feature points to obtain the fourth 2D feature points of the first image, where the fourth 2D feature points are the 2D feature points in the third 2D feature points that match the 2D feature points pre-stored from multiple historical images; obtaining the 3D points corresponding to the fourth 2D feature points according to the fourth 2D feature points, the matching pre-stored 2D feature points and the first poses of the images to which those points belong; and projecting the current image according to the first pose of the current image and the 3D points corresponding to the fourth 2D feature points to obtain the first depth map.
  • the pre-stored 2D feature points may be 2D feature points in at least one historical image.
  • the at least one historical image includes one or more images whose timestamps are close to the timestamp of the current image, or one or more images whose timestamps precede the timestamp of the current image.
  • the multiple images may be consecutive frame images or non-consecutive frame images.
  • the time stamp of the image may be the time when the image was collected, and certainly may be other times, which are not limited herein.
  • Specifically, the third 2D feature points are matched with the pre-stored 2D feature points to obtain the fourth 2D feature points of the first image; the fourth 2D feature points are the 2D feature points in the third 2D feature points that match the pre-stored 2D feature points, and the pre-stored 2D feature points that match them may come from the same historical image or from different historical images. For a fourth 2D feature point and the pre-stored 2D feature point that matches it, first determine the pose of the image to which the matching pre-stored 2D feature point belongs, then determine the relative pose between that pose and the first pose, and finally, based on the relative pose, triangulate the fourth 2D feature point and the matching pre-stored 2D feature point to obtain the 3D point corresponding to the fourth 2D feature point; the 3D points corresponding to the fourth 2D feature points are then projected onto the imaging plane of the current image according to the first pose to obtain the first depth map.
  • the third 2D feature point may be an ORB feature point, an AKAZE feature point, an ASLFeat feature point or a superpoint feature point based on deep learning obtained by using knowledge distillation or network search, or a DOG feature point.
  • In a possible implementation, obtaining the first pose includes: obtaining pose transformation information, where the pose transformation information is used for the transformation between the second pose and the first pose of the current image; and performing pose transformation on the second pose of the current image according to the pose transformation information to obtain the first pose.
  • the pose transformation information is essentially a matrix.
  • It should be noted that the first acquisition request, the second acquisition request, the third acquisition request and the fourth acquisition request may be the same acquisition request, that is, one acquisition request implements some or all of the functions of the first acquisition request, the second acquisition request, the third acquisition request and the fourth acquisition request; similarly, the corresponding response messages may be the same response message.
  • The fact that the depth information of the feature points in the current image represented by the second depth map is richer than the depth information represented by the first depth map is specifically reflected in the following three aspects:
  • the number of pixels in the second depth map is greater than the number of pixels in the first depth map.
  • The first depth map is obtained by the terminal device based on the feature points extracted from the image, while some of the pixels of the second depth map are obtained based on the offline map pre-stored in the server and the other pixels are obtained based on the feature points extracted from the image. The number of 2D feature points extracted by the terminal device is lower than the number of 2D feature points extracted by the server: the former is generally several hundred, while the latter is generally tens of thousands. Since the 3D points correspond one-to-one with the 2D feature points, the number of 3D points obtained by the terminal device based on its extracted 2D feature points is less than the number of 3D points obtained by the server based on its extracted 2D feature points, so the number of pixels in the depth map obtained by projecting the 3D points determined by the server is greater than the number of pixels in the depth map obtained by projecting the 3D points determined by the terminal device; moreover, the second depth map also includes pixels obtained based on the offline map pre-stored in the server. It follows that the number of pixels in the second depth map is greater than the number of pixels in the first depth map.
  • the accuracy of the pixels in the second depth map is higher than that of the pixels in the first depth map.
  • The second depth map is obtained from the offline map pre-stored in the server, and the 3D points in the offline map are determined based on calibrated laser equipment, whereas the first depth map is obtained based on feature point matching between the current image and historical images; therefore, the accuracy of the pixels in the second depth map is higher than that of the pixels in the first depth map.
  • The distribution of the pixels in the second depth map is more uniform than the distribution of the pixels in the first depth map.
  • The pixels in the target depth map can be regarded as the union of the pixels in the first depth map and the pixels in the second depth map, so the number of pixels in the target depth map is not less than the number of pixels in the second depth map. Therefore, the depth information of the feature points in the current image represented by the target depth map is richer than the depth information of the feature points in the current image represented by the second depth map.
  • For any pixel position S that exists in both the first depth map and the second depth map, the pixel value at pixel position S in the target depth map is the maximum of the pixel value at pixel position S in the first depth map and the pixel value at pixel position S in the second depth map; if a pixel value exists at pixel position S in only one of the first depth map and the second depth map, the pixel value at pixel position S in the target depth map is that pixel value.
  • the target depth map representing the depth information of the feature points in the current image is richer than the second depth map representing the depth information of the feature points in the current image.
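  • A minimal sketch of this fusion rule, assuming a depth value of 0 marks a pixel without depth, is shown below; taking the element-wise maximum keeps the larger value where both maps have depth and keeps the only available value elsewhere, so the target depth map is the union of the two sets of pixels:

```python
import numpy as np

def fuse_depth_maps(d1: np.ndarray, d2: np.ndarray) -> np.ndarray:
    """Fuse the first and second depth maps into the target depth map.

    A value of 0 marks a pixel with no depth.  Where both maps have a value,
    the larger one is kept; where only one map has a value, that value is
    kept, so the pixels of the result are the union of the pixels of d1 and d2.
    """
    return np.maximum(d1, d2)

d1 = np.array([[1.2, 0.0], [0.0, 3.0]])
d2 = np.array([[0.9, 2.5], [0.0, 3.5]])
print(fuse_depth_maps(d1, d2))   # [[1.2, 2.5], [0.0, 3.5]]
```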
  • picture a is the first depth map
  • picture b is the second depth map
  • picture c is the target depth map
  • the four-pointed star in the picture represents the pixels of the depth map
  • the gray four-pointed stars in picture a do not exist at the corresponding positions in picture b; it can be seen that the first depth map contains depth information of feature points that the second depth map does not have.
  • Picture c can be seen as a superposition of picture a and picture b, that is to say, the number of pixels in the target depth map is greater than the number of pixels in the second depth map.
  • obtaining the target depth map of the current image according to the current image, the first depth map and the second depth map is specifically:
  • send a fifth acquisition request to the server, where the fifth acquisition request carries the geographic location information of the terminal device; receive a fifth response message sent by the server for responding to the fifth acquisition request, where the fifth response message carries the depth estimation model;
  • the depth estimation model is the neural network model corresponding to the geographic location information of the terminal;
  • the first depth map and the second depth map are spliced to obtain the third depth map;
  • the current image and the third depth map are input into the depth estimation model to obtain the target depth map.
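  • A possible sketch of this step (PyTorch; the splicing is assumed here to be channel-wise concatenation, and the downloaded depth estimation model is assumed to accept the resulting 5-channel input) is as follows:

```python
import torch

def estimate_target_depth(model, current_image, first_depth, second_depth):
    """Run the downloaded depth estimation model on the current image.

    current_image: (3, H, W) tensor; first_depth/second_depth: (1, H, W) tensors.
    Splicing is assumed to mean channel-wise concatenation of the two depth
    maps; the spliced third depth map and the image are then fed to the model.
    """
    third_depth = torch.cat([first_depth, second_depth], dim=0)      # (2, H, W)
    model_input = torch.cat([current_image, third_depth], dim=0)     # (5, H, W)
    with torch.no_grad():
        target_depth = model(model_input.unsqueeze(0))               # add batch dimension
    return target_depth.squeeze(0)
```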
  • In this implementation, the server determines a depth estimation model for each of several different positions; when the depth estimation model needs to be used later, the corresponding depth estimation model is determined based on the position of the terminal device. Obtaining the target depth map based on the depth estimation model further increases the density of the pixels in the target depth map of the current image, that is, it enriches the depth information of the feature points in the current image represented by the target depth map.
  • the above depth estimation model is implemented based on a neural network, such as a convolutional neural network, which is not limited here.
  • the server determines a depth estimation model for different positions.
  • the depth estimation model can be trained by the server, or obtained by the server from the training device after being trained by other training devices.
  • the terminal device can obtain the depth estimation model without training, which reduces the power consumption of the terminal device and improves the real-time performance of virtual and real occlusion.
  • obtaining the target depth map of the current image according to the current image, the first depth map and the second depth map is specifically:
  • the initial convolutional neural network is trained to obtain a depth estimation model, including:
  • the depth estimation model may adopt a network structure such as DiverseDepth, SARPN, or CSPN.
  • the feature extraction function in the above-mentioned depth estimation model can be realized by network structures such as VGGNet, ResNet, ResNeXt, DenseNet, etc.
  • VGGNet: it uses 3*3 convolution kernels and 2*2 pooling kernels throughout, and improves performance by continuously deepening the network structure.
  • the input is an RGB image of size 224*224
  • the average of the three channels is calculated during preprocessing and subtracted from each pixel (after this processing, training requires fewer iterations and converges faster).
  • the image is processed by a series of convolutional layers, and a very small 3*3 convolution kernel is used in the convolutional layer.
  • the 3*3 convolution kernel is chosen because 3*3 is the smallest size that can capture the information of a pixel's 8-neighborhood.
  • the stride of the convolutional layer is set to 1 pixel, and the padding of the 3*3 convolutional layer is set to 1 pixel.
  • the pooling layers adopt max pooling, with a total of 5 pooling layers placed after some of the convolutional layers; the max-pooling window is 2*2 and the stride is set to 2.
  • the convolutional layer is followed by three fully-connected layers (FC). The first two fully connected layers each have 4096 channels, and the third fully connected layer has 1000 channels for classification.
  • the fully connected layer configuration is the same for all VGG networks; the fully connected layers are followed by Softmax, which is used for classification, and all hidden layers use ReLU as the activation function.
  • ResNet: a residual network, which can be understood as stacking sub-networks (residual blocks) to form a deep network. Residual networks are easy to optimize and can increase accuracy by adding considerable depth; the internal residual blocks use skip connections to alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
  • ResNeXt: based on the idea of ResNet, it proposes a structure that improves accuracy without increasing parameter complexity while also reducing the number of hyperparameters. Its characteristic is to replace the original three-layer convolution block of ResNet with a parallel stack of blocks with the same topology, which improves the accuracy of the model without significantly increasing the number of parameters; since the hyperparameters are also reduced, the model is convenient to transplant, and ResNeXt has become a popular framework for recognition tasks.
  • DenseNet (Densely Connected Network): in a traditional convolutional network, each layer uses only the output features of the previous layer as its input, whereas each layer in DenseNet uses the features of all previous layers as its input, and its own features are used as input for all subsequent layers. DenseNet has the following advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and greatly reduces the number of parameters.
  • In a possible implementation, obtaining the target depth map of the current image according to the current image, the first depth map and the second depth map includes: performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing multi-scale feature extraction on the third depth map to obtain T second feature maps, where the resolutions of the first feature maps in the T first feature maps are different from each other, the resolutions of the second feature maps in the T second feature maps are different from each other, and T is an integer greater than 1; among the T first feature maps and the T second feature maps, superimposing the first feature map and the second feature map with the same resolution to obtain T third feature maps; and performing upsampling and fusion processing on the T third feature maps to obtain the target depth map of the current image.
  • The third depth map is obtained by splicing the first depth map and the second depth map.
  • For any first feature map in the T first feature maps, there is a unique second feature map in the T second feature maps with the same resolution as that first feature map.
  • the multi-scale feature extraction in this application specifically refers to the operation of using multiple different convolution kernels to convolve an image.
  • “superimposition” specifically refers to processing the images to be superimposed at the pixel level, for example, the size of the superimposed two images is H*W, and the size of the superimposed image is H*2W, or 2H*W; For another example, the size of the three superimposed images is H*W, and the size of the superimposed image is H*3W, or 3H*W.
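  • The following toy sketch (PyTorch, T = 3, with all layer sizes chosen arbitrarily) illustrates multi-scale feature extraction on the current image and the third depth map, followed by superimposition of the feature maps that share a resolution, implemented here as width-wise concatenation in line with the H*2W description above:

```python
import torch
import torch.nn as nn

class TwoStreamPyramid(nn.Module):
    """Toy two-branch multi-scale extractor (T = 3 scales).

    One branch processes the current image, the other the third depth map;
    feature maps of the same resolution are then superimposed (concatenated
    along the width, giving H*2W) to form the T third feature maps.
    """
    def __init__(self, T: int = 3):
        super().__init__()
        self.img_convs = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else 8, 8, 3, stride=2, padding=1) for i in range(T)])
        self.dep_convs = nn.ModuleList(
            [nn.Conv2d(1 if i == 0 else 8, 8, 3, stride=2, padding=1) for i in range(T)])

    def forward(self, image, depth):
        img_feats, dep_feats = [], []
        x, y = image, depth
        for ci, cd in zip(self.img_convs, self.dep_convs):
            x, y = torch.relu(ci(x)), torch.relu(cd(y))   # each step halves the resolution
            img_feats.append(x)
            dep_feats.append(y)
        # superimpose the first/second feature maps that share a resolution
        return [torch.cat([f1, f2], dim=-1) for f1, f2 in zip(img_feats, dep_feats)]

net = TwoStreamPyramid()
fused = net(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64))
print([f.shape for f in fused])   # three third feature maps at decreasing resolutions
```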
  • obtain the target depth map of the current image according to the current image, the first depth map and the second depth map including:
  • Performing multi-scale feature extraction on the current image to obtain T first feature maps, performing multi-scale feature extraction on the third depth map to obtain T second feature maps, and performing multi-scale feature extraction on the reference depth map to obtain T fourth feature maps, where the resolutions of the first feature maps in the T first feature maps are different from each other, the resolutions of the second feature maps in the T second feature maps are different from each other, the resolutions of the fourth feature maps in the T fourth feature maps are different from each other, the reference depth map is obtained from the depth map collected by the TOF camera, and T is an integer greater than 1; among the T first feature maps, the T second feature maps and the T fourth feature maps, superimposing the first feature map, the second feature map and the fourth feature map with the same resolution to obtain T fifth feature maps; and performing upsampling and fusion processing on the T fifth feature maps to obtain the target depth map of the current image, where the third depth map is obtained by splicing the first depth map and the second depth map.
  • For any first feature map in the T first feature maps, there is a unique second feature map in the T second feature maps with the same resolution as that first feature map, and a unique fourth feature map in the T fourth feature maps with the same resolution as that first feature map.
  • In an example, the above-mentioned reference depth map is the depth map collected by the above-mentioned TOF camera; in another example, the reference depth map is obtained by back-projecting the fourth depth map (for example, the depth map collected by the TOF camera) onto the reference image, where the reference image is an image adjacent to the current image in acquisition time. The resolution of the depth map collected by the TOF camera is lower than a preset resolution, and the frame rate at which the TOF camera collects depth maps is lower than a preset frame rate.
  • the preset frame rate may be 1fps, 2fps, 5fps or other frame rates
  • the preset resolution may be 240*180, 120*90, 60*45, 20*15, or other resolutions.
  • the TOF camera acquires a depth map at a frame rate of 1 fps, and the resolution of the depth map is 20*15.
  • the above-mentioned upsampling and fusion processing specifically includes:
  • the above-mentioned processing objects include the above-mentioned T third feature maps or T fifth feature maps.
  • the above-mentioned upsampling is deconvolution upsampling.
  • the target depth map and the depth map of the virtual object are used to determine the representation and distribution of the pixels of the virtual object image and the pixels of the current image in the target image.
  • the virtual object image may be obtained by projecting a three-dimensional model of the virtual object by a renderer in the terminal device; or obtained from other devices.
  • For any pixel point P in the target image: if the first depth value is greater than the second depth value, the pixel value of the pixel point P is the pixel value of the pixel point corresponding to P in the virtual object image; if the first depth value is not greater than the second depth value, the pixel value of the pixel point P is the pixel value of the pixel point corresponding to P in the current image; where the first depth value and the second depth value are the depth values corresponding to the pixel point P in the target depth map and in the depth map of the virtual object, respectively.
  • the pixel values of the pixel points in the target image are determined one by one, thereby obtaining the target image.
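  • A minimal sketch of this per-pixel selection rule (Python/NumPy; the convention that pixels without virtual content carry a very large virtual depth is an assumption for illustration) is shown below:

```python
import numpy as np

def compose_target_image(current_image, virtual_image, scene_depth, virtual_depth):
    """Per-pixel virtual/real occlusion.

    scene_depth: target depth map of the current image (real scene depth).
    virtual_depth: depth map of the rendered virtual object (a very large
    value where no virtual content exists).  Where the real depth is greater
    than the virtual depth, the virtual object is closer to the camera, so
    its pixel is shown; otherwise the real pixel is kept.
    """
    show_virtual = scene_depth > virtual_depth                         # (H, W) boolean mask
    return np.where(show_virtual[..., None], virtual_image, current_image)
```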
  • edge optimization is performed on the target depth map according to the current image to obtain an optimized depth map;
  • the accuracy of the optimized depth map is higher than that of the target depth map;
  • the target image is then determined according to the above method, wherein the first depth value is the depth value corresponding to the pixel point P in the target image in the optimized depth map.
  • the target image is obtained according to the target depth map, the depth map of the virtual object, the current image and the virtual object image, including:
  • the background depth map is the depth map containing the background area in the target depth map
  • the foreground depth map is the depth map containing the foreground area in the target depth map
  • the L background depth maps are fused according to the L poses corresponding to the L background depth maps to obtain a fused three-dimensional scene
  • the L background depth maps include the background depth maps of the pre-stored images and the background depth map of the current image, and the L poses include the first poses of the pre-stored images and the first pose of the current image;
  • L is an integer greater than 1;
  • the above-mentioned pre-stored image is received by the server before the current image.
  • superimposing and displaying the virtual object image and the current image according to the target depth map of the current image including:
  • Specifically: edge optimization is performed on the target depth map of the current image according to the current image to obtain the optimized depth map, where the accuracy of the optimized depth map is higher than that of the target depth map; the optimized depth map is segmented to obtain the foreground depth map and the background depth map of the current image, where the background depth map is the depth map of the background area in the optimized depth map and the foreground depth map is the depth map of the foreground area in the optimized depth map; the L background depth maps are fused according to the L poses corresponding to the L background depth maps to obtain the fused three-dimensional scene, where the L background depth maps include the background depth maps of the pre-stored images and the background depth map of the current image, the L poses include the first poses of the pre-stored images and the first pose of the current image, and L is an integer greater than 1; the fused three-dimensional scene is back-projected according to the first pose of the current image to obtain a fused background depth map; the fused background depth map and the foreground depth map of the current image are spliced to obtain an updated depth map; and the target image is obtained according to the updated depth map, the depth map of the virtual object, the current image and the virtual object image.
  • the foreground area refers to the area where the object of interest is located, such as people, cars, animals and plants and other salient objects; the background area is the area in the image except the foreground area.
  • the current image is detected, such as person detection, vehicle detection, animal and plant detection, etc., to determine people, cars, animals and plants or other objects in the current image.
  • For example, performing portrait segmentation on the optimized depth map specifically means performing portrait segmentation on the optimized depth map of the current image according to a portrait mask to obtain the foreground depth map and the background depth map of the current image, where the foreground depth map is the depth map that contains the portrait.
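  • A simplified sketch of the segmentation and splicing steps (Python/NumPy; the portrait mask and the fused background depth map are assumed to be available from the preceding steps) might look as follows:

```python
import numpy as np

def split_and_splice(optimized_depth, foreground_mask, fused_background_depth):
    """Sketch of segmenting the optimized depth map and splicing the result.

    foreground_mask: boolean (H, W) mask of the detected foreground (for
    example, a portrait mask).  fused_background_depth is assumed to be the
    depth map obtained by back-projecting the fused 3D scene into the current
    view.  The foreground depth of the current frame is kept, and the
    background is replaced by the temporally fused background, giving the
    updated depth map.
    """
    foreground_depth = np.where(foreground_mask, optimized_depth, 0.0)
    updated_depth = np.where(foreground_mask, foreground_depth, fused_background_depth)
    return updated_depth
```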
  • In an example, the current image includes a target object; detection of the target object is performed to obtain a detection result, and the optimized depth map is then segmented based on the detection result to obtain the foreground depth map and the background depth map of the current image.
  • the above-mentioned pose may be obtained according to a corresponding image, or may be a SLAM pose, may be a pose obtained by a deep learning method, or may be a pose obtained by other methods, which is not limited herein.
  • the specific fusion method used for the above-mentioned fusion may be a truncated signed distance function (TSDF) fusion method, or a surfel fusion method.
  • edge optimization is performed on the target depth map according to the current image to obtain an optimized depth map, including:
  • the offset map of the target depth map is obtained according to the current image and the target depth map.
  • the current image and the target depth map can be processed through a neural network to obtain the offset map of the target depth map;
  • the offset map of the target depth map is superimposed on the corresponding pixels of the target depth map to obtain a depth map with sharper edges, and this depth map is the above-mentioned optimized depth map.
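  • A possible sketch of this edge optimization step (PyTorch; refine_net is a hypothetical offset-prediction network, not the specific network of the embodiment) is:

```python
import torch

def optimize_edges(refine_net, current_image, target_depth):
    """Sketch of the edge optimization step.

    current_image: (N, 3, H, W); target_depth: (N, 1, H, W).  refine_net is
    assumed to take the image and the depth map as input and predict a
    per-pixel offset map, which is then superimposed on the corresponding
    pixels of the target depth map to give the optimized depth map.
    """
    x = torch.cat([current_image, target_depth], dim=1)   # (N, 4, H, W)
    offset_map = refine_net(x)                            # (N, 1, H, W) per-pixel offsets
    optimized_depth = target_depth + offset_map           # sharper edges around objects
    return optimized_depth
```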
  • the second depth map is obtained by projecting the 3D point cloud obtained from the offline map stored on the server.
  • Since the offline map includes the 2D feature points of the multi-frame basic maps and the 3D points corresponding to those 2D feature points, and the basic maps come from different frames, the 3D point cloud in the present application is obtained based on multiple frames of images; the second depth map obtained by projecting that 3D point cloud can also be regarded as being based on multiple frames of images, and the target depth map obtained based on the current image, the first depth map and the second depth map can be regarded as being obtained by multi-view depth estimation. Therefore, the virtual-real occlusion effect of the target image obtained based on the target depth map, the current image, the depth map of the virtual object and the virtual object image does not suffer from scale discrepancy or inconsistency.
  • In addition, a target depth map with richer depth information of the feature points in the current image is obtained, so that the virtual-real occlusion effect of the target image does not suffer from inter-frame instability; the depth map collected by the TOF camera is introduced into the depth estimation, which further enriches the depth information of the current image represented by the target depth map; and the optimized depth map is obtained by performing edge optimization on the target depth map of the current image and then fusing the depth maps of multiple frames, giving a depth map with sharper edges, which helps to further improve the virtual-real occlusion effect.
  • Further, the method of the present application is executed repeatedly, so acquiring the 3D point cloud corresponding to the local map can be regarded as traversing the 3D points corresponding to the 2D feature points of the panoramic map in the offline map. Since the 2D feature points in the offline map are the 2D feature points of the multi-frame basic maps that constitute the panoramic map, the 3D point clouds of the local maps obtained while repeatedly executing the method come from the 3D points corresponding to the 2D feature points of different basic maps; the second depth map is obtained by projecting the 3D points corresponding to the local map, and the target depth map is then obtained based on the current image, the first depth map and the second depth map. It can be seen that the target depth map is based on multi-view depth estimation, so performing virtual-real occlusion based on the target depth map solves the scale discrepancy and inconsistency problems of monocular depth estimation.
  • FIG. 5 is a schematic flowchart of another image processing method provided by an embodiment of the present application. As shown in Figure 5, the method includes:
  • S501 Receive a depth estimation model request message sent by a terminal device, where the request message carries geographic location information of the terminal device.
  • the geographic location information of the terminal device is the geographic location information when the terminal device collects the current image.
  • the above-mentioned geographic location information is a coordinate in a world coordinate system, and the world coordinate system may be a UTM coordinate system, a GPS coordinate system, or other world coordinate systems.
  • the depth estimation model of the current image is the depth estimation model corresponding to the terminal device among the depth estimation models stored in the server, and the depth estimation models stored in the server are in one-to-one correspondence with the geographic location information.
  • a depth estimation model is separately trained for each geographic location information stored in the server.
  • the method of this embodiment further includes:
  • Depth estimation models corresponding to multiple pieces of geographic location information are obtained by training; for any geographic location information S among the multiple locations, the following steps are performed to obtain the depth estimation model corresponding to the geographic location information S: process multiple image samples in an initial convolutional neural network to obtain multiple predicted depth maps, where the multiple image samples are collected by terminal devices at the geographic location indicated by the geographic location information S; calculate a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples, and the loss function; and adjust the parameters in the initial convolutional neural network according to the loss value to obtain the depth estimation model corresponding to the geographic location information S. The loss function is determined based on the error between the predicted depth map and the real depth map, the error between the gradient of the predicted depth map and the gradient of the real depth map, and the error between the normal vector of the predicted depth map and the normal vector of the real depth map.
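  • A rough sketch of such a loss function (PyTorch; the weights, the norms and the way normal vectors are derived from depth gradients are illustrative assumptions, not the exact formulation of the embodiment) is:

```python
import torch
import torch.nn.functional as F

def depth_loss(pred, gt, w_grad=1.0, w_normal=1.0):
    """Loss combining the three error terms described above.

    pred, gt: (N, 1, H, W) predicted and real depth maps.
    """
    l_depth = (pred - gt).abs().mean()                         # depth error

    def gradients(d):
        dx = d[..., :, 1:] - d[..., :, :-1]                    # horizontal gradient
        dy = d[..., 1:, :] - d[..., :-1, :]                    # vertical gradient
        return dx, dy

    pdx, pdy = gradients(pred)
    gdx, gdy = gradients(gt)
    l_grad = (pdx - gdx).abs().mean() + (pdy - gdy).abs().mean()   # gradient error

    # Surface normals built from the depth gradients: n = (-dx, -dy, 1), normalized.
    def normals(dx, dy):
        dx, dy = dx[..., :-1, :], dy[..., :, :-1]              # crop to a common size
        n = torch.cat([-dx, -dy, torch.ones_like(dx)], dim=1)
        return F.normalize(n, dim=1)

    l_normal = (1 - (normals(pdx, pdy) * normals(gdx, gdy)).sum(dim=1)).mean()  # normal error
    return l_depth + w_grad * l_grad + w_normal * l_normal
```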
  • the depth estimation model request message may be regarded as the fifth acquisition request in the embodiment corresponding to FIG. 3, and the response message responding to the depth estimation model request message may be regarded as the fifth response message.
  • the method of this embodiment further includes:
  • Based on the offline map stored in the server and the 3D points corresponding to the 2D feature points in the offline map, the server obtains a local map from the offline map according to the position in the first pose of the current image; the local map is the area in the offline map indicated by that position. In an example, the local map is an area within a certain range (for example, a radius of 50 meters) centered on that position in the offline map. The local map includes multiple 2D feature points; based on the correspondence between 2D feature points and 3D points, the server obtains the 3D points corresponding to the 2D feature points in the local map, and these 3D points are the 3D point cloud corresponding to the local map.
  • The first pose of the current image is obtained by the server based on the VPS technology according to the current image; the first pose is the pose in the coordinate system where the offline map is located, that is, the first pose of the current image and the coordinate system of the offline map are unified.
  • the first response message also carries the first pose of the current image.
  • the method of this embodiment further includes:
  • the second depth map is determined in the server, and then the second depth map is sent to the terminal device, so that the terminal device does not need to calculate the second depth map, thereby reducing the resource overhead and power consumption of the terminal device.
  • In a possible implementation, determining the second depth map according to the current image and the offline map includes: obtaining the first pose of the current image, where the first pose of the current image is the pose of the current image in the first coordinate system, and the first coordinate system is the coordinate system where the offline map is located; obtaining the 3D point cloud corresponding to the local map from the 3D point cloud corresponding to the offline map according to the position in the first pose, where the local map is the area indicated by that position in the offline map stored by the server; and projecting the current image according to the first pose of the current image and the 3D point cloud corresponding to the local map to obtain the second depth map.
  • projecting the current image according to the first pose of the current image and the 3D point cloud corresponding to the local map to obtain the second depth map specifically refers to: projecting the 3D point cloud corresponding to the local map according to the first pose of the current image Projected onto the imaging plane of the current image to obtain a second depth map.
  • In a possible implementation, determining the second depth map according to the current image and the offline map further includes: performing feature extraction on the current image to obtain the ninth 2D feature points of the current image; matching the ninth 2D feature points with the tenth 2D feature points of multiple historical images to obtain the eleventh 2D feature points of the current image, where the eleventh 2D feature points are the 2D feature points in the ninth 2D feature points of the current image that match the tenth 2D feature points in the historical images, and the tenth 2D feature points are, for example, SIFT feature points; obtaining the 3D points corresponding to the eleventh 2D feature points; and projecting the current image according to the first pose of the current image, the 3D point cloud corresponding to the local map, and the 3D points corresponding to the eleventh 2D feature points of the current image to obtain the second depth map.
  • Specifically, the relative pose between the first pose of the current image and the first pose of the historical image is determined; based on the relative pose, the eleventh 2D feature point of the current image is triangulated with the 2D feature point in the historical image that matches it to obtain the 3D point corresponding to the eleventh 2D feature point of the current image; and, according to the first pose of the current image, the 3D point cloud corresponding to the local map and the 3D points corresponding to the eleventh 2D feature points of the current image are projected onto the imaging plane of the current image to obtain the second depth map.
  • In an example, the ninth 2D feature points and the tenth 2D feature points may be SIFT feature points or SURF feature points, or may be SuperPoint feature points, ASLFeat feature points, R2D2 feature points or D2Net feature points obtained based on deep learning.
  • Since the server has strong computing power, when it extracts features from an image it can extract feature points from both texture-rich areas and texture-weak areas. The feature extraction method can be SIFT or an improved method based on SIFT, and the extracted feature points can be called SIFT feature points, whose number is about 10,000; the feature extraction method may also be SURF or an improved method based on SURF, and the extracted feature points may be referred to as SURF feature points.
  • In this implementation, the 3D points obtained based on the ninth 2D feature points of the current image and the tenth 2D feature points of the historical images are introduced, and these 3D points are combined with the 3D point cloud corresponding to the local map for projection to obtain the second depth map, which further enriches the depth information of the current image represented by the second depth map, and thereby enriches the depth information of the current image represented by the target depth map.
  • obtaining a second depth map corresponding to the current image according to the current image and the offline map further comprising:
  • the reference map is obtained from the local map according to the orientation information in the first pose of the current image, and the reference map is the area indicated by the orientation information of the first pose in the local map;
  • the 2D feature points in the reference map are matched with the ninth 2D feature points of the current image to obtain the sixth 2D feature points of the reference map, and the sixth 2D feature points of the reference map are the sixth 2D feature points of the above-mentioned local map.
  • For example, the square area is the offline map, the circular area is the local map, the center point of the circular area is the location of the terminal device in the offline map, and the fan-shaped area with an angle range of [-45°, 45°] is the reference map, where [-45°, 45°] is the yaw angle range in the orientation information of the first pose of the terminal device.
• the relative pose between the first pose of the current image and the pose of the local map is determined; then, based on the relative pose, the sixth 2D feature point of the local map and the 2D feature point in the current image that matches the sixth 2D feature point are triangulated to obtain the 3D point corresponding to the sixth 2D feature point; according to the first pose of the current image, the 3D point cloud corresponding to the local map, the 3D point corresponding to the eleventh 2D feature point of the current image, and the 3D point corresponding to the sixth 2D feature point are projected onto the imaging plane of the current image to obtain the second depth map.
• the 3D point corresponding to the sixth 2D feature point of the local map is introduced, which further enriches the depth information of the current image represented by the second depth map, and thereby the depth information of the current image represented by the target depth map.
  • the offline map includes a multi-frame base map
  • the method of the present application further includes:
• the eleventh 2D feature point is the 2D feature point, among the ninth 2D feature points of the current image, that matches a 2D feature point in the historical images; according to the eleventh 2D feature point, the first pose of the current image, the 2D feature point among the tenth 2D feature points of the multiple historical images that matches the eleventh 2D feature point, and the first pose of the historical image to which that 2D feature point belongs, the 3D point corresponding to the eleventh 2D feature point is obtained; the processed 3D points and the 3D points corresponding to the eleventh 2D feature points are processed according to the first pose of the current image to obtain the 3D point cloud corresponding to the updated local map;
  • the current image is projected according to the first pose of the current image and the 3D point cloud corresponding to the updated local map to obtain the second depth map, which specifically refers to: according to the first pose of the current image, the updated The 3D point cloud corresponding to the local map is projected onto the imaging plane of the current image to obtain the second depth map.
  • the eleventh 2D feature point may be a SIFT feature point, a SURF feature point, or a SuperPoint feature point obtained based on deep learning, an ASLFeat feature point, an R2D2 feature point or a D2Net feature point.
• since the offline maps in the server are collected offline, they may differ from the current actual environment. For example, a large billboard in a shopping mall existed when the map was collected offline; after a period of time, when the user captures the current image, the billboard has been removed, which results in the map issued by the server containing 3D point cloud information that is inconsistent with the current environment.
  • the image received by the server may be an image after privacy processing, which will also cause the map issued by the server to have 3D point cloud information that is inconsistent with the current environment.
  • the server updates the 3D point cloud corresponding to the delivered local map, and obtains the 3D point cloud corresponding to the updated local map.
• the server performs feature extraction on the current image to obtain the ninth 2D feature points of the current image; the methods for performing feature extraction on the current image include, but are not limited to, the scale invariant feature transform (SIFT) method and methods obtained by improving SIFT, and the extracted 2D feature points can also be called SIFT feature points.
  • the server matches the ninth 2D feature point in the current image with the pre-stored 2D feature point to obtain the eleventh 2D feature point in the current image.
• the pre-stored 2D feature points are the tenth 2D feature points in N historical images
• the N historical images are images acquired by the terminal device whose timestamps precede the timestamp of the current image, and they are the N images whose timestamps are closest to the timestamp of the current image, where N is an integer greater than 0.
• the server removes the noise in the eleventh 2D feature points of the current image; specifically, it calculates the verification value of each eleventh 2D feature point in the current image through the homography matrix, the fundamental matrix and the essential matrix; if the verification value of an eleventh 2D feature point is lower than the second preset threshold, that eleventh 2D feature point is determined to be a noise point and is deleted.
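• As a simplified stand-in for this verification, the sketch below keeps only the matches consistent with a RANSAC-estimated fundamental matrix; the exact verification value combining homography, fundamental and essential matrices is not reproduced here:

```python
import cv2

def filter_matches(pts_cur, pts_hist, thresh=3.0):
    """Remove noisy matches with an epipolar consistency check.

    pts_cur, pts_hist : (N, 2) float arrays of matched pixel coordinates in the
                        current and historical image.
    """
    F, inlier_mask = cv2.findFundamentalMat(pts_cur, pts_hist,
                                            cv2.FM_RANSAC, thresh, 0.999)
    if F is None or inlier_mask is None:
        return pts_cur, pts_hist          # too few matches to verify; keep as-is
    inliers = inlier_mask.ravel().astype(bool)
    return pts_cur[inliers], pts_hist[inliers]
```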
• the server determines the relative pose between the first pose of the current image and the first pose of the historical image, and based on the relative pose, triangulates the eleventh 2D feature point of the current image with the tenth 2D feature point in the historical image that matches it, to obtain a 3D point corresponding to the eleventh 2D feature point of the current image.
  • the server obtains M base maps from the multi-frame base maps according to the image retrieval method.
  • the similarity between each base map in the M base maps and the current image is greater than the first preset threshold, and M is an integer greater than 0.
• the image retrieval method includes but is not limited to the bag-of-words tree method or the deep learning-based NetVlad method; the eleventh 2D feature points in the current image are matched with the twelfth 2D feature points in the M base maps to obtain the seventh 2D feature points of the base maps; from the 3D point cloud corresponding to the local map, the 3D points corresponding to the seventh 2D feature points are screened out to obtain the processed 3D points.
• the processed 3D points and the 3D points corresponding to the eleventh 2D feature points are processed according to the first pose of the current image to obtain the 3D point cloud corresponding to the updated local map, so as to update the 3D point cloud corresponding to the local map in the server.
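• A compact sketch of the retrieval step, assuming each base map is represented by a precomputed global descriptor (for example a bag-of-words or NetVLAD vector); the descriptor extraction itself is outside this sketch:

```python
import numpy as np

def retrieve_base_maps(query_desc, base_descs, sim_threshold):
    """Select the base maps whose global descriptor is similar enough to the query.

    query_desc : (D,) global descriptor of the current image.
    base_descs : (B, D) descriptors of the stored base maps.
    Returns the indices of the M base maps whose cosine similarity with the
    current image exceeds `sim_threshold` (the first preset threshold).
    """
    q = query_desc / np.linalg.norm(query_desc)
    b = base_descs / np.linalg.norm(base_descs, axis=1, keepdims=True)
    sims = b @ q                                  # cosine similarity per base map
    return np.nonzero(sims > sim_threshold)[0]
```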
• since the server has strong computing power, when the server extracts features from the image, it can extract feature points from both texture-rich areas and texture-weak areas.
• the feature extraction method can be SIFT or an improved method based on SIFT; the extracted feature points can be called SIFT feature points, and their number is about 10,000.
• the 3D point cloud corresponding to the offline map stored in the server can thus be continuously updated with the images uploaded by the terminal device, so that when depth estimation is performed the 3D point cloud corresponding to the offline map stays consistent with the image content collected by the terminal device.
• the more images the terminal device uploads, the more thorough the update of the 3D point cloud corresponding to the offline map, and the more accurate the result of the depth estimation.
  • the method of the present application also includes:
  • the fourth acquisition request carries the current image and the second pose of the current image
  • the second pose of the current image is the pose of the terminal device when the current image is captured
• feature point matching is performed according to the current image and the offline map to determine the first pose of the current image; the first pose is the pose of the current image in the first coordinate system, and the first coordinate system is the coordinate system of the offline map;
• the pose transformation information is determined according to the second pose and the first pose, and the pose transformation information is used for the transformation between the second pose of the current image and the first pose; a fourth response message for responding to the fourth acquisition request is sent to the terminal device, where the fourth response message carries the pose transformation information.
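• Treating both poses as 4×4 homogeneous matrices (an assumed convention, not spelled out in the disclosure), the pose transformation information and its use can be sketched as:

```python
import numpy as np

def pose_transform_info(T_first, T_second):
    """Transform that maps the second pose (device/SLAM frame) to the first pose
    (offline-map frame); both are 4x4 camera-to-world matrices here."""
    return T_first @ np.linalg.inv(T_second)

def to_map_frame(T_info, T_second_any):
    """Convert any later pose expressed in the device frame into the map frame."""
    return T_info @ T_second_any
```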
  • feature point matching is performed according to the current image and the offline map to determine the first pose of the current image, including:
• a local map is obtained from the offline map, where the local map is the area indicated by the position in the second pose of the current image in the offline map; feature point matching is performed according to the current image and the local map to determine the first pose.
  • the current image is kept consistent with the coordinate system of the offline map stored in the server, thereby facilitating subsequent processing.
  • the second depth map is obtained by projecting the 3D point cloud obtained from the offline map stored on the server.
  • the offline map includes a panoramic map composed of multiple frames of basic maps
• the 2D feature points of the panoramic map are the same as those of the multi-frame basic maps, while the 3D points corresponding to the 2D feature points are different from those of the multi-frame basic maps; it can thus be seen that the 3D point cloud of this application is obtained based on multiple frames of images
• the second depth map obtained by projecting the 3D point cloud can therefore also be seen as being obtained based on multiple frames of images, and the target depth map obtained based on the current image, the first depth map and the second depth map can be regarded as being obtained based on multi-view depth estimation.
  • the virtual and real occlusion is performed based on the target depth map, which solves the problem of scale discrepancy and inconsistency in monocular depth estimation.
  • the server updates the 3D point cloud corresponding to the local map based on the current image, which avoids the problem of errors when the 3D point cloud corresponding to the local map is directly used to estimate the target depth map due to changes in the current scene.
  • the virtual panda can be seen through the gaps in the real trees, and the panda can also be blocked by buildings such as walls.
• the solution of this application also supports occlusion between humans and virtual objects. From the last figure, it can be seen that the method of this application can accurately estimate the depth map of the entire scene, so that the virtual panda can appear between a person's arms, giving a strong sense of overall immersion and an excellent user experience.
  • FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
  • the terminal device 100 includes:
• the 2D-3D matching module 102 is used to match the first 2D feature points of the current image with the eighth 2D feature points of a plurality of historical images to obtain the second 2D feature points of the current image, where a second 2D feature point is a 2D feature point, among the first 2D feature points of the current image, that matches an eighth 2D feature point of the multiple historical images; the relative pose between the second pose of the current image and the second pose of the historical image is determined; according to the relative pose, the second 2D feature point of the current image and the eighth 2D feature point in the historical image that matches it are triangulated to obtain the corresponding 3D point; and the 3D point is projected onto the plane where the current image is located to obtain the first depth map;
• the depth estimation module 104 is used to process the current image and the first depth map through the depth estimation model to obtain the target depth map of the current image; specifically, feature extraction is performed on the current image to obtain T first feature maps, where the resolutions of the T first feature maps are all different; feature extraction is performed on the first depth map to obtain T ninth feature maps, where the resolutions of the T ninth feature maps are all different; the T first feature maps and the T ninth feature maps are superimposed, with first feature maps and ninth feature maps of the same resolution being superimposed, to obtain T tenth feature maps; and the T tenth feature maps are upsampled and fused to obtain the target depth map of the current image.
• for the specific process of performing upsampling and fusion processing on the T tenth feature maps to obtain the target depth map of the current image, reference may be made to the relevant description of step S302, which is not repeated here.
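• A minimal PyTorch sketch of such a two-branch model, with illustrative channel sizes and layer counts (the actual network structure, training procedure and step S302 details are not reproduced here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthFusionNet(nn.Module):
    """Sketch: one encoder for the RGB image, one for the sparse depth map,
    per-scale concatenation of equal-resolution feature maps, then a small
    upsample-and-fuse decoder producing a dense depth map."""

    def __init__(self, scales=(32, 64, 128), dec_ch=64):
        super().__init__()
        self.img_enc, self.dep_enc = nn.ModuleList(), nn.ModuleList()
        in_i, in_d = 3, 1
        for c in scales:
            self.img_enc.append(nn.Sequential(
                nn.Conv2d(in_i, c, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            self.dep_enc.append(nn.Sequential(
                nn.Conv2d(in_d, c, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            in_i = in_d = c
        # One fusion conv per scale; the coarsest sees only its superimposed map,
        # finer scales also see the upsampled decoder features (dec_ch channels).
        self.fuse = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(2 * c + (0 if i == len(scales) - 1 else dec_ch),
                          dec_ch, 3, padding=1),
                nn.ReLU(inplace=True))
            for i, c in enumerate(scales))
        self.head = nn.Conv2d(dec_ch, 1, 3, padding=1)

    def forward(self, image, sparse_depth):
        feats = []
        x, d = image, sparse_depth
        for ie, de in zip(self.img_enc, self.dep_enc):
            x, d = ie(x), de(d)
            feats.append(torch.cat([x, d], dim=1))   # superimpose equal-resolution maps
        y = self.fuse[-1](feats[-1])                 # start from the coarsest scale
        for i in range(len(feats) - 2, -1, -1):      # upsample and fuse towards full size
            y = F.interpolate(y, size=feats[i].shape[-2:], mode="bilinear",
                              align_corners=False)
            y = self.fuse[i](torch.cat([y, feats[i]], dim=1))
        y = F.interpolate(y, size=image.shape[-2:], mode="bilinear",
                          align_corners=False)
        return self.head(y)                          # dense target depth map
```

• Calling the model with a (1, 3, H, W) image tensor and a (1, 1, H, W) sparse depth tensor returns a (1, 1, H, W) dense depth map.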
  • the segmentation module 108 is used to pass the current image through a segmentation network to obtain a segmentation result image, which may be called a mask image, such as a portrait mask image; the segmentation network includes a feature extraction network and a softmax classifier.
  • the feature extraction network is used to extract the features of the current image, and then perform bilinear upsampling on the features of the current image to obtain a feature map with the same size as the input, and finally obtain the label of each pixel through the softmax classifier to obtain the segmentation result map.
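• A minimal PyTorch sketch of such a segmentation network, with an illustrative backbone and class count:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleSegNet(nn.Module):
    """Sketch: small feature extraction backbone, bilinear upsampling back to the
    input size, per-pixel softmax over class labels (e.g. portrait vs. background)."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1))

    def forward(self, image):
        feat = self.backbone(image)
        logits = F.interpolate(feat, size=image.shape[-2:], mode="bilinear",
                               align_corners=False)   # back to input size
        probs = F.softmax(logits, dim=1)               # softmax classifier
        return probs.argmax(dim=1)                     # per-pixel label -> mask image
```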
  • the foreground area may include an area of a portrait
  • the background area may be an area that does not include a portrait.
• the depth map edge optimization module 105 is used to extract the edge structures of the current image and its target depth map respectively, to obtain the edge structure information of the current image and the edge structure information of the target depth map; taking the edge structure information of the current image as a reference, the difference between the edge structure of the target depth map and the edge structure of the current image is calculated, and the edge positions of the target depth map are then corrected through this difference, thereby optimizing the edges of the depth map and obtaining the optimized depth map, which is an edge-sharp depth map corresponding to the current image.
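• One possible, heavily simplified realization of this edge correction uses Canny/Sobel edge maps and a local median correction; the choice of operators and thresholds is an assumption, not fixed by the disclosure:

```python
import cv2
import numpy as np

def refine_depth_edges(image, depth, search_radius=3):
    """Align depth discontinuities with the image's edge structure.

    image : (H, W, 3) uint8 current image; depth : (H, W) float32 target depth map.
    Depth-edge pixels that do not coincide with an image edge, but have an image
    edge nearby, are treated as misplaced and smoothed towards the local median.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    img_edges = cv2.Canny(gray, 50, 150) > 0
    gx = cv2.Sobel(depth, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(depth, cv2.CV_32F, 0, 1)
    depth_edges = np.hypot(gx, gy) > 0.1 * depth.max()

    refined = depth.copy()
    h, w = depth.shape
    ys, xs = np.nonzero(depth_edges & ~img_edges)
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - search_radius), min(h, y + search_radius + 1)
        x0, x1 = max(0, x - search_radius), min(w, x + search_radius + 1)
        if img_edges[y0:y1, x0:x1].any():
            refined[y, x] = np.median(depth[y0:y1, x0:x1])
    return refined
```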
• the virtual and real occlusion application module 107 is used to segment the optimized depth map according to the segmentation result map to obtain the foreground depth map and the background depth map; the target image is obtained according to the foreground depth map, the background depth map, the depth map of the virtual object, the current image and the virtual object image. For any pixel point P in the target image, if the first depth value is greater than the second depth value, the pixel value of the pixel point P is the pixel value of the corresponding pixel point of P in the virtual object image; if the first depth value is not greater than the second depth value, the pixel value of the pixel point P is the pixel value of the corresponding pixel point of P in the current image; here, the first depth value is the depth value corresponding to the pixel point P of the target image in the foreground depth map or the background depth map, and the second depth value is the depth value corresponding to the pixel point P of the target image in the depth map of the virtual object; the target image is displayed on the terminal device.
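• The per-pixel composition rule can be sketched directly with NumPy; `scene_depth` stands for the foreground/background depth at each pixel and `virtual_depth` for the virtual object's depth map (pixels where the virtual object is absent are assumed to carry an infinite virtual depth):

```python
import numpy as np

def compose_occlusion(current_image, virtual_image, scene_depth, virtual_depth):
    """Where the scene depth (first depth value) is greater than the virtual
    object's depth (second depth value), the virtual object is in front and its
    pixel is shown; otherwise the real image pixel is kept."""
    show_virtual = scene_depth > virtual_depth
    return np.where(show_virtual[..., None], virtual_image, current_image)
```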
  • FIG. 8 is a schematic structural diagram of another terminal device provided by an embodiment of the present application. As shown in FIG. 8, the terminal device 100 includes:
• the 2D-3D matching module 102 is used to match the first 2D feature points of the current image with the eighth 2D feature points of a plurality of historical images to obtain the second 2D feature points of the current image, where a second 2D feature point is a 2D feature point, among the first 2D feature points of the current image, that matches an eighth 2D feature point of the multiple historical images; the relative pose between the second pose of the current image and the second pose of the historical image is determined; according to the relative pose, the second 2D feature point of the current image and the eighth 2D feature point in the historical image that matches it are triangulated to obtain the corresponding 3D point; and the 3D point is projected onto the plane where the current image is located to obtain the first depth map;
• the depth estimation module 104 is used to process the current image and the first depth map through the depth estimation model to obtain the target depth map of the current image; specifically, feature extraction is performed on the current image to obtain T first feature maps, where the resolutions of the T first feature maps are all different; feature extraction is performed on the first depth map to obtain T ninth feature maps, where the resolutions of the T ninth feature maps are all different; the T first feature maps and the T ninth feature maps are superimposed, with first feature maps and ninth feature maps of the same resolution being superimposed, to obtain T tenth feature maps; and the T tenth feature maps are upsampled and fused to obtain the target depth map of the current image.
• for the specific process of performing upsampling and fusion processing on the T tenth feature maps to obtain the target depth map of the current image, reference may be made to the relevant description of step S302, which is not repeated here.
  • the segmentation module 108 is configured to pass the current image through a segmentation network to obtain a segmentation result graph, and the segmentation result graph may be called a mask graph, such as a portrait mask graph.
  • the segmentation network includes a feature extraction network and a softmax classifier.
  • the feature extraction network is used to extract the features of the current image, and then perform bilinear upsampling on the features of the current image to obtain a feature map with the same size as the input, and finally obtain the label of each pixel through the softmax classifier to obtain the segmentation result map.
  • the foreground area may include an area of a portrait
  • the background area may be an area that does not include a portrait.
• the depth map edge optimization module 105 is used to extract the edge structures of the current image and its target depth map respectively, to obtain the edge structure information of the current image and the edge structure information of the target depth map; taking the edge structure information of the current image as a reference, the difference between the edge structure of the target depth map and the edge structure of the current image is calculated, and the edge positions of the target depth map are then corrected through this difference, thereby optimizing the edges of the depth map and obtaining the optimized depth map, which is an edge-sharp depth map corresponding to the current image.
  • the multi-view fusion module 106 is used for segmenting the optimized depth map according to the segmentation result map to obtain a foreground depth map and a background depth map, where the background depth map is the depth map containing the background area in the optimized depth map, and the foreground depth map is the optimized depth map containing the depth map of the foreground area; the L background depth maps are fused according to the L second poses corresponding to the L background depth maps to obtain the fused 3D scene; the L background depth maps The image includes the background depth map of the pre-stored image and the background depth map of the current image, and the L second poses include the second pose of the pre-stored image and the second pose of the current image; L is greater than 1 Integer; back-project the fused 3D scene according to the second pose of the current image to obtain the fused background depth map; stitch the fused background depth map and the foreground depth map of the current image to obtain the updated depth map;
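• A compact sketch of this fusion, reusing the point-cloud projection helper sketched earlier and assuming 4×4 world-to-camera poses and shared intrinsics K (conventions are illustrative):

```python
import numpy as np

def fuse_background_depths(bg_depths, poses_cw, K, cur_pose_cw, cur_shape):
    """Lift every background depth map into a shared 3D scene using its pose,
    then re-project the merged points into the current view to obtain the fused
    background depth map (see project_point_cloud above)."""
    world_pts = []
    for depth, pose_cw in zip(bg_depths, poses_cw):
        v, u = np.nonzero(depth > 0)
        z = depth[v, u]
        # Back-project valid pixels to camera coordinates, then to world coordinates.
        pts_cam = np.linalg.inv(K) @ (np.stack([u, v, np.ones_like(u)]) * z)
        pts_h = np.vstack([pts_cam, np.ones_like(z)])
        world_pts.append((np.linalg.inv(pose_cw) @ pts_h)[:3].T)
    cloud = np.concatenate(world_pts, axis=0)
    return project_point_cloud(cloud, cur_pose_cw, K, *cur_shape)
```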
  • the virtual and real occlusion application module 107 is used to obtain the target image according to the updated depth map, the depth map of the virtual object, the current image and the virtual object image. For any pixel point P in the target image, if the first depth value is greater than the second depth value, the pixel value of the pixel point P is the pixel value of the corresponding pixel point of the pixel point P in the virtual object image; if the first depth value is not greater than the second depth value, then the pixel value of the pixel point P is the current image. The pixel value of the corresponding pixel of the pixel P in the ; Display the target image on the terminal device.
  • FIG. 9 is a schematic structural diagram of a system provided by an embodiment of the present application.
  • the system includes a terminal device 100 and a server 200, wherein the terminal device 100 includes: a 2D-3D matching module 102, a depth estimation module 104, a depth map edge optimization module 105, a segmentation module 108, and a virtual-real occlusion application Module 107; the server 200 includes a VPS positioning and map issuing module 101 and a local map updating module 103;
• the VPS positioning and map issuing module 101 is used to calculate the first pose of the current image according to the VPS after receiving the current image, where the first pose is the pose in the coordinate system where the offline map is located; a local map is obtained from the offline map according to the position in the first pose.
• the local map is an area in the offline map that is centered on the position in the first pose and lies within a certain range (for example, a radius of 50 meters); the offline map includes the 2D feature points of objects and their corresponding feature descriptors.
• the module receives the second pose of the current image and determines the pose transformation information between the first pose and the second pose of the current image; it sends the 3D point cloud corresponding to the local map and the pose transformation information to the 2D-3D matching module 102.
• the 2D-3D matching module 102 is used to process the current image information according to the first pose of the current image to obtain a first image, the first image being the image obtained by transforming the second pose of the current image into the first pose; after receiving the pose transformation information, the module processes the second pose of the current image according to the pose transformation information to obtain the first pose of the current image, and then processes the current image information according to the first pose of the current image to obtain the first image; feature extraction is performed on the first image to obtain the third 2D feature points of the first image; the third 2D feature points are matched with the eighth 2D feature points of multiple historical images to obtain the fourth 2D feature points of the first image, where a fourth 2D feature point is a 2D feature point, among the third 2D feature points, that matches a pre-stored 2D feature point; according to the fourth 2D feature point, the first pose, the 2D feature point among the eighth 2D feature points that matches the fourth 2D feature point, and the pose of the image to which that 2D feature point belongs, the 3D point corresponding to the fourth 2D feature point is obtained;
• the depth estimation module 104 is used to process the current image and the first depth map through the depth estimation model to obtain the target depth map of the current image; specifically, feature extraction is performed on the current image to obtain T first feature maps, where the resolutions of the T first feature maps are all different; feature extraction is performed on the first depth map to obtain T ninth feature maps, where the resolutions of the T ninth feature maps are all different; the T first feature maps and the T ninth feature maps are superimposed, with first feature maps and ninth feature maps of the same resolution being superimposed, to obtain T tenth feature maps; and the T tenth feature maps are upsampled and fused to obtain the target depth map of the current image.
• for the specific process of performing upsampling and fusion processing on the T tenth feature maps to obtain the target depth map of the current image, reference may be made to the relevant description of step S302, which is not repeated here.
• the local map update module 103 is configured to obtain M frames of basic maps from the multi-frame basic maps according to the current image, where the similarity between each of the M basic maps and the current image is greater than the first threshold; match the ninth 2D feature points of the current image with the twelfth 2D feature points in the M-frame basic maps to obtain the seventh 2D feature points in the basic maps; screen out, from the 3D point cloud corresponding to the local map, the 3D points corresponding to the seventh 2D feature points to obtain the processed 3D points; match the ninth 2D feature points of the current image with the tenth 2D feature points in the multiple historical images to obtain the eleventh 2D feature points of the current image, where an eleventh 2D feature point is a 2D feature point, among the ninth 2D feature points of the current image, that matches a tenth 2D feature point in the historical image; according to the eleventh 2D feature point, the first pose of the current image, the 2D feature point among the 2D feature points of the multiple historical images that matches the eleventh 2D feature point, and the first pose of the historical image to which that 2D feature point belongs, obtain the 3D point corresponding to the eleventh 2D feature point; process the processed 3D points and the 3D points corresponding to the eleventh 2D feature points according to the first pose of the current image to obtain the 3D point cloud corresponding to the updated local map, so as to update the 3D points corresponding to the local map; and project the 3D point cloud corresponding to the updated local map onto the imaging plane of the current image to obtain the second depth map;
• the depth estimation module 104 is used to process the current image and the third depth map through the depth estimation model of the current image to obtain the target depth map of the current image; specifically, multi-scale feature extraction is performed on the current image to obtain T first feature maps, and multi-scale feature extraction is performed on the third depth map to obtain T second feature maps; the resolutions of the T first feature maps are all different, and the resolutions of the T second feature maps are all different; the first feature maps and the second feature maps with the same resolution are superimposed to obtain T third feature maps; the T third feature maps are upsampled and fused to obtain the target depth map of the current image, where the third depth map is obtained by splicing the first depth map and the second depth map.
• the depth estimation module 104 is configured to process the current image, the reference depth map and the third depth map through the depth estimation model of the current image to obtain the target depth map of the current image. Specifically, multi-scale feature extraction is performed on the current image to obtain T first feature maps, multi-scale feature extraction is performed on the third depth map to obtain T second feature maps, and multi-scale feature extraction is performed on the reference depth map to obtain T fourth feature maps; the resolutions of the T first feature maps are all different, the resolutions of the T second feature maps are all different, the resolutions of the T fourth feature maps are all different, and T is an integer greater than 1; the first feature maps, second feature maps and fourth feature maps with the same resolution are superimposed to obtain T fifth feature maps; upsampling and fusion processing are performed on the T fifth feature maps to obtain the target depth map of the current image, where the third depth map is obtained by splicing the first depth map and the second depth map.
  • the segmentation module 108 is configured to pass the current image through a segmentation network to obtain a segmentation result graph, and the segmentation result graph may be called a mask graph, such as a portrait mask graph.
  • the segmentation network includes a feature extraction network and a softmax classifier.
  • the feature extraction network is used to extract the features of the current image, and then perform bilinear upsampling on the features of the current image to obtain a feature map with the same size as the input, and finally obtain the label of each pixel through the softmax classifier to obtain the segmentation result map.
  • the foreground area may include an area of a portrait
  • the background area may be an area that does not include a portrait.
• the depth map edge optimization module 105 is used to extract the edge structures of the current image and its target depth map respectively, to obtain the edge structure information of the current image and the edge structure information of the target depth map; taking the edge structure information of the current image as a reference, the difference between the edge structure of the target depth map and the edge structure of the current image is calculated, and the edge positions of the target depth map are then corrected through this difference, thereby optimizing the edges of the depth map and obtaining the optimized depth map, which is an edge-sharp depth map corresponding to the current image.
• the virtual and real occlusion application module 107 is used to segment the optimized depth map according to the segmentation result map to obtain the foreground depth map and the background depth map; the target image is obtained according to the foreground depth map, the background depth map, the depth map of the virtual object, the current image and the virtual object image. For any pixel point P in the target image, if the first depth value is greater than the second depth value, the pixel value of the pixel point P is the pixel value of the corresponding pixel point of P in the virtual object image; if the first depth value is not greater than the second depth value, the pixel value of the pixel point P is the pixel value of the corresponding pixel point of P in the current image; here, the first depth value is the depth value corresponding to the pixel point P of the target image in the foreground depth map or the background depth map, and the second depth value is the depth value corresponding to the pixel point P of the target image in the depth map of the virtual object; the target image is displayed on the terminal device.
  • FIG. 10 is a schematic structural diagram of another system provided by an embodiment of the present application.
  • the system includes a terminal device 100 and a server 200, wherein the terminal device 100 includes: a 2D-3D matching module 102, a depth estimation module 104, a depth map edge optimization module 105, a segmentation module 108, and a virtual-real occlusion application Module 107; the server 200 includes a VPS positioning and map issuing module 101 and a local map updating module 103;
• the VPS positioning and map issuing module 101 is used to calculate the first pose of the current image according to the VPS after receiving the current image, where the first pose is the pose in the coordinate system where the offline map is located; a local map is obtained from the offline map according to the position in the first pose.
• the local map is an area in the offline map that is centered on the position in the first pose and lies within a certain range (for example, a radius of 50 meters); the offline map includes the 2D feature points of objects and their corresponding feature descriptors.
• the module receives the second pose of the current image and determines the pose transformation information between the first pose and the second pose of the current image; it sends the 3D point cloud corresponding to the local map and the pose transformation information to the 2D-3D matching module 102.
• the 2D-3D matching module 102 is used to process the current image information according to the first pose of the current image to obtain a first image, the first image being the image obtained by transforming the second pose of the current image into the first pose; after receiving the pose transformation information, the module processes the second pose of the current image according to the pose transformation information to obtain the first pose of the current image, and then processes the current image information according to the first pose of the current image to obtain the first image; feature extraction is performed on the first image to obtain the third 2D feature points of the first image; the third 2D feature points are matched with the eighth 2D feature points of multiple historical images to obtain the fourth 2D feature points of the first image, where a fourth 2D feature point is a 2D feature point, among the third 2D feature points, that matches an eighth 2D feature point; according to the fourth 2D feature point, the first pose, the 2D feature point among the eighth 2D feature points that matches the fourth 2D feature point, and the pose of the image to which that 2D feature point belongs, the 3D point corresponding to the fourth 2D feature point is obtained;
• the depth estimation module 104 is used to process the current image and the third depth map through the depth estimation model to obtain the target depth map of the current image; specifically, feature extraction is performed on the current image to obtain T first feature maps, where the resolutions of the T first feature maps are all different; feature extraction is performed on the third depth map to obtain T second feature maps, where the resolutions of the T second feature maps are all different; the T first feature maps and the T second feature maps are superimposed, with first feature maps and second feature maps of the same resolution being superimposed, to obtain T third feature maps; the T third feature maps are upsampled and fused to obtain the target depth map of the current image; the third depth map is obtained by splicing the first depth map and the second depth map.
• for the specific process of performing upsampling and fusion processing to obtain the target depth map of the current image, reference may be made to the relevant description of step S302, which is not repeated here.
• the local map update module 103 is configured to obtain M frames of basic maps from the multi-frame basic maps according to the current image, where the similarity between each of the M basic maps and the current image is greater than the first threshold; match the ninth 2D feature points of the current image with the twelfth 2D feature points in the M-frame basic maps to obtain the seventh 2D feature points in the basic maps; screen out, from the 3D point cloud corresponding to the local map, the 3D points corresponding to the seventh 2D feature points to obtain the processed 3D points; match the ninth 2D feature points of the current image with the tenth 2D feature points in the multiple historical images to obtain the eleventh 2D feature points of the current image, where an eleventh 2D feature point is a 2D feature point, among the ninth 2D feature points of the current image, that matches a tenth 2D feature point in the historical image; according to the eleventh 2D feature point, the first pose of the current image, the 2D feature point among the 2D feature points of the multiple historical images that matches the eleventh 2D feature point, and the first pose of the historical image to which that 2D feature point belongs, obtain the 3D point corresponding to the eleventh 2D feature point; process the processed 3D points and the 3D points corresponding to the eleventh 2D feature points to obtain the 3D point cloud corresponding to the updated local map, so as to update the 3D points corresponding to the local map; and project the 3D point cloud corresponding to the updated local map onto the imaging plane of the current image to obtain the second depth map;
• the depth estimation module 104 is used to process the current image and the third depth map through the depth estimation model of the current image to obtain the target depth map of the current image; specifically, multi-scale feature extraction is performed on the current image to obtain T first feature maps, and multi-scale feature extraction is performed on the third depth map to obtain T second feature maps; the resolutions of the T first feature maps are all different, and the resolutions of the T second feature maps are all different; the first feature maps and the second feature maps with the same resolution are superimposed to obtain T third feature maps; the T third feature maps are upsampled and fused to obtain the target depth map of the current image; the third depth map is obtained by splicing the first depth map and the second depth map.
• the depth estimation module 104 is configured to process the current image, the reference depth map and the third depth map through the depth estimation model of the current image to obtain the target depth map of the current image. Specifically, multi-scale feature extraction is performed on the current image to obtain T first feature maps, multi-scale feature extraction is performed on the third depth map to obtain T second feature maps, and multi-scale feature extraction is performed on the reference depth map to obtain T fourth feature maps; the resolutions of the T first feature maps are all different, the resolutions of the T second feature maps are all different, the resolutions of the T fourth feature maps are all different, and T is an integer greater than 1; the first feature maps, second feature maps and fourth feature maps with the same resolution are superimposed to obtain T fifth feature maps; upsampling and fusion processing are performed on the T fifth feature maps to obtain the target depth map of the current image, where the third depth map is obtained by splicing the first depth map and the second depth map.
• the segmentation module 108 is configured to pass the current image through a segmentation network to obtain a segmentation result graph, and the segmentation result graph may be called a mask graph, such as a portrait mask graph.
  • the segmentation network includes a feature extraction network and a softmax classifier.
  • the feature extraction network is used to extract the features of the current image, and then perform bilinear upsampling on the features of the current image to obtain a feature map with the same size as the input, and finally obtain the label of each pixel through the softmax classifier to obtain the segmentation result map.
  • the foreground area may include an area of a portrait
  • the background area may be an area that does not include a portrait.
• the depth map edge optimization module 105 is used to extract the edge structures of the current image and its target depth map respectively, to obtain the edge structure information of the current image and the edge structure information of the target depth map; taking the edge structure information of the current image as a reference, the difference between the edge structure of the target depth map and the edge structure of the current image is calculated, and the edge positions of the target depth map are then corrected through this difference, thereby optimizing the edges of the depth map and obtaining the optimized depth map, which is an edge-sharp depth map corresponding to the current image.
  • the multi-view fusion module 106 is used for segmenting the optimized depth map according to the segmentation result map to obtain a foreground depth map and a background depth map, where the background depth map is the depth map containing the background area in the optimized depth map, and the foreground depth map is the optimized depth map containing the depth map of the foreground area; the L background depth maps are fused according to the L second poses corresponding to the L background depth maps to obtain the fused 3D scene; the L background depth maps The image includes the background depth map of the pre-stored image and the background depth map of the current image, and the L second poses include the second pose of the pre-stored image and the second pose of the current image; L is greater than 1 Integer; back-project the fused 3D scene according to the second pose of the current image to obtain the fused background depth map; stitch the fused background depth map and the foreground depth map of the current image to obtain the updated depth map;
  • the virtual and real occlusion application module 107 is used to obtain the target image according to the updated depth map, the depth map of the virtual object, the current image and the virtual object image. For any pixel point P in the target image, if the first depth value is greater than the second depth value, then the pixel value of the pixel point P is the pixel value of the corresponding pixel point of the pixel point P in the virtual object image; if the first depth value is not greater than the second depth value, then the pixel value of the pixel point P is the current image. The pixel value of the corresponding pixel of the pixel P in the ; Display the target image on the terminal device.
  • the reference depth map is a depth map collected by the TOF camera.
• the above-mentioned TOF camera collects the depth map at a frame rate lower than a preset frame rate, and the resolution of the depth map is lower than a preset resolution; the terminal device 100 projects the depth map collected by the TOF camera into three-dimensional space according to the second pose of the current image to obtain a fourth depth map; the fourth depth map is projected onto the reference image according to the pose of the reference image to obtain the reference depth map, where the reference image is an image adjacent to the current image in acquisition time.
• the TOF camera can be the camera of the terminal device 100, or it can be the camera of another terminal device; after the TOF camera of the other terminal device collects the depth map, the other terminal device sends the collected depth map to the terminal device 100.
  • the depth map collected by the TOF can also be introduced, thereby improving the accuracy of the target depth map of the current image.
  • FIG. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in Figure 11, the terminal device 1100 includes:
  • an acquisition unit 1101 used to acquire the current image
  • a calculation unit 1102 configured to calculate and obtain the first depth map corresponding to the current image according to the current image
• the obtaining unit 1101 is further configured to obtain a second depth map corresponding to the current image from the server according to the current image, where the depth information of the current image represented by the second depth map is richer than the depth information of the current image represented by the first depth map, and the first depth map includes depth information of feature points that are not available in the second depth map;
• the depth estimation unit 1103 is used to obtain the target depth map of the current image according to the current image, the first depth map and the second depth map, where the depth information of the feature points in the target depth map is richer;
  • the acquiring unit 1101 is further configured to acquire an image of a virtual object and a depth map of the virtual object;
  • the determining unit 1104 is configured to obtain the target image according to the target depth map, the depth map of the virtual object, the current image and the virtual object image.
  • the target depth map and the depth map of the virtual object are used to determine the representation and distribution of the pixels of the virtual object image and the pixels of the current image in the target image.
  • the terminal device 1100 further includes:
  • the sending unit 1105 is configured to send a first obtaining request to the server to obtain the 3D point cloud corresponding to the local map;
  • the receiving unit 1106 is configured to receive a first response message sent by the server for responding to the first acquisition request, where the first response message carries the 3D point cloud corresponding to the local map, and the local map includes scene data corresponding to the current image;
  • the obtaining unit 1101 is specifically configured to:
  • the second pose of the current image is the pose of the terminal device when the current image is captured
  • the obtaining unit 1101 is further specifically configured to:
  • the second 2D feature point is the 2D feature point that matches the eighth 2D feature point in the historical image in the first 2D feature point of the current image; according to the second pose of the current image, the second 2D feature point of the current image , the 2D feature point that matches the second 2D feature point in the eighth 2D feature point in the historical image and the second pose of the historical image to obtain the 3D point corresponding to the second 2D feature point of the current image; the second pose is the pose of the terminal device 1100 when the current image is captured;
  • the acquiring unit 1101 is specifically used for:
  • the current image is projected according to the second pose of the current image, the 3D point cloud corresponding to the local map, and the 3D point corresponding to the second 2D feature point of the current image to obtain a second depth map.
• the first 2D feature point and the eighth 2D feature point include ORB feature points, AKAZE feature points, DOG feature points, HOG feature points, BRIEF feature points, BRISK feature points, FREAK feature points, ASLFeat feature points, or SuperPoint feature points.
  • the terminal device 1100 further includes:
  • the sending unit 1105 is configured to send a second acquisition request to the server, where the second acquisition request carries the current image, and the second acquisition request instructs the server to obtain the second depth map according to the current image and the offline map stored by the server;
  • the receiving unit 1106 is configured to receive a second response message sent by the server in response to the second acquisition request, where the second response message carries the second depth map.
  • the computing unit 1102 is specifically configured to:
  • the first pose of the current image which is the pose of the current image in the first coordinate system;
  • the first coordinate system is the coordinate system where the offline map stored by the server is located, and the current image is determined according to the first pose process to obtain a first image;
  • the first image is an image obtained by transforming the second pose of the current image into the first pose;
  • the second pose of the current image is the pose when the terminal device shoots the current image;
• feature extraction is performed on the first image to obtain the third 2D feature point of the first image;
  • the third 2D feature point is matched with the eighth 2D feature point of a plurality of historical images to obtain the fourth 2D feature point of the first image;
• the fourth 2D feature point is the 2D feature point, among the third 2D feature points, that matches the eighth 2D feature points of the multiple historical images; according to the fourth 2D feature point, the first pose of the current image, the 2D feature point among the eighth 2D feature points that matches the fourth 2D feature point, and the first pose of the historical image to which that 2D feature point belongs, the 3D point corresponding to the fourth 2D feature point is obtained;
  • the third 2D feature points include ORB feature points, AKAZE feature points, DOG feature points, HOG feature points, BRIEF feature points, BRISK feature points, FREAK feature points, ASLFeat feature points or SuperPoint feature points.
  • the terminal device 1100 includes:
  • the sending unit 1105 is configured to send a third acquisition request to the server, where the third acquisition request carries the current image, and the third acquisition request is used for requesting to acquire the first pose of the current image;
  • the receiving unit 1106 is configured to receive a third response message sent by the server for responding to the third acquisition request, where the third response message carries the first pose of the current image.
  • the terminal device 1100 includes:
  • the sending unit 1105 is configured to send a fourth acquisition request to the server, where the fourth acquisition request carries the current image and the second pose of the current image,
  • the receiving unit 1106 is configured to receive a fourth response message sent by the server for responding to the fourth acquisition request, where the fourth response message carries pose transformation information, and the pose transformation information is used for the second pose of the current image and the current image The transformation between the first poses of ,
  • the obtaining unit 1101 is specifically used for:
  • the terminal device includes:
  • the sending unit 1105 is configured to send a fifth acquisition request to the server; the fifth acquisition request carries the geographic location information of the terminal;
  • a receiving unit 1106, configured to receive a fifth response message sent by the server for responding to the fifth acquisition request, where the fifth response message carries a depth estimation model;
  • the depth estimation model is a neural network model corresponding to the geographic location information of the terminal;
  • the determining unit 1104 is specifically used for:
  • the first depth map and the second depth map are spliced to obtain a third depth map; the current image and the third depth map are input into the depth estimation model to obtain the target depth map.
  • the determining unit 1104 is specifically configured to:
  • the initial convolutional neural network is trained to obtain a depth estimation model, including:
  • the depth estimation unit 1103 is specifically configured to:
• the resolutions of the first feature maps in the T first feature maps are all different, and the resolutions of the second feature maps in the T second feature maps are all different; T is an integer greater than 1; the third depth map is obtained by splicing the first depth map and the second depth map;
• the first feature maps and the second feature maps with the same resolution are superimposed, and the resulting feature maps are upsampled and fused to obtain the target depth map of the current image.
  • the depth estimation unit 1103 is specifically configured to:
• upsampling and fusion processing are performed on the T fifth feature maps to obtain the target depth map of the current image.
  • the reference depth map is obtained according to the image collected by the TOF camera, and specifically includes:
  • the reference image is the image adjacent to the current image in acquisition time.
  • the upsampling and fusion processing includes:
  • the determining unit 1104 is specifically configured to:
• the edges of the target depth map are optimized according to the current image to obtain the optimized depth map, whose accuracy is higher than that of the target depth map; the optimized depth map is segmented to obtain the foreground depth map and the background depth map of the current image, where the background depth map is the depth map containing the background area in the optimized depth map, and the foreground depth map is the depth map containing the foreground area in the optimized depth map; the L background depth maps are fused according to the L poses corresponding to the L background depth maps to obtain a fused three-dimensional scene; the L background depth maps include the background depth maps of the pre-stored images and the background depth map of the current image, and the L poses include the poses of the pre-stored images and the pose of the current image; L is an integer greater than 1; the fused 3D scene is back-projected according to the first pose of the current image to obtain the fused background depth map; the fused background depth map and the foreground depth map of the current image are spliced to obtain an updated depth map; and the virtual object and the current image are processed according to the updated depth map and the depth map of the virtual object to obtain the target image.
  • the current image includes the target person
  • the terminal device 1100 further includes:
  • a detection unit 1107 configured to perform target person detection on the optimized depth map to obtain a detection result
  • the determining unit 1104 is specifically used for:
  • the optimized depth map is segmented according to the detection result to obtain a foreground depth map and a background depth map of the current image, wherein the foreground depth map of the current image includes the depth map corresponding to the target person.
• the above units (the acquisition unit 1101, calculation unit 1102, depth estimation unit 1103, determination unit 1104, sending unit 1105, receiving unit 1106, and detection unit 1107) are used to execute the relevant steps of the above method.
  • the terminal device 1100 is presented in the form of a unit.
  • a "unit” here may refer to an application-specific integrated circuit (ASIC), a processor and memory executing one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the above-described functions .
  • the above acquisition unit 1101, calculation unit 1102, depth estimation unit 1103, determination unit 1104 and detection unit 1107 may be implemented by the processor 1301 of the terminal device shown in FIG. 13 .
  • FIG. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
  • the server 1200 includes:
  • a receiving unit 1201 configured to receive a depth estimation model request message sent by a terminal device, where the request message carries geographic location information of the terminal device;
• the obtaining unit 1202 is used to obtain the depth estimation model of the current image from the plurality of depth estimation models stored in the server 1200 according to the geographic location information of the terminal device; the depth estimation model of the current image is the depth estimation model, among the multiple depth estimation models, that corresponds to the geographic location information of the terminal device; the multiple depth estimation models are in one-to-one correspondence with multiple pieces of geographic location information;
  • the sending unit 1203 is configured to send a response message in response to the depth estimation model request message to the terminal device, where the response message carries the depth estimation model of the current image.
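A toy sketch of this location-based model selection on the server side; the registry contents, coordinates, file names, message fields and the naive planar distance are all invented for illustration and are not part of the patent:

```python
import math

# Hypothetical registry: one depth-estimation model per surveyed location (lat, lon).
MODEL_REGISTRY = {
    (22.54, 114.06): "models/location_a.onnx",
    (39.90, 116.40): "models/location_b.onnx",
}

def pick_model(lat, lon):
    """Return the model registered for the location closest to the device location
    (one-to-one correspondence between locations and models)."""
    def dist(loc):
        return math.hypot(loc[0] - lat, loc[1] - lon)   # naive planar distance
    return MODEL_REGISTRY[min(MODEL_REGISTRY, key=dist)]

def handle_model_request(request):
    """Sketch of the exchange: the request carries the device's geographic location,
    the response carries the matching depth estimation model."""
    lat, lon = request["location"]
    return {"depth_estimation_model": pick_model(lat, lon)}
```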
  • the server 1200 further includes:
  • the training unit 1204 is used to obtain, by training for each of the plurality of pieces of geographic location information respectively, the depth estimation models in one-to-one correspondence with the plurality of pieces of geographic location information; for any geographic location information S among them, training is performed to obtain the depth estimation model corresponding to the geographic location information S:
  • the receiving unit 1201 is further configured to receive a first acquisition request sent by the terminal device;
  • the obtaining unit 1202 is further configured to obtain a local map from the offline map according to the position in the first pose of the current image, where the local map is the area in the offline map indicated by the position in the first pose of the current image, and to obtain the 3D point cloud corresponding to the local map; the first pose of the current image is the pose of the current image in the first coordinate system, and the first coordinate system is the coordinate system where the offline map is located;
  • the sending unit 1203 is further configured to send a first response message for responding to the first acquisition request to the terminal device, where the first response message carries the 3D point cloud corresponding to the local map.
  • the receiving unit 1201 is further configured to receive a second acquisition request sent by the terminal device, where the second acquisition request carries the current image;
  • the server 1200 further includes:
  • the sending unit 1203 is further configured to send a second response message for responding to the second acquisition request to the terminal device, where the second response message carries the second depth map.
  • the determining unit 1205 is specifically configured to:
  • determine the first pose of the current image according to the current image, where the first pose of the current image is the pose of the current image in the first coordinate system, and the first coordinate system is the coordinate system where the offline map is located; obtain, according to the position in the first pose, the 3D point cloud corresponding to the local map from the 3D point cloud corresponding to the offline map, where the local map is the area indicated by that position in the offline map stored by the server 1200; and project onto the current image according to the first pose and the 3D point cloud corresponding to the local map to obtain the second depth map.
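A minimal numpy sketch of how such a projection could produce the sparse second depth map, assuming a pinhole intrinsic matrix K, a 4x4 camera-to-world first pose, and a local map selected as the offline-map points within a fixed radius of the position in that pose (the radius, the function names and the absence of z-buffering are assumptions of the sketch):

```python
import numpy as np

def crop_local_map(points_world, position, radius=30.0):
    """Keep only the offline-map 3D points within `radius` of the position taken
    from the first pose (the 'local map')."""
    return points_world[np.linalg.norm(points_world - position, axis=1) < radius]

def sparse_depth_from_points(points_world, K, T_cw, h, w):
    """Project local-map points into the current view to obtain the sparse second depth map."""
    T_wc = np.linalg.inv(T_cw)                                  # world-to-camera
    pts_cam = (np.c_[points_world, np.ones(len(points_world))] @ T_wc.T)[:, :3]
    z = pts_cam[:, 2]
    front = z > 1e-6
    u = np.round(K[0, 0] * pts_cam[front, 0] / z[front] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_cam[front, 1] / z[front] + K[1, 2]).astype(int)
    depth = np.zeros((h, w))
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[ok], u[ok]] = z[front][ok]    # sparse result: most pixels stay 0
    return depth
```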
  • the determining unit 1205 is further configured to:
  • match the ninth 2D feature points of the current image with the tenth 2D feature points in the historical image to obtain the eleventh 2D feature points of the current image, where the eleventh 2D feature points are the 2D feature points, among the ninth 2D feature points of the current image, that match the tenth 2D feature points in the historical image; obtain the 3D points corresponding to the eleventh 2D feature points of the current image according to the eleventh 2D feature points, the 2D feature points, among the tenth 2D feature points in the historical image, that match the eleventh 2D feature points, the first pose of the current image and the first pose of the historical image; and project onto the current image according to the first pose of the current image, the 3D point cloud corresponding to the local map, and the 3D points corresponding to the eleventh 2D feature points of the current image to obtain the second depth map.
  • the determining unit 1205 is further configured to:
  • the 3D point cloud corresponding to the local map and the 3D points corresponding to the eleventh 2D feature points of the current image are projected onto the current image to obtain the second depth map.
  • the offline map includes a multi-frame base map
  • the server 1200 further includes an update unit 1206, and the update unit 1206 is specifically configured to:
  • the ninth 2D feature points of the current image are matched with the tenth 2D feature points in the multiple historical images to obtain the eleventh 2D feature points of the current image, where the eleventh 2D feature points are the 2D feature points, among the ninth 2D feature points of the current image, that match the tenth 2D feature points in the historical images; the 3D points corresponding to the eleventh 2D feature points are obtained according to the eleventh 2D feature points, the first pose of the current image, the 2D feature points, among the tenth 2D feature points of the multiple historical images, that match the eleventh 2D feature points, and the first poses of the historical images to which those 2D feature points belong; and the 3D points corresponding to the eleventh 2D feature points are processed according to the first pose of the current image to obtain the 3D point cloud corresponding to the updated local map;
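One common way to obtain such 3D points from a pair of matched 2D feature points and the two first poses is linear (DLT) triangulation; the sketch below assumes shared pinhole intrinsics K and 4x4 camera-to-world poses, which the patent does not prescribe:

```python
import numpy as np

def triangulate(p1, p2, K, T_cw1, T_cw2):
    """Linear (DLT) triangulation of one match: pixel p1 in the current image and
    pixel p2 in a historical image, given their first poses (camera-to-world)."""
    P1 = K @ np.linalg.inv(T_cw1)[:3]   # 3x4 projection matrix of the current image
    P2 = K @ np.linalg.inv(T_cw2)[:3]   # 3x4 projection matrix of the historical image
    A = np.stack([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                 # 3D point in the offline-map coordinate system
```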
  • the determining unit 1205 is specifically used for:
  • the ninth 2D feature point, the eleventh 2D feature point, and the twelfth 2D feature point all include SIFT feature points, SURF feature points, ASLFeat feature points, SuperPoint feature points, R2D2 feature points, or D2Net feature points.
  • the receiving unit 1201 is further configured to receive a third acquisition request from the terminal device, where the third acquisition request carries the current image;
  • the determining unit 1205 is also used to perform feature point matching according to the current image and the offline map to determine the first pose of the current image, where the first pose is the pose of the current image in the first coordinate system, and the first coordinate system is the coordinate system of the offline map;
  • the sending unit 1203 is further configured to send a third response message for responding to the third acquisition request to the terminal device, where the third response message carries the first pose.
  • the receiving unit 1201 is further configured to receive a fourth acquisition request from the terminal device, where the fourth acquisition request carries the current image and the second pose of the current image; the second pose of the current image is The pose of the terminal device when capturing the current image;
  • the determining unit 1205 is also used to perform feature point matching according to the current image and the offline map to determine the first pose of the current image, where the first pose is the pose of the current image in the first coordinate system, and the first coordinate system is the coordinate system of the offline map; the pose transformation information is determined according to the second pose and the first pose of the current image, and the pose transformation information is used for the transformation between the second pose and the first pose of the current image;
  • the sending unit 1203 is further configured to send a fourth response message for responding to the fourth acquisition request to the terminal device, where the fourth response message carries the pose transformation information.
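If both poses are expressed as 4x4 homogeneous matrices, one natural form of such pose transformation information is the rigid transform that maps the second (device) coordinate system onto the first (offline-map) coordinate system; this is only a sketch of one possible convention, not the patent's prescribed representation:

```python
import numpy as np

def pose_transform(T_first, T_second):
    """Transform T such that T_first = T @ T_second, i.e. T re-expresses poses given
    in the second (device) coordinate system in the first (offline-map) one."""
    return T_first @ np.linalg.inv(T_second)

def to_first_frame(T_any_in_second, T):
    """Re-express any pose given in the second coordinate system in the first one."""
    return T @ T_any_in_second
```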
  • the above units (receiving unit 1201, acquiring unit 1202, transmitting unit 1203, training unit 1204, determining unit 1205, and updating unit 1206) are used to execute the relevant steps of the above method.
  • the receiving unit 1201 is used to execute the relevant content of step S501
  • the acquiring unit 1202, the training unit 1204, the determining unit 1205 and the updating unit 1206 are used to execute the relevant content of step S502
  • the sending unit 1203 is used to execute the relevant content of step S503.
  • the server 1200 is presented in the form of a unit.
  • a "unit" here may refer to an application-specific integrated circuit (ASIC), a processor and memory executing one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the above-described functions.
  • the above obtaining unit 1202 , training unit 1204 , determining unit 1205 and updating unit 1206 may be implemented by the processor 1401 of the server shown in FIG. 14 .
  • the terminal device 1300 can be implemented with the structure shown in FIG. 13 .
  • the terminal device 1300 includes at least one processor 1301 , at least one memory 1302 and at least one communication interface 1303 .
  • the processor 1301, the memory 1302, and the communication interface 1303 are connected through the communication bus and complete communication with each other.
  • the processor 1301 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the above programs.
  • the communication interface 1303 is used to communicate with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
  • the memory 1302 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; it may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto.
  • the memory can exist independently and be connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory 1302 is used to store the application code for executing the above solutions, and the execution is controlled by the processor 1301 .
  • the processor 1301 is used for executing the application code stored in the memory 1302 .
  • the code stored in the memory 1302 can execute the image processing method provided above, for example: obtain the current image; calculate the first depth map according to the current image; obtain the second depth map from the server according to the current image, where the density of pixels in the second depth map is higher than the density of pixels in the first depth map, the depth information of the current image represented by the second depth map is richer than that represented by the first depth map, and the number of pixels in the union of the first depth map and the second depth map is greater than the number of pixels in the second depth map; obtain the target depth map of the current image according to the current image, the first depth map and the second depth map, where the depth information of the current image represented by the target depth map is richer than that represented by the second depth map; obtain the virtual object image and the depth map of the virtual object; and obtain the target image, that is, the image with the virtual and real occlusion effect, according to the target depth map, the depth map of the virtual object, the current image and the virtual object image, where the target depth map and the depth map of the virtual object are used to determine, for any position P, the pixel corresponding to the position P in the target image from the pixel at the position P in the current image and the pixel at the position P in the virtual object image.
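The per-pixel selection described above can be sketched as follows, assuming depth value 0 marks pixels with no depth (or no virtual object) and ignoring soft edges and transparency; this is an illustrative reading, not the patent's exact rule:

```python
import numpy as np

def compose_occlusion(current_img, virtual_img, real_depth, virtual_depth):
    """At each position P, take the virtual-object pixel only where the virtual object
    is present and not behind the real scene; otherwise keep the current-image pixel."""
    virtual_valid = virtual_depth > 0
    real_valid = real_depth > 0
    show_virtual = virtual_valid & (~real_valid | (virtual_depth <= real_depth))
    return np.where(show_virtual[..., None], virtual_img, current_img)
```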
  • the server 1400 can be implemented with the structure shown in FIG. 14 .
  • the server 1400 includes at least one processor 1401 , at least one memory 1402 and at least one communication interface 1403 .
  • the processor 1401 , the memory 1402 , the display 1404 and the communication interface 1403 are connected through the communication bus and complete mutual communication.
  • the processor 1401 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the above programs.
  • the communication interface 1403 is used to communicate with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
  • the memory 1402 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; it may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto.
  • the memory can exist independently and be connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory 1402 is used to store the application code for executing the above solution, and the execution is controlled by the processor 1401 .
  • the processor 1401 is used for executing the application code stored in the memory 1402 .
  • the code stored in the memory 1402 can execute the image processing method provided above, for example: receive a depth estimation model request message sent by the terminal device, where the request message carries the geographic location information of the terminal device; obtain the depth estimation model of the current image from the multiple depth estimation models according to the geographic location information, where the depth estimation model of the current image is the depth estimation model, among the multiple depth estimation models, corresponding to the geographic location information of the terminal device, and the multiple depth estimation models are in one-to-one correspondence with multiple pieces of geographic location information; and send a response message in response to the depth estimation model request message to the terminal device, where the response message carries the depth estimation model of the current image.
  • Embodiments of the present application further provide a computer storage medium, wherein the computer storage medium may store a program, and when the program is executed, the program includes part or all of the steps of any of the image processing methods described in the above method embodiments.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling, direct coupling or communication connection may be implemented through some interfaces; the indirect coupling or communication connection between apparatuses or units may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable memory.
  • the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the above method embodiments.
  • the aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present invention belongs to the field of augmented reality and relates in particular to an image processing method and a related device. The method comprises the steps of: obtaining a current image; calculating, according to the current image, a first depth map corresponding to the current image, and obtaining from a server, according to the current image, a second depth map corresponding to the current image, the depth information of the feature points in the current image represented by the second depth map being richer than the depth information of the feature points in the current image represented by the first depth map, and the first depth map containing depth map information of feature points that are not included in the second depth map; obtaining a target depth map of the current image according to the current image, the first depth map and the second depth map; obtaining a virtual object image and a depth map of a virtual object; and obtaining a target image according to the target depth map, the depth map of the virtual object, the current image and the virtual object image. The present invention solves the problem of ambiguity and scale inconsistency in the virtual-real occlusion effect.
PCT/CN2021/113635 2020-09-10 2021-08-19 Procédé de traitement d'image et dispositif associé WO2022052782A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180062229.6A CN116097307A (zh) 2020-09-10 2021-08-19 图像的处理方法及相关设备

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010950951.0 2020-09-10
CN202010950951.0A CN114170290A (zh) 2020-09-10 2020-09-10 图像的处理方法及相关设备

Publications (1)

Publication Number Publication Date
WO2022052782A1 true WO2022052782A1 (fr) 2022-03-17

Family

ID=80475882

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/113635 WO2022052782A1 (fr) 2020-09-10 2021-08-19 Procédé de traitement d'image et dispositif associé

Country Status (2)

Country Link
CN (2) CN114170290A (fr)
WO (1) WO2022052782A1 (fr)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578432B (zh) * 2022-09-30 2023-07-07 北京百度网讯科技有限公司 图像处理方法、装置、电子设备及存储介质
CN115620181B (zh) * 2022-12-05 2023-03-31 海豚乐智科技(成都)有限责任公司 基于墨卡托坐标切片的航拍图像实时拼接方法


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107483845A (zh) * 2017-07-31 2017-12-15 广东欧珀移动通信有限公司 拍照方法及其装置
US20190102938A1 (en) * 2017-09-29 2019-04-04 Baidu Online Network Technology (Beijing) Co., Ltd Method and Apparatus for Presenting Information
US20190356905A1 (en) * 2018-05-17 2019-11-21 Niantic, Inc. Self-supervised training of a depth estimation system
CN110895822A (zh) * 2018-09-13 2020-03-20 虹软科技股份有限公司 深度数据处理系统的操作方法
CN110599533A (zh) * 2019-09-20 2019-12-20 湖南大学 适用于嵌入式平台的快速单目深度估计方法
CN110889890A (zh) * 2019-11-29 2020-03-17 深圳市商汤科技有限公司 图像处理方法及装置、处理器、电子设备及存储介质
CN111612831A (zh) * 2020-05-22 2020-09-01 创新奇智(北京)科技有限公司 一种深度估计方法、装置、电子设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115436488A (zh) * 2022-08-31 2022-12-06 南京智慧基础设施技术研究院有限公司 一种基于视觉与声纹融合的自引导自调适移动检测系统及方法
CN115436488B (zh) * 2022-08-31 2023-12-15 南京智慧基础设施技术研究院有限公司 一种基于视觉与声纹融合的自引导自调适移动检测系统及方法

Also Published As

Publication number Publication date
CN116097307A (zh) 2023-05-09
CN114170290A (zh) 2022-03-11

Similar Documents

Publication Publication Date Title
US11232286B2 (en) Method and apparatus for generating face rotation image
CN111598998B (zh) 三维虚拟模型重建方法、装置、计算机设备和存储介质
WO2021043168A1 (fr) Procédé d'entraînement de réseau de ré-identification de personnes et procédé et appareil de ré-identification de personnes
WO2022052782A1 (fr) Procédé de traitement d'image et dispositif associé
WO2021018163A1 (fr) Procédé et appareil de recherche de réseau neuronal
JP6798183B2 (ja) 画像解析装置、画像解析方法およびプログラム
US20220222776A1 (en) Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
WO2022179581A1 (fr) Procédé de traitement d'images et dispositif associé
CN111797983A (zh) 一种神经网络构建方法以及装置
CN110222718B (zh) 图像处理的方法及装置
WO2021042774A1 (fr) Procédé de récupération d'image, procédé d'entraînement de réseau de récupération d'image, dispositif, et support de stockage
TWI643137B (zh) 物件辨識方法及物件辨識系統
WO2022100419A1 (fr) Procédé de traitement d'images et dispositif associé
WO2021249114A1 (fr) Dispositif de suivi de cible et procédé de suivi de cible
WO2023083030A1 (fr) Procédé de reconnaissance de posture et dispositif associé
WO2022165722A1 (fr) Procédé, appareil et dispositif d'estimation de profondeur monoculaire
US20230401799A1 (en) Augmented reality method and related device
CN116194951A (zh) 用于基于立体视觉的3d对象检测与分割的方法和装置
CN113284055A (zh) 一种图像处理的方法以及装置
CN104463962B (zh) 基于gps信息视频的三维场景重建方法
US20220215617A1 (en) Viewpoint image processing method and related device
CN117237547B (zh) 图像重建方法、重建模型的处理方法和装置
WO2022083118A1 (fr) Procédé de traitement de données et dispositif associé
CN113886510A (zh) 一种终端交互方法、装置、设备及存储介质
Zhang et al. Video extrapolation in space and time

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21865838

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21865838

Country of ref document: EP

Kind code of ref document: A1