CN114170290A - Image processing method and related equipment - Google Patents


Info

Publication number
CN114170290A
CN114170290A
Authority
CN
China
Prior art keywords
current image
map
depth map
feature
depth
Prior art date
Legal status
Pending
Application number
CN202010950951.0A
Other languages
Chinese (zh)
Inventor
曾柏伟
陈兵
柳跃天
王国毅
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010950951.0A
Priority to PCT/CN2021/113635
Priority to CN202180062229.6A
Publication of CN114170290A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from multiple images from stereo images
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/006 Mixed reality
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10021 Stereoscopic video; Stereoscopic image sequence

Abstract

The application relates to the field of AR, and in particular to an image processing method and apparatus. The method includes: acquiring a current image and a virtual object image, and acquiring a first depth map and a second depth map of the current image according to the current image, wherein the second depth map is acquired from a server; performing depth estimation according to the current image, the first depth map and the second depth map to obtain a target depth map of the current image; and displaying the virtual object image and the current image in an overlapping manner according to the target depth map of the current image. A high-precision target depth map is obtained by performing depth estimation based on the current image, the first depth map and the second depth map, which alleviates the inter-frame flicker and instability of the subsequent virtual-real occlusion effect.

Description

Image processing method and related equipment
Technical Field
The present application relates to the field of Augmented Reality (AR), and in particular, to an image processing method and related device.
Background
AR supplements the real world with computer-generated virtual information that appears to coexist with the real world in the same space. Most existing AR applications simply superimpose a virtual object in front of the real scene without correctly handling the occlusion relationship between the virtual object and the real world, which easily confuses the user's sense of spatial position and keeps the sensory experience from going beyond reality. The effect without virtual-real occlusion is shown in FIG. 1a. In AR applications, the real spatial relationship between the virtual object and the real scene, i.e. virtual-real occlusion, needs to be handled. The effect with virtual-real occlusion is shown in FIG. 1b. A correct occlusion relationship allows the user to form a natural and correct spatial perception in the AR application; a false occlusion relationship reduces the realism of the AR application.
When a false virtual-real occlusion relationship exists in the fused image, it is difficult for an observer to correctly judge the relative positions of the virtual and real objects and to obtain a vivid virtual-real fusion effect; the false occlusion relationship easily confuses the observer's sense of direction and spatial position, making the virtual-real fusion result look unreal. Therefore, in order to enhance the realism of a virtual object in a real scene and obtain a realistic virtual-real fusion effect, solving the occlusion problem in AR is of great significance. Virtual-real occlusion processing in AR focuses on enabling a virtual object to be correctly occluded by an object located in front of it in the real scene. Generally, means such as depth extraction or scene modeling are used to obtain the occlusion relationship between the real objects and the virtual object in the virtual-real fusion scene, the occlusion edge of the foreground object in the real image is extracted, and finally a virtual-real fusion image with a correct occlusion relationship is generated.
Apple's latest iPad Pro (2020) uses the RGB image and the depth map collected by a direct time-of-flight (dToF) camera as input to achieve virtual-real occlusion of the whole scene. Virtual-real occlusion is performed based on the depth map acquired by the dToF camera, portrait segmentation and monocular depth estimation; however, monocular depth estimation suffers from scale ambiguity and inconsistency, so the final virtual-real occlusion effect exhibits inter-frame flicker, instability and blurred occlusion edges.
Disclosure of Invention
The embodiments of the application provide an image processing method and related device. The method adopts a cloud local-map updating algorithm, which avoids the errors introduced when the offline map no longer matches the current scene but its map points are still used directly in depth map estimation; at the same time, the scale ambiguity and inter-frame instability of monocular depth estimation are alleviated, which in turn alleviates the flicker and inter-frame inconsistency of the virtual-real occlusion effect. An optimized depth map is obtained by performing edge optimization on the target depth map of the current image, and multi-frame depth maps are then fused to obtain a depth map with sharper portrait edges, which helps improve the virtual-real occlusion effect.
In a first aspect, an embodiment of the present application provides an image processing method, including:
acquiring a current image and a virtual object image, and acquiring a first depth map and a second depth map of the current image according to the current image, wherein the second depth map is acquired from a server; and performing feature extraction according to the current image, the first depth map and the second depth map of the current image, and obtaining a target depth map of the current image according to the result of the feature extraction.
Depth estimation is performed based on the current image, the first depth map and the second depth map, so that a high-precision target depth map is obtained.
In one possible embodiment, the method of this implementation further includes:
and displaying the virtual object image and the current image in an overlapping manner according to the target depth map of the current image.
By introducing the high-precision target depth map, the inter-frame flicker and instability of the virtual-real occlusion effect are alleviated.
In one possible embodiment, obtaining a first depth map of a current image from the current image includes:
performing feature extraction on the current image to obtain a first 2D feature point of the current image; matching the first 2D characteristic point of the current image with a pre-stored 2D characteristic point to obtain a second 2D characteristic point in the current image; acquiring a 3D point corresponding to a second 2D feature point in the current image according to the second 2D feature point in the current image and the pre-stored corresponding relation between the 2D feature point and the 3D point; wherein the first depth map of the current image comprises 3D points corresponding to the second 2D feature points in the current image.
A feature point is a point where the image gray value changes sharply, or a point of large curvature on an image edge (i.e., the intersection of two edges).
The pre-stored correspondence between 2D feature points and 3D points means that, for each pre-stored 2D feature point, there is a 3D point corresponding to that 2D feature point.
The 2D feature point matching referred to in the present application specifically means that the similarity of two matched 2D feature points is higher than a preset similarity.
The 2D feature points of the current image are matched with the pre-stored 2D feature points, the first depth map is obtained according to the matching result, and then depth estimation can be carried out on the basis of the first depth map so as to obtain the high-precision target depth map.
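As an illustrative sketch of this matching step (Python with OpenCV), the first depth map may be assembled as follows. The feature type (ORB), the Hamming-distance threshold and the storage layout of the pre-stored 2D/3D correspondences are assumptions for illustration and are not fixed by this embodiment.

```python
import cv2

def build_first_depth_map(current_image, stored_descriptors, stored_3d_points,
                          max_match_distance=40):
    """Match 2D feature points of the current image against pre-stored 2D
    feature points and look up the pre-stored 2D->3D correspondence to form
    a sparse first depth map (returned here as pixel -> 3D point pairs)."""
    orb = cv2.ORB_create()                                    # feature extraction
    keypoints, descriptors = orb.detectAndCompute(current_image, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(descriptors, stored_descriptors)

    sparse_depth = {}                                         # (u, v) -> (X, Y, Z)
    for m in matches:
        if m.distance > max_match_distance:                   # similarity threshold
            continue
        u, v = map(int, keypoints[m.queryIdx].pt)             # second 2D feature point
        sparse_depth[(u, v)] = stored_3d_points[m.trainIdx]   # corresponding 3D point
    return sparse_depth
```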
In one possible embodiment, obtaining a first depth map of a current image from the current image includes:
performing feature extraction on the current image to obtain a first 2D feature point of the current image; matching the first 2D feature point of the current image against the 2D feature points of the local map acquired from the server to obtain a third 2D feature point in the local map; acquiring a 3D point corresponding to the third 2D feature point in the local map according to the third 2D feature point in the local map and the correspondence between the 2D feature points and the 3D points in the local map; wherein the first depth map of the current image comprises the 3D points corresponding to the third 2D feature points in the local map.
The current image and the local map are subjected to 2D feature point matching, a first depth map is obtained according to a matching result, and then depth estimation can be carried out on the basis of the first depth map so as to obtain a high-precision target depth map.
In a possible embodiment, matching the first 2D feature point of the current image according to the 2D feature point of the local map to obtain a third 2D feature point in the local map includes:
acquiring a target map from the local map according to the first pose, wherein the position of the target map in the local map is associated with the position indicated by the angle information in the first pose; the first pose is obtained by converting the pose obtained by the terminal equipment according to the current image into a pose under a world coordinate system, and the 2D feature points in the target map are matched with the first 2D feature points of the current image to obtain third 2D feature points of the target map, wherein the third 2D feature points of the local map comprise the third 2D feature points of the target map.
The world coordinate system is the absolute coordinate system of the system; before a user coordinate system is established, the coordinates of all points in the picture are determined with respect to the origin of this coordinate system.
The target map is determined from the local map through the first pose, the matching range is narrowed, and the efficiency of determining the first depth map is improved.
In a possible embodiment, performing feature extraction according to the current image, the first depth map and the second depth map, and obtaining a target depth map of the current image according to a result of the feature extraction includes:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing feature extraction on the third depth map to obtain T second feature maps; the resolution ratio of each first feature map in the T first feature maps is different, and the resolution ratio of each second feature map in the T second feature maps is different; t is an integer greater than 1; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; carrying out up-sampling and fusion processing on the T third feature maps to obtain a target depth map of the current image; the third depth map is the first depth map, or the third depth map is obtained by stitching the first depth map and the second depth map.
The multi-scale feature extraction specifically refers to an operation of convolving an image by using a plurality of different convolution kernels.
The term "superimposing" in this application specifically refers to processing the superimposed images at the pixel level, for example, two superimposed images include size H × W, and the size of the superimposed image is H × 2W or 2H × W; for another example, the three superimposed images include a size H × W, and the superimposed images have a size H × 3W, or 3H × W.
Depth estimation is performed based on the first depth map and the second depth map to obtain a high-precision target depth map, so that the inter-frame flicker and instability of the subsequent virtual-real occlusion effect are alleviated.
In a possible embodiment, performing feature extraction according to the current image, the first depth map and the second depth map, and obtaining a target depth map of the current image according to a result of the feature extraction includes:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing multi-scale feature extraction on the third depth map to obtain T second feature maps; performing multi-scale feature extraction on the reference depth map to obtain T fourth feature maps, wherein the resolution of each first feature map in the T first feature maps is different, the resolution of each second feature map in the T second feature maps is different, and the resolution of each fourth feature map in the T fourth feature maps is different; the reference depth map is obtained from a depth map collected by a time of flight (TOF) camera, T being an integer greater than 1; superposing the first feature map, the second feature map and the fourth feature map with the same resolution in the T first feature maps, the T second feature maps and the T fourth feature maps to obtain T fifth feature maps; carrying out upsampling and fusion processing on the T fifth feature maps to obtain a target depth map of the current image; the third depth map is obtained by splicing the first depth map and the second depth map, or the third depth map is the first depth map.
The depth map acquired by the TOF camera is introduced during depth estimation, which further improves the precision of the target depth map of the current image and alleviates the inter-frame flicker and instability of the subsequent virtual-real occlusion effect.
In a possible embodiment, the reference depth map is acquired from an image acquired by a TOF camera, and specifically includes:
projecting the depth map acquired by the TOF camera into a three-dimensional space according to the pose of the current image to obtain a corresponding fourth depth map; back projecting the fourth depth map onto the reference image according to the pose of the reference image to obtain a reference depth map; the reference image is an image adjacent to the current image in acquisition time; the resolution ratio of the depth map acquired by the TOF camera is lower than the preset resolution ratio, and the frame rate of the TOF camera acquiring the depth map is lower than the preset frame rate.
In order to reduce the power consumption of the terminal equipment, the frame rate of the depth map acquired by the TOF camera and the resolution of the depth map are reduced.
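The projection and back-projection described above may be sketched as follows (Python with NumPy). The pinhole intrinsic matrix K shared by both views, the 4×4 camera-to-world pose convention and the absence of lens distortion are assumptions for illustration.

```python
import numpy as np

def tof_to_reference_depth(tof_depth, K, pose_cur, pose_ref):
    """Project the TOF depth map into 3D space using the pose of the current
    image, then back-project the 3D points onto the reference image using its
    pose to obtain the reference depth map."""
    h, w = tof_depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = tof_depth.reshape(-1)
    valid = z > 0                                       # keep measured pixels only
    u, v, z = u.reshape(-1)[valid], v.reshape(-1)[valid], z[valid]

    rays = np.linalg.inv(K) @ np.stack([u * z, v * z, z])          # current camera frame
    world = pose_cur @ np.vstack([rays, np.ones(rays.shape[1])])   # into 3D space
    cam_ref = np.linalg.inv(pose_ref) @ world                      # reference camera frame

    x, y, d = cam_ref[0], cam_ref[1], cam_ref[2]
    front = d > 1e-6                                    # drop points behind the camera
    x, y, d = x[front], y[front], d[front]
    uu = np.round(K[0, 0] * x / d + K[0, 2]).astype(int)
    vv = np.round(K[1, 1] * y / d + K[1, 2]).astype(int)

    ref_depth = np.zeros_like(tof_depth)
    inside = (uu >= 0) & (uu < w) & (vv >= 0) & (vv < h)
    ref_depth[vv[inside], uu[inside]] = d[inside]
    return ref_depth
```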
In a possible embodiment, the upsampling and merging process includes:
S1: upsample the feature map P'_j to obtain a feature map P''_j, where the resolution of P''_j is the same as that of the (j+1)-th feature map P_{j+1} in the processing object; the width of P_{j+1} is (j+1) times the width of the feature map with the minimum resolution in the processing object; j is an integer greater than 0 and smaller than T, and T is the number of feature maps in the processing object;
S2: fuse the feature map P''_j with the feature map P_{j+1} to obtain a third feature map P'_{j+1};
S3: let j = j+1, and repeat S1-S3 until j = T-1;
wherein, when j = 1, the feature map P'_j is the feature map with the lowest resolution in the processing object, and when j = T-1, the third feature map P'_{j+1} is the result of the upsampling and fusion processing.
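A minimal sketch of steps S1-S3 (Python with PyTorch); the bilinear upsampling and the element-wise addition used for fusion are assumptions, since the concrete upsampling and fusion operators are not fixed here.

```python
import torch
import torch.nn.functional as F

def upsample_and_fuse(feature_maps):
    """Steps S1-S3: iteratively upsample the running fused map to the
    resolution of the next feature map and fuse the two, from the
    lowest-resolution map P_1 up to P_T."""
    fused = feature_maps[0]                            # P'_1: lowest-resolution map
    for j in range(len(feature_maps) - 1):             # S3: repeat until j = T-1
        nxt = feature_maps[j + 1]                      # P_{j+1}
        up = F.interpolate(fused, size=nxt.shape[-2:],
                           mode='bilinear', align_corners=False)  # S1: P''_j
        fused = up + nxt                               # S2: fuse into P'_{j+1}
    return fused                                       # result of the processing

# e.g. T = 4 feature maps whose width is 1x, 2x, 3x, 4x that of the smallest map
maps = [torch.randn(1, 8, 16 * (j + 1), 12 * (j + 1)) for j in range(4)]
out = upsample_and_fuse(maps)
print(out.shape)   # torch.Size([1, 8, 64, 48])
```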
In a possible embodiment, performing feature extraction according to the current image, the first depth map and the second depth map, and obtaining a target depth map of the current image according to a result of the feature extraction includes:
and inputting the current image and the third depth map into a depth estimation model of the current image for feature extraction, and obtaining a target depth map of the current image according to a feature extraction result, wherein the depth estimation model is realized on the basis of a convolutional neural network.
In one possible embodiment, the method of the present application further comprises:
sending a depth estimation model acquisition request to a server, wherein the depth estimation model acquisition request carries a current image and the position of the terminal equipment; and receiving a response message which is sent by the server and corresponds to the depth estimation model acquisition request, wherein the response message carries the depth estimation model of the current image, and the depth estimation model of the current image is acquired by the server according to the current image and the position of the terminal equipment in the world coordinate system.
Alternatively, the world coordinate system may be a Universal Transverse Mercator (UTM) grid coordinate system, a GPS coordinate system, or the like.
By acquiring the depth estimation model from the server, the terminal device does not need to train the depth estimation model itself, which reduces the power consumption of the terminal device and improves the real-time performance of virtual-real occlusion.
In one possible embodiment, the method of the present application further comprises:
training the initial convolutional neural network model to obtain a depth estimation model;
wherein training the initial convolutional neural network to obtain a depth estimation model comprises:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps; calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function; adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the current image; wherein the loss function is determined based on an error between the predicted depth map and the true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
It should be noted that the above describes only one training iteration; in practice, the process is repeated in this manner until the calculated loss value converges, and the convolutional neural network model obtained when the loss value converges is determined as the depth estimation model of the current image.
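A sketch of the loss described above (Python with PyTorch), combining the depth error, the depth-gradient error and the surface-normal error. The use of L1 terms, the derivation of normals from the depth gradients and the equal weighting of the three terms are assumptions; the embodiment only names the three error components.

```python
import torch
import torch.nn.functional as F

def depth_estimation_loss(pred, gt):
    """Loss combining the depth error, the depth-gradient error and the
    surface-normal error between the predicted depth map and the real
    (ground-truth) depth map. pred and gt are depth tensors of shape (B, H, W)."""
    l_depth = torch.mean(torch.abs(pred - gt))           # error between the depths

    def grads(d):                                        # image-space depth gradients
        gx = d[..., :, 1:] - d[..., :, :-1]
        gy = d[..., 1:, :] - d[..., :-1, :]
        return gx, gy

    pgx, pgy = grads(pred)
    tgx, tgy = grads(gt)
    l_grad = torch.mean(torch.abs(pgx - tgx)) + torch.mean(torch.abs(pgy - tgy))

    # normals derived from the depth gradients: n ~ (-dz/dx, -dz/dy, 1)
    pn = torch.stack([-pgx[..., :-1, :], -pgy[..., :, :-1],
                      torch.ones_like(pgx[..., :-1, :])], dim=-1)
    tn = torch.stack([-tgx[..., :-1, :], -tgy[..., :, :-1],
                      torch.ones_like(tgx[..., :-1, :])], dim=-1)
    l_normal = torch.mean(1.0 - F.cosine_similarity(pn, tn, dim=-1))

    return l_depth + l_grad + l_normal
```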
In one possible embodiment, the displaying of the virtual object image and the current image in superposition according to the target depth map of the current image includes:
performing edge optimization on a target depth map of a current image to obtain an optimized depth map; and displaying the virtual object image and the current image in an overlapping manner according to the optimized depth map.
Edge optimization is performed on the target depth map of the current image to obtain a depth map with sharp portrait edges, which helps improve the virtual-real occlusion effect.
In one possible embodiment, the displaying of the virtual object image and the current image in superposition according to the target depth map of the current image includes:
segmenting the optimized depth map to obtain a foreground depth map and a background depth map of the current image, wherein the background depth map is a depth map containing a background region in the optimized depth map, the foreground depth map is a depth map containing a foreground region in the optimized depth map, and the optimized depth map is obtained by performing edge optimization on a target depth map of the current image; fusing the L background depth maps according to L positions corresponding to the L background depth maps respectively to obtain a fused three-dimensional scene; the L pieces of background depth maps comprise a background depth map of a prestored image and a background depth map of a current image, and the L positions and postures comprise positions and postures of the prestored image and the current image; l is an integer greater than 1; carrying out back projection on the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map; splicing the fused background depth map and the foreground depth map of the current image to obtain an updated depth map; and overlapping and displaying the virtual object image and the current image according to the updated depth map.
It should be noted that the foreground area is the area where the object of interest is located, such as a human, an automobile, an animal, a plant, or other salient object; the background region is a region in the image other than the foreground region.
The optimized depth map is obtained by performing edge optimization on the target depth map of the current image, and the multi-frame depth maps are then fused to obtain a depth map with sharper edges, which helps improve the virtual-real occlusion effect.
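As a sketch of the final splicing and overlay steps (Python with NumPy): the fused background depth map and the foreground depth map are spliced into the updated depth map, and the virtual object is drawn only where it lies in front of the real scene. The boolean foreground mask and the availability of a per-pixel virtual-object depth are assumed inputs.

```python
import numpy as np

def splice_depth(fused_background_depth, foreground_depth, foreground_mask):
    """Splice the fused background depth map with the foreground depth map of
    the current image to obtain the updated depth map."""
    updated = fused_background_depth.copy()
    updated[foreground_mask] = foreground_depth[foreground_mask]
    return updated

def overlay_virtual_object(current_rgb, virtual_rgb, virtual_depth, scene_depth):
    """Overlay the virtual object image on the current image: a virtual pixel
    is shown only where the virtual object is closer than the real scene."""
    visible = (virtual_depth > 0) & (virtual_depth < scene_depth)
    out = current_rgb.copy()
    out[visible] = virtual_rgb[visible]
    return out
```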
In a second aspect, an embodiment of the present application provides another image processing method, including:
receiving a depth estimation model acquisition request sent by terminal equipment, wherein the depth estimation model acquisition request carries a current image acquired by the terminal equipment and the position of the terminal equipment; acquiring a depth estimation model of the current image from a plurality of depth estimation models stored in a server according to the position of the current image; and sending a response message responding to the depth estimation model acquisition request to the terminal equipment, wherein the response message carries the depth estimation model of the current image.
In order to improve the accuracy of depth estimation, in a server, a depth estimation model is obtained by training for each position independently; when the depth estimation is carried out, a depth estimation model of the current image is obtained from the server based on the current image and the position of the terminal equipment.
In one possible embodiment, obtaining the depth estimation model of the current image from a plurality of depth estimation models stored in a server according to the position of the current image includes:
acquiring a plurality of frames of first images according to the position of the terminal equipment, wherein the plurality of frames of first images are images in a preset range taking the position of the terminal equipment as the center in a basic map, and acquiring a target image from the plurality of frames of first images, wherein the target image is the image with the highest similarity with the current image in the plurality of frames of first images; and determining the depth estimation model corresponding to the target image as the depth estimation model of the current image.
Specifically, a plurality of frames of first images are obtained according to the position of the terminal device, the plurality of frames of first images are images in a preset range with the position of the terminal device as the center in a basic map, a target image is obtained from the plurality of frames of first images, and the target image is an image with the highest similarity with the current image in the plurality of frames of first images; obtaining the pose of the current image according to the pose of the target image; and determining a depth estimation model corresponding to the position from the server according to the position in the pose of the current image, wherein the depth estimation model is the depth estimation model of the current image.
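A sketch of this retrieval step (Python with NumPy); the list-of-dicts layout of the base map, the global image descriptor used for similarity and the search radius are assumptions for illustration.

```python
import numpy as np

def select_depth_model(device_position, current_descriptor, base_map, radius=50.0):
    """Search only the first images within a preset range centered on the
    reported device position, pick the frame most similar to the current
    image, and return its associated depth estimation model and pose."""
    candidates = [f for f in base_map
                  if np.linalg.norm(f["position"] - device_position) < radius]
    best = max(candidates,
               key=lambda f: float(np.dot(f["descriptor"], current_descriptor)))
    return best["depth_model"], best["pose"]
```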
In one possible embodiment, the method of the present application further comprises:
training is performed on each of a plurality of frames of first images respectively to obtain a depth estimation model of each frame of first image in the plurality of frames of first images,
the method comprises the following steps of training each frame of first image in a plurality of frames of first images to obtain a depth estimation model of each first image:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps; calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function; adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the first image of each frame; wherein the loss function is determined based on an error between the predicted depth map and the true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
It should be noted that the above describes only one training iteration; in practice, the process is repeated in this manner until the calculated loss value converges, and the convolutional neural network model obtained when the loss value converges is determined as the depth estimation model of each frame of first image.
In a possible embodiment, the method of this embodiment further includes:
acquiring an initial depth map of a current image according to the current image and a pre-stored image; obtaining a fifth depth map according to the current image and the 3D point corresponding to the local map; and optimizing the initial depth map and the fifth depth map according to the pose of the current image to obtain a second depth map.
Optionally, the pre-stored image is uploaded by the terminal device, and the timestamp of the pre-stored image is located before the timestamp of the current image.
In one possible embodiment, obtaining an initial depth map of a current image from the current image and a pre-stored image includes:
matching the first 2D characteristic point of the current image with a pre-stored 2D characteristic point of the image to obtain a sixth 2D characteristic point of the current image; removing noise points in the sixth 2D characteristic point of the current image to obtain a seventh 2D characteristic point of the current image; performing triangularization calculation on each 2D feature point in the seventh 2D feature point of the current image to obtain an initial 3D point of the seventh 2D feature point of the current image in space; the initial depth map of the current image includes an initial 3D point in space of a seventh 2D feature point of the current image.
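The triangulation of the matched 2D feature points may be sketched as follows (Python with OpenCV). The world-to-camera pose convention and a shared intrinsic matrix K are assumptions for illustration.

```python
import cv2
import numpy as np

def triangulate_matches(pts_cur, pts_stored, K, pose_cur, pose_stored):
    """Triangulate matched (and denoised) 2D feature points of the current
    image and a pre-stored image into the initial 3D points that form the
    initial depth map. pts_cur / pts_stored are N x 2 pixel coordinates;
    poses are 4x4 world-to-camera matrices."""
    P1 = K @ pose_cur[:3]                       # 3x4 projection matrix, current image
    P2 = K @ pose_stored[:3]                    # 3x4 projection matrix, pre-stored image
    pts4d = cv2.triangulatePoints(P1, P2,
                                  pts_cur.T.astype(np.float64),
                                  pts_stored.T.astype(np.float64))   # 4 x N homogeneous
    pts3d = (pts4d[:3] / pts4d[3]).T            # N x 3 initial 3D points in space
    return pts3d
```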
In one possible embodiment, the fifth depth map is obtained from the 3D points corresponding to the current image and the local map, including
Acquiring M maps from a multi-frame basic map, wherein the similarity between each map in the M maps and a current image is greater than a first preset threshold; m is an integer greater than 0; matching the 2D characteristic points of the M maps with the first 2D characteristic point of the current image to obtain a plurality of characteristic point matching pairs; each feature point matching pair in the multiple feature point matching pairs comprises a fourth 2D feature point and a fifth 2D feature point, the fourth 2D feature point and the fifth 2D feature point are mutually matched feature points, the fourth 2D feature point is a first 2D feature point of the current image, and the fifth 2D feature point is a 2D feature point in the M maps; acquiring a 3D point corresponding to each fourth 2D feature point in a plurality of feature point matching pairs according to the corresponding relation between each fifth 2D feature point and the 3D point in the M maps; and acquiring a fifth depth map according to the 3D point corresponding to the local map and the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs, wherein the fifth depth map comprises the 3D point which is matched with the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs in the 3D points corresponding to the local map.
In this way, the offline map in the server can be continuously updated with the images uploaded by the terminal device, so that the offline map stays consistent with the surrounding environment when the user uses it and provides high-precision 3D point cloud information. The cloud local map obtained in this way remains consistent with the environment in which the user operates, and the more images the user uploads, the more thorough and accurate the update becomes, which alleviates the inter-frame flicker and instability of the subsequent virtual-real occlusion effect.
In a third aspect, an embodiment of the present application provides a terminal device, including:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a current image and acquiring a first depth map and a second depth map of the current image according to the current image, and the second depth map is acquired from a server;
and the estimation unit is used for extracting features according to the current image, the first depth map and the second depth map of the current image and obtaining a target depth map of the current image according to the result of the feature extraction.
In a possible embodiment, the obtaining unit is further configured to obtain a virtual object image;
the terminal device further includes:
and the superposition display unit is used for displaying the virtual object image and the current image in a superposition mode according to the target depth map of the current image.
In a possible embodiment, in terms of acquiring the first depth map of the current image from the current image, the acquisition unit is specifically configured to:
performing feature extraction on the current image to obtain a first 2D feature point of the current image; matching the first 2D characteristic point of the current image with a pre-stored 2D characteristic point to obtain a second 2D characteristic point in the current image; acquiring a 3D point corresponding to a second 2D feature point in the current image according to the second 2D feature point in the current image and the pre-stored corresponding relation between the 2D feature point and the 3D point; wherein the first depth map of the current image comprises 3D points corresponding to the second 2D feature points in the current image.
In a possible embodiment, in terms of acquiring the first depth map of the current image from the current image, the acquisition unit is specifically configured to:
performing feature extraction on the current image to obtain a first 2D feature point of the current image; matching the first 2D feature point of the current image against the 2D feature points of the local map acquired from the server to obtain a third 2D feature point in the local map; acquiring a 3D point corresponding to the third 2D feature point in the local map according to the third 2D feature point in the local map and the correspondence between the 2D feature points and the 3D points in the local map; wherein the first depth map of the current image comprises the 3D points corresponding to the third 2D feature points in the local map.
In a feasible embodiment, in the aspect that the first 2D feature point of the current image is matched according to the 2D feature point of the local map to obtain a third 2D feature point in the local map, the obtaining unit is specifically configured to:
acquiring a target map from the local map according to the first pose, wherein the position of the target map in the local map is associated with the position indicated by the angle information in the first pose; the first pose is obtained by converting the pose obtained by the terminal equipment according to the current image into a pose under a world coordinate system, and the 2D feature points in the target map are matched with the first 2D feature points of the current image to obtain third 2D feature points of the target map, wherein the third 2D feature points of the local map comprise the third 2D feature points of the target map.
In a possible embodiment, the estimation unit is specifically configured to:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing feature extraction on the third depth map to obtain T second feature maps; the resolution ratio of each first feature map in the T first feature maps is different, and the resolution ratio of each second feature map in the T second feature maps is different; t is an integer greater than 1; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; carrying out up-sampling and fusion processing on the T third feature maps to obtain a target depth map of the current image; the third depth map is the first depth map or is obtained by splicing the first depth map and the second depth map.
In a possible embodiment, the estimation unit is specifically configured to:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing multi-scale feature extraction on the third depth map to obtain T second feature maps; performing multi-scale feature extraction on the reference depth map to obtain T fourth feature maps, wherein the resolution of each first feature map in the T first feature maps is different, the resolution of each second feature map in the T second feature maps is different, and the resolution of each fourth feature map in the T fourth feature maps is different; the reference depth map is obtained according to a depth map acquired by a TOF camera, and T is an integer greater than 1; superposing the first feature map, the second feature map and the fourth feature map with the same resolution in the T first feature maps, the T second feature maps and the T fourth feature maps to obtain T fifth feature maps; carrying out upsampling and fusion processing on the T fifth feature maps to obtain a target depth map of the current image; the third depth map is obtained by splicing the first depth map and the second depth map, or the third depth map is the first depth map.
In a possible embodiment, the reference depth map is acquired from an image acquired by a TOF camera, and specifically includes:
projecting the depth map acquired by the TOF camera into a three-dimensional space according to the pose of the current image to obtain a fourth depth map; back projecting the fourth depth map onto the reference image according to the pose of the reference image to obtain a reference depth map; the reference image is an image adjacent to the current image in acquisition time; the resolution ratio of the depth map acquired by the TOF camera is lower than the preset resolution ratio, and the frame rate of the TOF camera acquiring the depth map is lower than the preset frame rate.
In a possible embodiment, the upsampling and merging process includes:
S1: upsample the feature map P'_j to obtain a feature map P''_j, where the resolution of P''_j is the same as that of the (j+1)-th feature map P_{j+1} in the processing object; the width of P_{j+1} is (j+1) times the width of the feature map with the minimum resolution in the processing object; j is an integer greater than 0 and smaller than T, and T is the number of feature maps in the processing object;
S2: fuse the feature map P''_j with the feature map P_{j+1} to obtain a third feature map P'_{j+1};
S3: let j = j+1, and repeat S1-S3 until j = T-1;
wherein, when j = 1, the feature map P'_j is the feature map with the lowest resolution in the processing object, and when j = T-1, the third feature map P'_{j+1} is the result of the upsampling and fusion processing.
In a possible embodiment, the estimation unit is specifically configured to:
and inputting the current image and the third depth map into a depth estimation model of the current image for feature extraction, and obtaining a target depth map of the current image according to a feature extraction result, wherein the depth estimation model is realized on the basis of a convolutional neural network.
In a possible embodiment, the terminal device further includes:
a sending unit, configured to send a depth estimation model acquisition request to a server, where the depth estimation model acquisition request carries a current image and a location of the terminal device;
and the receiving unit is used for receiving a response message which is sent by the server and corresponds to the depth estimation model obtaining request, wherein the response message carries the depth estimation model of the current image, and the depth estimation model of the current image is obtained by the server according to the current image and the position of the terminal equipment in the world coordinate system.
In a possible embodiment, the terminal device further includes:
the training unit is used for training the initial convolutional neural network model to obtain a depth estimation model; wherein, the training unit is specifically configured to:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps; calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function; adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the current image; wherein the loss function is determined based on an error between the predicted depth map and the true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
In one possible embodiment, the overlay display unit is specifically configured to:
performing edge optimization on a target depth map of a current image to obtain an optimized depth map; and displaying the virtual object image and the current image in an overlapping manner according to the optimized depth map.
In one possible embodiment, the overlay display unit is specifically configured to:
segmenting the optimized depth map to obtain a foreground depth map and a background depth map of the current image, wherein the background depth map is a depth map containing a background region in the optimized depth map, the foreground depth map is a depth map containing a foreground region in the optimized depth map, and the optimized depth map is obtained by performing edge optimization on a target depth map of the current image; fusing the L background depth maps according to L positions corresponding to the L background depth maps respectively to obtain a fused three-dimensional scene; the L pieces of background depth maps comprise a background depth map of a prestored image and a background depth map of a current image, and the L positions and postures comprise positions and postures of the prestored image and the current image; l is an integer greater than 1; carrying out back projection on the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map; splicing the fused background depth map and the foreground depth map of the current image to obtain an updated depth map; and overlapping and displaying the virtual object image and the current image according to the updated depth map.
In a fourth aspect, an embodiment of the present application provides a server, including:
the device comprises a receiving unit, a processing unit and a processing unit, wherein the receiving unit is used for receiving a depth estimation model acquisition request sent by terminal equipment, and the depth estimation model acquisition request carries a current image acquired by the terminal equipment and the position of the terminal equipment;
an obtaining unit, configured to obtain a depth estimation model of a current image from a plurality of depth estimation models stored in a server according to a position of the current image;
and the sending unit is used for sending a response message responding to the depth estimation model obtaining request to the terminal equipment, wherein the response message carries the depth estimation model of the current image.
In a possible embodiment, the obtaining unit is specifically configured to:
acquiring a plurality of frames of first images according to the position of the terminal equipment, wherein the plurality of frames of first images are images in a preset range taking the position of the terminal equipment as the center in a basic map, and acquiring a target image from the plurality of frames of first images, wherein the target image is the image with the highest similarity with the current image in the plurality of frames of first images; and determining the depth estimation model corresponding to the target image as the depth estimation model of the current image.
In one possible embodiment, the server further comprises:
a training unit, configured to respectively train on a plurality of frames of first images to obtain a depth estimation model of each frame of first image in the plurality of frames of first images,
the method comprises the following steps of training each frame of first image in a plurality of frames of first images to obtain a depth estimation model of each first image:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps; calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function; adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the first image of each frame;
wherein the loss function is determined based on an error between the predicted depth map and the true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
In a possible embodiment, the obtaining unit is further configured to obtain an initial depth map of the current image according to the current image and a pre-stored image; obtaining a fifth depth map according to the current image and the 3D point corresponding to the local map;
the server further comprises:
and the optimization unit is used for optimizing the initial depth map and the fifth depth map according to the pose of the current image to obtain a second depth map.
In a possible embodiment, in obtaining the initial depth map of the current image from the current image and the pre-stored image, the obtaining unit is specifically configured to:
matching the first 2D characteristic point of the current image with a pre-stored 2D characteristic point of the image to obtain a sixth 2D characteristic point of the current image; removing noise points in the sixth 2D characteristic point of the current image to obtain a seventh 2D characteristic point of the current image; performing triangularization calculation on each 2D feature point in the seventh 2D feature point of the current image to obtain an initial 3D point of the seventh 2D feature point of the current image in space; the initial depth map of the current image includes an initial 3D point in space of a seventh 2D feature point of the current image.
In a possible embodiment, in terms of obtaining the fifth depth map from the 3D points corresponding to the current image and the local map, the obtaining unit is specifically configured to:
acquiring M maps from a multi-frame basic map, wherein the similarity between each map in the M maps and a current image is greater than a first preset threshold; m is an integer greater than 0; matching the 2D characteristic points of the M maps with the first 2D characteristic point of the current image to obtain a plurality of characteristic point matching pairs; each feature point matching pair in the multiple feature point matching pairs comprises a fourth 2D feature point and a fifth 2D feature point, the fourth 2D feature point and the fifth 2D feature point are mutually matched feature points, the fourth 2D feature point is a first 2D feature point of the current image, and the fifth 2D feature point is a 2D feature point in the M maps; acquiring a 3D point corresponding to each fourth 2D feature point in a plurality of feature point matching pairs according to the corresponding relation between each fifth 2D feature point and the 3D point in the M maps; and acquiring a fifth depth map according to the 3D point corresponding to the local map and the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs, wherein the fifth depth map comprises the 3D point which is matched with the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs in the 3D points corresponding to the local map.
In a fifth aspect, an embodiment of the present application provides a terminal device, including a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory; when the one or more processors execute the one or more programs, the terminal device is caused to implement part or all of the method according to the first aspect.
In a sixth aspect, an embodiment of the present application provides a server, including a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory; when the one or more processors execute the one or more programs, the server is caused to implement part or all of the method according to the second aspect.
In a seventh aspect, an embodiment of the present application provides a computer storage medium, which is characterized by comprising computer instructions, and when the computer instructions are executed on an electronic device, the electronic device is caused to perform part or all of the method according to the first aspect or the second aspect.
In an eighth aspect, the present application provides a computer program product, which when run on a computer, causes the computer to perform part or all of the method according to the first aspect or the second aspect.
It should be understood that any of the above possible implementations can be freely combined without violating the natural law, and details are not described in this application.
It should be appreciated that the description of technical features, solutions, benefits, or similar language in this application does not imply that all of the features and advantages may be realized in any single embodiment. Rather, it is to be understood that the description of a feature or advantage is intended to include the specific features, aspects or advantages in at least one embodiment. Therefore, the descriptions of technical features, technical solutions or advantages in the present specification do not necessarily refer to the same embodiment. Furthermore, the technical features, technical solutions and advantages described in the present embodiments may also be combined in any suitable manner. One skilled in the relevant art will recognize that an embodiment may be practiced without one or more of the specific features, aspects, or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments.
Drawings
FIG. 1a is a schematic diagram illustrating the effect without virtual-real occlusion;
FIG. 1b is a schematic diagram illustrating the effect with virtual-real occlusion;
FIG. 1c is a system architecture diagram according to an embodiment of the present application;
FIG. 1d is a schematic structural diagram of a CNN;
fig. 1e is a schematic diagram of a chip hardware structure according to an embodiment of the present disclosure;
FIG. 1f provides another system architecture diagram according to an embodiment of the present application;
fig. 2 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the relationship of a base map, a local map and a processed local map;
fig. 5 is a schematic flowchart of another image processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an effect of virtual-real occlusion according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of another terminal device provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a system according to an embodiment of the present application;
FIG. 10 is a schematic diagram of another system configuration provided by embodiments of the present application;
fig. 11 is a schematic structural diagram of another terminal device provided in an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of another terminal device according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of another server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings.
In the following, the terms "first", "second", and the like are used in some instances for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the application, unless stated otherwise, "plurality" means two or more.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ and an intercept 1 as inputs, and the output of the operation unit may be:

$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$

where $s = 1, 2, \dots, n$, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, which performs a nonlinear transformation on the features acquired in the neural network and converts the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of that local receptive field; the local receptive field may be a region composed of several neural units.
(2) Deep neural network
Deep Neural Networks (DNNs) are understood to be neural networks having many hidden layers, where "many" have no particular metric, and often the multilayer neural networks and the deep neural networks are essentially the same thing. From the division of DNNs by the location of different layers, neural networks inside DNNs can be divided into three categories: input layer, hidden layer, output layer. Generally, the first layer is an input layer, the last layer is an output layer, and the middle layers are hidden layers. The layers are all connected, that is, any neuron of the ith layer is necessarily connected with any neuron of the (i + 1) th layer. Although DNN appears complex, it is not really complex in terms of the work of each layer, simply the following linear relational expression:
Figure BDA0002676341880000112
wherein
Figure BDA0002676341880000113
Is the input vector of the input vector,
Figure BDA0002676341880000114
is the output vector of the output vector,
Figure BDA0002676341880000115
is an offset vector, W is a weight matrix (also called coefficient), and α () isThe function is activated. Each layer is only for the input vector
Figure BDA0002676341880000116
Obtaining the output vector through such simple operation
Figure BDA0002676341880000117
Because a DNN has many layers, the number of coefficients W and offset vectors \vec{b} is large. How are these parameters defined in the DNN? First, consider the definition of the coefficient W. Taking a three-layer DNN as an example, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}. The superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron at layer L-1 to the j-th neuron at layer L is defined as W^L_{jk}. Note that the input layer has no W parameters. In a deep neural network, more hidden layers allow the network to better depict complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks.
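Purely as an illustrative sketch (not taken from the patent), the per-layer computation y = α(Wx + b) of a small fully connected DNN could look like the following; the layer sizes and the ReLU activation are assumptions.

```python
import numpy as np

def relu(z):
    # Element-wise ReLU activation
    return np.maximum(0.0, z)

def dnn_forward(x, weights, biases):
    # Applies y = alpha(W x + b) layer by layer. weights[l][j, k] plays the role of
    # the coefficient from neuron k of the previous layer to neuron j of this layer.
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)
    return a

rng = np.random.default_rng(0)
layer_sizes = [8, 16, 16, 4]   # input layer, two hidden layers, output layer (assumed)
weights = [rng.standard_normal((m, n)) * 0.1
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]
print(dnn_forward(rng.standard_normal(8), weights, biases))
```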
(3) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers, and this feature extractor can be regarded as a filter. The convolutional layer is a neuron layer that performs convolution processing on an input signal in the convolutional neural network. In a convolutional layer of a convolutional neural network, one neuron may be connected to only some of the neurons of the neighboring layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of a number of neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as meaning that the way features are extracted is independent of location. The convolution kernel may be initialized in the form of a matrix of random size, and reasonable weights can be obtained for the convolution kernel by learning during the training of the convolutional neural network. In addition, the direct benefit of sharing weights is that the connections between the layers of the convolutional neural network are reduced, while the risk of overfitting is also reduced.
(4) Loss function
In the process of training a deep neural network, because the output of the network is expected to be as close as possible to the value that is really desired to be predicted, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
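A minimal sketch of this idea (the linear model, MSE loss and learning rate are assumptions, not the patent's depth network): predictions are compared with the desired targets through a loss function, and the weights are adjusted to reduce that loss.

```python
import torch

model = torch.nn.Linear(4, 1)                    # stand-in network (assumed)
loss_fn = torch.nn.MSELoss()                     # loss function measuring the difference
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 4)                           # a batch of inputs
target = torch.randn(32, 1)                      # values we really want to predict

for step in range(100):
    pred = model(x)                              # predicted value of the current network
    loss = loss_fn(pred, target)                 # difference between prediction and target
    optimizer.zero_grad()
    loss.backward()                              # propagate the error backwards
    optimizer.step()                             # adjust weights to reduce the loss
```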
The system architecture provided by the embodiments of the present application is described below.
Referring to fig. 1c, the present application provides a system architecture. As shown in the system architecture, a data collection device 160 is used to collect training data. For example, the training data in the embodiment of the present application may include: an image sample, a depth map sample and a true depth map; after the training data is collected, the data collection device 160 stores the training data in the database 130, and the training device 120 trains the depth estimation model 101 based on the training data maintained in the database 130.
The following describes how the training device 120 derives the depth estimation model 101 based on the training data. Illustratively, the training device 120 processes the image samples and the depth map samples, calculates a loss value according to the output predicted depth map, the real depth map, and the loss function, and iterates until the calculated loss value converges, thereby completing the training of the depth estimation model 101.
The depth estimation model 101 can be used to implement the image processing method provided by the embodiments of the present application; that is, after relevant preprocessing, the current image, the first depth map and the second depth map are input into the depth estimation model 101 to obtain the target depth map of the current image. The depth estimation model 101 in the embodiments of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 130 does not necessarily all come from the collection of the data collection device 160, and may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the depth estimation model 101 based on the training data maintained in the database 130, and may also obtain training data from the cloud or elsewhere for model training.
The depth estimation model 101 obtained by training with the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 1c, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or a cloud. In fig. 1c, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140. In the embodiments of the present application, the input data may include: the current image, or the current image and the first depth map.
In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the target depth map of the current image obtained as described above, to the client device 140, thereby providing it to the user.
It should be noted that the training device 120 may generate the corresponding depth estimation model 101 based on different training data for different targets or different tasks, and the corresponding depth estimation model 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 1c, the user may manually specify the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112; if the client device 140 is required to obtain the user's authorization before automatically sending the input data, the user may set the corresponding permission in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form may be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting the input data of the I/O interface 112 and the output results of the I/O interface 112 as new sample data and storing them in the database 130. Of course, the collection by the client device 140 may also be skipped: the input data input to the I/O interface 112 and the output results output from the I/O interface 112, as shown in the figure, may be directly stored in the database 130 as new sample data by the I/O interface 112.
It should be noted that fig. 1c is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1c, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 1c, the depth estimation model 101 is obtained by training with the training device 120. In this embodiment of the application, the depth estimation model 101 may be a neural network; specifically, the neural network in the present application may include a CNN, a deep convolutional neural network (DCNN), and the like.
Since CNN is a common neural network, the structure of CNN will be described in detail below with reference to fig. 1 d. As described in the introduction of the basic concept, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, and the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in fig. 1d, the convolutional neural network (CNN) may include an input layer 11, a convolutional/pooling layer 12 (where the pooling layer is optional), a neural network layer 13, and an output layer 14.
Convolutional/pooling layer 12:
Convolutional layer:
the convolutional layer/pooling layer 12 shown in FIG. 1d may comprise layers as in examples 121 and 126, for example: in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
The internal operation of a convolutional layer will be described below by taking convolutional layer 121 as an example.
The convolutional layer 121 may include many convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. The convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix is usually slid over the input image in the horizontal direction pixel by pixel (or two pixels by two pixels, and so on, depending on the value of the stride), so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends over the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices may be used to extract different features from the image, for example, one weight matrix to extract image edge information, another weight matrix to extract a particular color of the image, and yet another weight matrix to blur unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), the feature maps extracted by these weight matrices also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
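As an illustrative sketch only (channel counts, kernel size and stride are assumptions), applying several same-size weight matrices to an input image and stacking their outputs to form the depth dimension of the convolved output might look like this:

```python
import torch

conv = torch.nn.Conv2d(in_channels=3,    # depth dimension of the input image (e.g. RGB)
                       out_channels=16,  # number of weight matrices (convolution kernels)
                       kernel_size=3,
                       stride=1,         # move the weight matrix one pixel at a time
                       padding=1)

image = torch.randn(1, 3, 224, 224)      # a batch containing one input image
features = conv(image)                   # each kernel extracts one feature map
print(features.shape)                    # torch.Size([1, 16, 224, 224])
```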
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 10 can make correct prediction.
When the convolutional neural network 10 has multiple convolutional layers, the initial convolutional layers (e.g., 121) tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 10 increases, the features extracted by the later convolutional layers (e.g., 126) become more complex, for example features with high-level semantics; features with higher-level semantics are more suitable for the problem to be solved.
A pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer is often periodically introduced after a convolutional layer. In the layers 121 to 126 illustrated at 12 in fig. 1d, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator, which are used to sample the input image to obtain an image of smaller size. The average pooling operator may calculate the average of the pixel values within a certain range of the image as the result of the average pooling. The max pooling operator may take the pixel with the largest value within a particular range as the result of the max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or maximum value of the corresponding sub-region of the image input to the pooling layer.
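A small, non-authoritative sketch (sizes assumed) of the average pooling and max pooling operators reducing the spatial size of an input:

```python
import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)  # 4 x 4 input "image"
print(F.avg_pool2d(x, kernel_size=2))   # each output pixel is the average of a 2 x 2 region
print(F.max_pool2d(x, kernel_size=2))   # each output pixel is the maximum of a 2 x 2 region
```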
The neural network layer 13:
After processing by the convolutional/pooling layer 12, the convolutional neural network 10 is still not sufficient to output the required output information. As mentioned above, the convolutional layer/pooling layer 12 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 10 needs the neural network layer 13 to generate one output or a set of outputs whose number equals the number of required classes. Therefore, the neural network layer 13 may include multiple hidden layers (131, 132 to 13n shown in fig. 1d), and the parameters contained in these hidden layers may be pre-trained according to the relevant training data of a specific task type; for example, the task types may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 13, that is, as the last layer of the whole convolutional neural network 10, comes the output layer 14. The output layer 14 has a loss function similar to the categorical cross-entropy and is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 10 is completed (in fig. 1d, propagation in the direction from 11 to 14 is forward propagation), the backward propagation (in fig. 1d, propagation in the direction from 14 to 11 is backward propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 10, that is, the error between the result output by the convolutional neural network 10 through the output layer and the ideal result.
It should be noted that the convolutional neural network 10 shown in fig. 1d is only an example of a convolutional neural network; in specific applications, the convolutional neural network may also take the form of other network models, for example one that includes only part of the network structure shown in fig. 1d. For instance, the convolutional neural network employed in the embodiments of the present application may include only the input layer 11, the convolutional layer/pooling layer 12 and the output layer 14.
A hardware structure of a chip provided in an embodiment of the present application is described below.
Fig. 1f shows a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 30. The chip may be provided in the execution device 110 shown in fig. 1c to perform the calculation work of the calculation module 111. The chip may also be provided in the training device 120 shown in fig. 1c to complete the training work of the training device 120 and output the depth estimation model 101. The algorithm of each layer in the convolutional neural network shown in fig. 1d can be implemented in the chip shown in fig. 1f. The image processing method and the training method of the depth estimation model in the embodiments of the present application can likewise be implemented in the chip shown in fig. 1f.
The neural network processor 30 may be any processor suitable for large-scale exclusive-or operation processing, such as a neural-network processing unit (NPU), a tensor processing unit (TPU), or a graphics processing unit (GPU). Taking the NPU as an example: the neural network processor NPU 30 is mounted as a coprocessor on a host central processing unit (CPU) (host CPU), and tasks are distributed by the host CPU. The core part of the NPU is the arithmetic circuit 303, and the controller 304 controls the arithmetic circuit 303 to extract data from a memory (the weight memory or the input memory) and perform operations. The TPU is an application-specific integrated circuit fully customized by Google for machine learning as an artificial-intelligence accelerator.
In some implementations, the arithmetic circuitry 303 includes a plurality of processing units (PEs) internally. In some implementations, the operational circuitry 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 303 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 303 fetches the weight data of the matrix B from the weight memory 302 and buffers on each PE in the arithmetic circuit 303. The arithmetic circuit 303 acquires input data of the matrix a from the input memory 301, performs matrix arithmetic on the input data of the matrix a and weight data of the matrix B, and stores a partial result or a final result of the obtained matrix in an accumulator (accumulator) 308.
The vector calculation unit 307 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 307 may be used for network calculation of a non-convolution/non-FC layer in a neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 307 can store the processed output vector to the unified buffer 306. For example, the vector calculation unit 307 may apply a non-linear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 307 generates normalized values, combined values, or both. In some implementations, the vector calculation unit 307 stores the processed vectors to the unified memory 306. In some implementations, the vectors processed by the vector calculation unit 307 can be used as activation inputs for the arithmetic circuitry 303, e.g., for use in subsequent layers in a neural network, as shown in fig. 1d, if the current processing layer is the hidden layer 1(131), then the vectors processed by the vector calculation unit 307 can also be used for calculations in the hidden layer 2 (132).
The unified memory 306 is used to store input data as well as output data.
The weight data is directly transferred to the weight memory 302 through a direct memory access controller (DMAC) 305. The input data is also transferred to the unified memory 306 through the DMAC.
A bus interface unit (BIU) 310 is used for the interaction between the DMAC and the instruction fetch buffer 309; the bus interface unit 310 is also used by the instruction fetch buffer 309 to obtain instructions from the external memory; the bus interface unit 310 is further used by the memory access controller 305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC 305 is mainly used to store input data in the external memory DDR into the unified memory 306, or to store weight data into the weight memory 302, or to store input data into the input memory 301.
An instruction fetch memory 309 coupled to the controller 304 for storing instructions used by the controller 304;
and the controller 304 is configured to call the instruction cached in the instruction fetch memory 309, so as to control the working process of the operation accelerator.
Generally, the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are On-Chip (On-Chip) memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
The operation of each layer in the convolutional neural network shown in fig. 1d can be performed by the operation circuit 303 or the vector calculation unit 307. For example, the training method of the depth estimation model and the related method of determining the target depth map in the embodiment of the present application may be performed by the arithmetic circuit 303 or the vector calculation unit 307.
As shown in fig. 1e, the present application provides another system architecture. The system architecture comprises a local device 401, a local device 402, and the execution device 110 and the data storage system 150 shown in fig. 1c, wherein the local device 401 and the local device 402 are connected with the execution device 110 through a communication network.
The execution device 110 may be implemented by one or more servers. Optionally, the execution device 110 may cooperate with other computing devices, such as data storage devices, routers, and load balancers. The execution device 110 may be disposed at one physical site or distributed across multiple physical sites. The execution device 110 may use the data in the data storage system 150 or call the program code in the data storage system 150 to implement the training method of the depth estimation model of the embodiments of the present application.
Specifically, in one implementation, the execution device 110 may perform the following processes:
inputting a plurality of depth map samples corresponding to a plurality of image samples into an initial convolutional neural network for processing to obtain a plurality of first predicted depth maps; calculating a first loss value according to the plurality of first predicted depth maps, the real depth maps corresponding to the plurality of image samples, and the loss function; adjusting the parameters in the initial convolutional neural network according to the first loss value to obtain a first convolutional neural network; inputting the plurality of depth map samples corresponding to the plurality of image samples into the first convolutional neural network for processing to obtain a plurality of second predicted depth maps; obtaining a second loss value according to the plurality of second predicted depth maps, the real depth maps corresponding to the plurality of image samples, and the loss function; judging whether the second loss value converges; if it converges, determining the first convolutional neural network as the depth estimation model of the current image; if not, adjusting the parameters in the first convolutional neural network according to the second loss value to obtain a second convolutional neural network, repeating the above process until the obtained loss value converges, and determining the convolutional neural network at the time the loss value converges as the depth estimation model of the current image.
The process described above enables the performing device 110 to obtain a depth estimation model that can be used to obtain a target depth map for the current image.
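The following is a rough sketch of the procedure described above, under stated assumptions (the optimizer, learning rate, convergence tolerance, data layout and the two-input model signature are illustrative choices, not specified by the patent):

```python
import torch

def train_depth_model(model, loader, loss_fn, lr=1e-4, tol=1e-4, max_rounds=100):
    # Repeatedly feed image / depth-map samples through the network, compute the loss
    # against the ground-truth depth maps, adjust the parameters, and stop once the
    # loss value converges (changes by less than `tol` between rounds).
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = None
    for _ in range(max_rounds):
        total = 0.0
        for image, sparse_depth, gt_depth in loader:
            pred = model(image, sparse_depth)      # predicted depth map
            loss = loss_fn(pred, gt_depth)         # compare with the real depth map
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if prev_loss is not None and abs(prev_loss - total) < tol:
            break                                  # loss value has converged
        prev_loss = total
    return model
```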
The user may operate respective user devices (e.g., local device 401 and local device 402) to interact with the execution device 110. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, and so forth.
The local device of each user may interact with the execution device 110 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, or any combination thereof.
In one implementation, the local devices 401 and 402 acquire the depth estimation model from the execution device 110, deploy the depth estimation model on the local devices 401 and 402, and perform depth estimation using the depth estimation model.
In another implementation, the depth estimation model may be deployed directly on the execution device 110; the execution device 110 obtains the current image and the first depth map from the local device 401 and the local device 402, and performs depth estimation on the current image and the first depth map by using the depth estimation model to obtain the target depth map of the current image.
The execution device 110 may also be a cloud device, and at this time, the execution device 110 may be deployed in a cloud; alternatively, the execution device 110 may also be a terminal device, in which case, the execution device 110 may be deployed at a user terminal side, which is not limited in this embodiment of the application.
The following explains an application scenario of the present application. As shown in fig. 2, the application scenario includes the terminal device 100 and the server 200.
The terminal device 100 may be a smart phone, a tablet computer, AR glasses, or other smart devices.
The server 200 may be a desktop server, a rack server, a blade server, or other type of server.
The terminal device 100 acquires a current image and a virtual object image, and acquires a first depth map and a second depth map of the current image according to the current image, wherein the second depth map is acquired by the terminal device 100 from the server 200 as shown in fig. 2; performing depth estimation according to the current image, the first depth map and the second depth map to obtain a target depth map of the current image; and displaying the virtual object image and the current image in an overlapping manner according to the target depth map of the current image, thereby realizing virtual occlusion.
How the terminal device 100 and the server 200 implement virtual-real occlusion is described in detail below.
Referring to fig. 3, fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure. As shown in fig. 3, the method includes:
s301, acquiring a current image and a virtual object image, and acquiring a first depth map of the current image according to the current image.
Alternatively, the current image may be an RGB image, a grayscale image, or other form of image.
The current image is obtained from a camera of the terminal device in real time, or obtained from an image stored in the terminal device, or obtained from another device, which is not limited herein.
Optionally, the virtual object image may be obtained by projecting a three-dimensional model of the virtual object by a renderer in the terminal device; or obtained from other devices.
Optionally, obtaining a first depth map of the current image according to the current image includes:
performing feature extraction on the current image to obtain a first 2D feature point of the current image; matching the first 2D characteristic point of the current image with a pre-stored 2D characteristic point to obtain a second 2D characteristic point in the current image; acquiring a 3D point corresponding to a second 2D feature point in the current image according to the second 2D feature point in the current image and a pre-stored corresponding relation between the 2D feature point and the 3D point; wherein the first depth map of the current image comprises 3D points corresponding to the second 2D feature points in the current image.
The feature point refers to a point where the image gray value changes drastically or a point where the curvature is large on the edge of the image (i.e., the intersection of two edges).
The correspondence relationship between the pre-stored 2D feature points and the 3D points indicates that there is a 3D point corresponding to each of the pre-stored 2D feature points.
The 2D feature point matching referred to in the present application specifically means that the similarity of two matched 2D feature points is higher than a preset similarity.
It should be noted that, after obtaining the 3D point corresponding to the second 2D feature point in the current image, the 3D point is projected onto the two-dimensional plane to obtain the first depth map, where the first depth map includes information of the 3D point corresponding to the second 2D feature point in the current image.
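As a hedged illustration of this step (the ORB features, the ratio test, and the assumption that the stored 3D points are already expressed in the current camera frame are all choices made for this sketch, not requirements of the patent), a sparse first depth map could be built roughly as follows:

```python
import numpy as np
import cv2

def sparse_depth_from_matches(current_gray, stored_descriptors, stored_points_3d,
                              K, height, width, ratio=0.75):
    # 1) Extract 2D feature points and descriptors from the current image.
    orb = cv2.ORB_create(nfeatures=2000)
    keypoints, descriptors = orb.detectAndCompute(current_gray, None)

    # 2) Match them against the pre-stored 2D feature descriptors (similarity test).
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(descriptors, stored_descriptors, k=2)
    good = [m[0] for m in knn if len(m) == 2 and m[0].distance < ratio * m[1].distance]

    # 3) Look up the 3D point of each matched stored feature and project it onto the
    #    image plane to obtain a sparse depth map.
    depth = np.zeros((height, width), dtype=np.float32)
    for m in good:
        X = stored_points_3d[m.trainIdx]          # corresponding 3D point
        u, v, w = K @ X                           # pinhole projection
        if w <= 0:
            continue
        col, row = int(u / w), int(v / w)
        if 0 <= col < width and 0 <= row < height:
            depth[row, col] = w                   # keep the depth (Z) at that pixel
    return depth
```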
The local map is acquired from the basic map by the server according to the pose of the terminal equipment.
The pre-stored 2D feature points may be 2D feature points in at least one historical image. The at least one historical image includes one or more images whose timestamps differ only slightly from the timestamp of the current image, or one or more images whose timestamps are before the timestamp of the current image. Further, the plurality of images may be consecutive frame images or non-consecutive frame images.
Optionally, obtaining a first depth map of the current image according to the current image includes:
performing feature extraction on the current image to obtain a first 2D feature point of the current image; matching the first 2D feature point of the current image according to the 2D feature point of the local map to obtain a third 2D feature point in the local map; the local map is obtained by the server according to the current image and is obtained from the server; acquiring a 3D point corresponding to a third 2D feature point in the local map according to the third 2D feature point in the local map and the corresponding relation between the 2D feature point and the 3D point in the local map; wherein the first depth map of the current image comprises 3D points corresponding to the third 2D feature points in the local map.
And after a 3D point corresponding to the third 2D characteristic point in the local map is obtained, projecting the 3D point onto a two-dimensional plane to obtain the first depth map, wherein the first depth map comprises 3D point information corresponding to the third 2D characteristic point in the local map.
In an optional embodiment, the terminal device sends a local map acquisition request to the server, where the local map acquisition request carries the current image; the local map is a map containing information about the objects around the location point where the terminal device is located, and the basic map is a map containing the 3D point cloud information, the 2D feature points and the feature descriptors of those objects. After receiving the local map acquisition request, the server calculates the position and angle information of the current image in the global map based on a visual positioning system (VPS); the position and angle information are collectively called pose information, and this pose is a pose in the world coordinate system. The server then acquires a local map from the basic map according to the pose information, where the local map is the area of the basic map centered on that position and within a certain range around it (for example, a radius of 50 meters). Since the basic map contains the 2D feature points of the objects, the feature descriptors corresponding to those feature points, and the 3D point cloud information, the local map also contains the 2D feature points of the objects, the corresponding feature descriptors, and the 3D point cloud information. The server sends a response message responding to the local map acquisition request to the terminal device, where the response message carries the local map.
The terminal equipment extracts the features of the current image to obtain a first 2D feature point of the current image and a feature descriptor corresponding to the first 2D feature point; and matching the first 2D feature point of the current image according to the 2D feature point in the local map to obtain a third 2D feature point in the local map, and obtaining a 3D point corresponding to the third 2D feature point in the local map according to the corresponding relation between the 2D feature point and the 3D point in the local map, wherein the first depth map of the current image comprises the 3D point corresponding to the third 2D feature point in the local map.
In an optional embodiment, matching the first 2D feature point of the current image according to the 2D feature point of the local map to obtain a third 2D feature point in the local map includes:
acquiring a target map from the local map according to the first pose, wherein the position of the target map in the local map is associated with the position indicated by the angle information in the first pose; the first pose is obtained by converting the pose obtained by the terminal equipment according to the current image into a pose under a world coordinate system, and the 2D feature points in the target map are matched with the first 2D feature points of the current image to obtain third 2D feature points of the target map, wherein the third 2D feature points of the local map comprise the third 2D feature points of the target map.
The world coordinate system is the absolute coordinate system of the system; before a user coordinate system is established, the coordinates of all points in the picture are determined with respect to the origin of this coordinate system.
In an optional embodiment, to improve the matching efficiency, the terminal device calculates a pose of the current image in a preset coordinate system by using a simultaneous localization and mapping (SLAM) system, where the preset coordinate system is a coordinate system with a current position of the terminal device as an origin, and performs alignment transformation on the pose of the current image and a pose acquired by the server based on the VPS to obtain a pose of the terminal device in a world coordinate system, where the pose is the first pose; and processing the local map according to the pose of the terminal equipment in the world coordinate system to obtain a target map, wherein the position of the target map in the local map is consistent with the position indicated by the angle information in the pose of the terminal equipment in the world coordinate system.
As shown in fig. 4, the square area is a basic map, the circular area is a local map, and the central point of the circular area is a coordinate point of the terminal device in the pose under the world coordinate system; the sector area with the angle range of [ -45 degrees, 45 degrees ] is the processed local map, wherein the angle range of [ -45 degrees, 45 degrees ] is the angle information of the terminal device in the pose under the world coordinate system.
And then matching the first 2D feature point of the current image with the 2D feature point in the target map to obtain a third 2D feature point of the target map, and acquiring a 3D point cloud corresponding to the third 2D feature point in the target map according to the corresponding relation between the 2D feature point and the 3D point cloud information in the target map, wherein the first depth map of the current image comprises the 3D point cloud corresponding to the third 2D feature point in the target map.
It should be noted that, because the basic map in the server is acquired offline, the basic map in the server exists in a different place from the current actual environment, for example, a large billboard in a shopping mall exists during offline acquisition, and after a period of time, when a user acquires a current image, the billboard is removed, which results in that the map sent by the server exists in 3D point cloud information inconsistent with the current environment. In addition, the image received by the server may be an image subjected to privacy processing, and may also result in 3D point cloud information that is inconsistent with the current environment and exists in a map sent by the server.
Based on the reason, the server updates the issued local map to obtain a second depth map.
Specifically, the server performs feature extraction on the current image to obtain the 2D feature points of the current image. The methods for extracting features from the current image include, but are not limited to, the scale-invariant feature transform (SIFT) method, the Oriented FAST and Rotated BRIEF (ORB) method, the speeded up robust features (SURF) method, and the SuperPoint method, where FAST stands for Features from Accelerated Segment Test and BRIEF stands for Binary Robust Independent Elementary Features. The server matches the first 2D feature points in the current image with the pre-stored 2D feature points to obtain sixth 2D feature points in the current image. Optionally, the pre-stored 2D feature points are 2D feature points in N historical images, where the N historical images are acquired by the terminal device, the timestamps of the N images are before the timestamp of the current image and are close to the timestamp of the current image, and N is an integer greater than 0. Then, the server eliminates the noise points among the sixth 2D feature points of the current image; specifically, the verification value of each sixth 2D feature point in the current image is calculated through the homography matrix, the fundamental matrix and the essential matrix, and if the verification value of a sixth 2D feature point is lower than a second preset threshold, that sixth 2D feature point is determined to be a noise point and is deleted, so as to obtain the seventh 2D feature points of the current image.
The server performs triangularization calculation on the seventh 2D feature point of the current image to obtain an initial position of the seventh 2D feature point in the current image in space, and further obtains an initial depth map corresponding to the seventh 2D feature point in the current image.
The server obtains M images from the multi-frame basic map by means of image retrieval, where the similarity between each of the M images and the current image is greater than a first preset threshold and M is an integer greater than 0; the image retrieval methods include, but are not limited to, the bag-of-words tree method and the deep-learning-based NetVLAD method. The server matches the first 2D feature points in the current image with the 2D feature points in the M images to obtain a plurality of matched feature point pairs, where each matched feature point pair includes a fourth 2D feature point and a fifth 2D feature point that match each other, the fourth 2D feature point being a first 2D feature point in the current image and the fifth 2D feature point being a feature point in the M images. Since each 2D feature point in the M images corresponds to one 3D point, the 3D points corresponding to the fourth 2D feature points of the current image are determined from the correspondence between the fifth 2D feature points and the 3D points in the M images and the plurality of matched feature point pairs. A fifth depth map is then obtained from the 3D points corresponding to the local map, where the fifth depth map includes the 3D points, among those corresponding to the local map, that match the 3D points corresponding to the fourth 2D feature points in the plurality of matched feature point pairs.
Combining the initial depth map and the fifth depth map with the initial pose information of the current image and of each pre-stored image, the position information of the point cloud is optimized by bundle adjustment (BA); the reprojection error of each 3D point in the optimized point cloud is calculated, and the 3D points whose reprojection error exceeds the error threshold are deleted. The steps of BA optimization and point-cloud deletion are repeated multiple times to finally obtain an optimized high-precision depth map, which is the second depth map, so that the basic map in the server is updated.
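A simplified, non-authoritative sketch of the "optimize, then delete large-reprojection-error points" loop (the `refine` stand-in for the bundle adjustment step, the threshold and the number of rounds are assumptions of this example):

```python
import numpy as np

def reprojection_errors(points_3d, observations, K, R, t):
    # Project the 3D points with camera pose (R, t) and intrinsics K, and return the
    # pixel distance of each projection to the observed 2D feature location.
    cam = R @ points_3d.T + t.reshape(3, 1)       # 3 x N points in camera coordinates
    proj = K @ cam                                # pinhole projection
    uv = (proj[:2] / proj[2]).T                   # N x 2 projected pixel positions
    return np.linalg.norm(uv - observations, axis=1)

def prune_point_cloud(points_3d, observations, K, R, t,
                      error_threshold=2.0, rounds=3,
                      refine=lambda p, o: (p, o)):
    # Alternate between refining the point cloud (the bundle-adjustment step is only a
    # placeholder here) and deleting 3D points whose reprojection error is too large.
    for _ in range(rounds):
        points_3d, observations = refine(points_3d, observations)
        err = reprojection_errors(points_3d, observations, K, R, t)
        keep = err <= error_threshold
        points_3d, observations = points_3d[keep], observations[keep]
    return points_3d
```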
By the method, the basic map stored in the server can be continuously updated by the image uploaded by the terminal equipment, so that the basic map is consistent with the image content acquired by the terminal equipment, and high-precision 3D point cloud information is provided during depth estimation. In addition, the content in the basic map is consistent with the content in the image acquired by the terminal equipment in the above mode, and the more images are uploaded by the terminal equipment, the more thorough the update of the map is, so that the more accurate the depth estimation result is.
S302, determining a target depth map of the current image according to the current image and the first depth map of the current image.
In an alternative embodiment, determining the target depth map of the current image according to the current image and the first depth map of the current image includes:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing multi-scale feature extraction on the third depth map to obtain T second feature maps; the resolution of each first feature map in the T first feature maps is different, the resolution of each second feature map in the T second feature maps is different, and T is an integer greater than 1; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; and performing upsampling and fusion processing on the T third feature maps to obtain a target depth map of the current image, wherein the third depth map is the first depth map, or the third depth map is obtained by splicing the first depth map and the second depth map.
Wherein, for any one first feature map in the T first feature maps, a second feature map with the resolution which is only the same as that of the first feature map exists in the T second feature maps.
The multi-scale feature extraction specifically refers to an operation of convolving an image by using a plurality of different convolution kernels.
The term "superimposing" in this application specifically refers to processing the superimposed images at the pixel level, for example, two superimposed images include size H × W, and the size of the superimposed image is H × 2W or 2H × W; for another example, the three superimposed images include a size H × W, and the superimposed images have a size H × 3W, or 3H × W.
In an alternative embodiment, determining the target depth map of the current image according to the current image and the first depth map of the current image includes:
performing multi-scale feature extraction on a current image to obtain T first feature maps, wherein the resolution of each first feature map in the T first feature maps is different; performing multi-scale feature extraction on the third depth map to obtain T second feature maps, wherein the resolution of each second feature map in the T second feature maps is different; performing multi-scale feature processing on the reference depth map to obtain T fourth feature maps, wherein the resolution of each fourth feature map in the T fourth feature maps is different; the reference depth map is obtained according to a depth map acquired by a TOF camera, and T is an integer larger than 1; superposing the first feature map, the second feature map and the fourth feature map with the same resolution in the T first feature maps, the T second feature maps and the T fourth feature maps to obtain T fifth feature maps;
and performing upsampling and fusion processing on the T fifth feature maps to obtain a target depth map of the current image, wherein the third depth map is the first depth map, or the third depth map is obtained by splicing the first depth map and the second depth map.
Optionally, the reference depth map is a depth map acquired by the TOF camera.
Optionally, projecting the obtained depth map acquired by the TOF camera into a three-dimensional space according to the pose of the current image to obtain a fourth depth map; back projecting the fourth depth map onto the reference image according to the pose of the reference image to obtain a reference depth map; the reference image is an image adjacent to the current image in the acquisition time, the resolution of a depth map acquired by the TOF camera is lower than a preset resolution, and the frame rate of the depth map acquired by the TOF camera is lower than a preset frame rate.
Optionally, the preset frame rate may be 1fps, 2fps, 5fps or other frame rates, and the preset resolution may be 240 × 180, 120 × 90, 60 × 45, 20 × 15 or other resolutions.
Optionally, the TOF acquires a depth map with a resolution of 20 × 15 at a frame rate of 1 fps.
Optionally, the third depth map is the first depth map of the current image, or the third depth map is obtained by stitching the first depth map and the second depth map of the current image.
Specifically, the upsampling and fusion processing includes the following steps:
S1: upsample the feature map P'_j to obtain a feature map P''_j, where the resolution of P''_j is the same as that of the (j+1)-th feature map P_{j+1} in the processing object, the width of the feature map P_{j+1} is (j+1) times the width of the lowest-resolution feature map in the processing object, j is an integer greater than 0 and smaller than T, and T is the number of feature maps in the processing object;
S2: fuse the feature map P''_j with the feature map P_{j+1} to obtain a third feature map P'_{j+1};
S3: set j = j+1, and repeat S1 to S3 until j = T-1;
where, when j = 1, the feature map P'_j is the lowest-resolution feature map in the processing object, and when j = T-1, the feature map P'_{j+1} is the result of the upsampling and fusion processing.
Wherein the processing object includes the T third feature maps or the T fifth feature maps.
For example, if there are 5 third feature maps, namely feature maps P_1, P_2, P_3, P_4 and P_5 with successively increasing resolution, then: the feature map P_1 is upsampled to obtain a feature map P''_1 with the same resolution as P_2, and P''_1 is fused with P_2 to obtain a feature map P'_2; P'_2 is upsampled to obtain a feature map P''_2 with the same resolution as P_3, and P''_2 is fused with P_3 to obtain a feature map P'_3; P'_3 is upsampled to obtain a feature map P''_3 with the same resolution as P_4, and P''_3 is fused with P_4 to obtain a feature map P'_4; P'_4 is upsampled to obtain a feature map P''_4 with the same resolution as P_5, and P''_4 is fused with P_5 to obtain the target depth map of the current image.
Optionally, the upsampling is deconvolution upsampling.
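Under stated assumptions (bilinear interpolation stands in for the deconvolution upsampling, and fusion is taken to be element-wise addition; neither choice is dictated by the patent), the S1 to S3 loop can be sketched as:

```python
import torch
import torch.nn.functional as F

def upsample_and_fuse(feature_maps):
    # feature_maps are ordered from lowest to highest resolution (P_1 ... P_T).
    # The running result is repeatedly upsampled to the next resolution and fused
    # with the next feature map; the final fusion is returned.
    fused = feature_maps[0]                               # P'_1
    for nxt in feature_maps[1:]:                          # P_{j+1}
        up = F.interpolate(fused, size=nxt.shape[-2:],    # P''_j, same resolution
                           mode="bilinear", align_corners=False)
        fused = up + nxt                                  # fuse to obtain P'_{j+1}
    return fused

maps = [torch.randn(1, 8, 16 * (i + 1), 16 * (i + 1)) for i in range(5)]
print(upsample_and_fuse(maps).shape)                      # torch.Size([1, 8, 80, 80])
```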
In an alternative embodiment, determining the target depth map of the current image according to the current image and the first depth map of the current image includes:
the first method is as follows: inputting the current image and the first depth map of the current image into a depth estimation model for feature extraction, and obtaining a target depth map of the current image according to the result of the feature extraction, or,
the second method comprises the following steps: splicing the first depth map and the second depth map of the current image to obtain a third depth map, inputting the current image and the third depth map of the current image into a depth estimation model for feature extraction, and obtaining a target depth map of the current image according to the result of the feature extraction, or,
the third method comprises the following steps: and splicing the first depth map and the second depth map of the current image to obtain a third depth map, inputting the current image, the third depth map of the current image and the reference depth map into a depth estimation model for feature extraction, and obtaining a target depth map of the current image according to a feature extraction result.
It should be noted that, the specific implementation process of the target depth map of the current image obtained by using the depth estimation model may refer to the above-mentioned related description of "determining the target depth map of the current image according to the current image and the first depth map of the current image", and will not be described here.
In an optional embodiment, before a target depth map of a current image obtained by using a depth estimation model, a depth estimation model obtaining request is sent to a server, where the depth estimation model obtaining request carries the current image and a position of a terminal device, and a response message sent by the server and responding to the depth estimation model obtaining request is received, where the response message carries a depth estimation model of the current image, and the depth estimation model of the current image is obtained by the server according to the current image and the position of the terminal device.
The position of the terminal device is the position of the terminal device when the current image is collected, and the position is the coordinate under the world coordinate system.
By acquiring the depth estimation model from the server, the terminal equipment does not need to train by itself to obtain the depth estimation model, so that the power consumption of the terminal equipment is reduced, and the real-time performance of virtual and real shielding is improved.
In an optional embodiment, before obtaining the target depth map of the current image by using the depth estimation model, the method of this embodiment further includes:
and training the initial convolutional neural network model to obtain a depth estimation model of the current image.
Specifically, training an initial convolutional neural network to obtain a depth estimation model of a current image includes:
inputting a plurality of depth map samples corresponding to a plurality of image samples into an initial convolutional neural network for processing to obtain a plurality of first predicted depth maps; calculating a first loss value according to the plurality of first predicted depth maps, the real depth maps corresponding to the plurality of image samples, and the loss function; adjusting the parameters in the initial convolutional neural network according to the first loss value to obtain a first convolutional neural network;
inputting the plurality of depth map samples corresponding to the plurality of image samples into the first convolutional neural network for processing to obtain a plurality of second predicted depth maps; obtaining a second loss value according to the plurality of second predicted depth maps, the real depth maps corresponding to the plurality of image samples, and the loss function; judging whether the second loss value converges; if it converges, determining the first convolutional neural network as the depth estimation model of the current image; if not, adjusting the parameters in the first convolutional neural network according to the second loss value to obtain a second convolutional neural network, repeating the above process until the obtained loss value converges, and determining the convolutional neural network at the time the loss value converges as the depth estimation model of the current image.
The above loss function is:
L = \sum_{i} \Big( \ell_{d}(d_i, g_i) + \ell_{grad}(dg_i, gg_i) + \ell_{normal}(dn_i, gn_i) \Big)

where \ell_{d}(d_i, g_i) represents the error between the depth map d_i output by the depth estimation model at scale i and the corresponding ground-truth depth map g_i, \ell_{grad}(dg_i, gg_i) represents the error between the gradient dg_i of the depth map output by the depth estimation model at scale i and the corresponding ground-truth depth map gradient gg_i, and \ell_{normal}(dn_i, gn_i) represents the error between the normal vector dn_i of the depth map output by the depth estimation model at scale i and the corresponding ground-truth depth map normal vector gn_i.
Alternatively, the depth estimation model may adopt a network structure such as DiverseDepth, SARPN, or CSPN.
The feature extraction function in the depth estimation model can be implemented by network structures such as VGGNet, ResNet, resenext, densneet, and the like.
VGGNet: 3 × 3 convolution kernels and 2 × 2 pooling kernels are used throughout, and performance is enhanced by continuously deepening the network structure. VGG-16 takes an RGB image of size 224 × 224 as input; during pre-processing, the mean of the three channels is computed and subtracted from each pixel (after this processing, fewer iterations are needed and convergence is faster). The image is processed through a series of convolutional layers that use very small 3 × 3 convolution kernels; 3 × 3 is chosen because it is the smallest size that can capture the information in the 8-neighborhood of a pixel. The convolutional layer stride is set to 1 pixel, and the padding of the 3 × 3 convolutional layers is set to 1 pixel. Max pooling is used in a total of 5 pooling layers; after some of the convolutional layers, the max pooling window is 2 × 2 with a stride of 2. The convolutional layers are followed by three fully connected (FC) layers: the first two fully connected layers each have 4096 channels, and the third has 1000 channels for classification. The fully connected layer configuration is the same for all networks. The fully connected layers are followed by Softmax for classification. All hidden layers (between the conv layers) use ReLU as the activation function.
ResNet: a residual network, which can be understood as a stack of residual sub-blocks that together form a very deep network. The residual network is easy to optimize and can improve accuracy by adding considerable depth. Its internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
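A basic residual block with a skip connection can be sketched as follows; this is a generic example rather than the exact ResNet configuration.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # basic residual block: output = ReLU(F(x) + x); the skip connection
    # eases optimization and mitigates vanishing gradients in deep networks
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # skip (identity) connection
```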
ResNeXt: builds on the idea of ResNet and proposes a structure that improves accuracy without increasing parameter complexity while also reducing the number of hyper-parameters. It borrows the Inception idea of widening the network: different features are learned by multiple branches, and blocks with the same topology are stacked in parallel to replace the original three-layer convolution block of ResNet. The accuracy of the model is improved without significantly increasing the number of parameters, and because the branch topologies are identical, the number of hyper-parameters is also reduced, which makes the model easy to transplant and has made it a popular framework for recognition tasks.
DenseNet (densely connected network): each layer in a conventional convolutional network uses only the output features of the previous layer as its input, whereas each layer in DenseNet uses the features of all previous layers as its input and passes its own features to all subsequent layers. DenseNet has several advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and greatly reduces the number of parameters.
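The dense connectivity pattern, in which each layer takes the concatenation of all preceding feature maps as its input, can be sketched as follows; the growth rate and layer structure are illustrative.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # each layer receives the features of all previous layers (channel concatenation)
    # and passes its own features on to all subsequent layers
    def __init__(self, in_ch, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```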
The depth estimation model adopts a multi-scale feature fusion strategy, adds temporal-consistency and scale-consistency constraints, and uses joint training over multiple datasets to improve the generalization of the model across different scenes.
And S303, overlapping and displaying the virtual object image and the current image according to the target depth map of the current image.
In one possible embodiment, the displaying of the virtual object image and the current image in superposition according to the target depth map of the current image includes:
performing edge optimization on the target depth map of the current image to obtain an optimized depth map of the current image;
and displaying the virtual object image and the current image in an overlapping manner according to the optimized depth map.
Wherein, the virtual object image and the current image are displayed in an overlapping manner according to the optimized depth map, and the method comprises the following steps:
judging, according to the optimized depth map, the relation between the depth value corresponding to each pixel of the current image and the depth value of the virtual object: if the depth value corresponding to any pixel point A in the virtual object image is greater than the depth value corresponding to the pixel point B at the same position in the current image, the color of the current image is displayed; otherwise, the color of the virtual object image is displayed. After all pixel points are traversed in this way, the occlusion effect is displayed on the terminal device.
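The per-pixel comparison can be expressed, for example, as the following sketch; the array names and the assumption that the virtual object image carries its own per-pixel depth and coverage mask are illustrative.

```python
import numpy as np

def compose_occlusion(current_rgb, current_depth, virtual_rgb, virtual_depth, virtual_mask):
    """current_rgb   : (H, W, 3) color of the current image
       current_depth : (H, W)    optimized depth map of the current image
       virtual_rgb   : (H, W, 3) rendered color of the virtual object image
       virtual_depth : (H, W)    depth of the virtual object at each pixel
       virtual_mask  : (H, W)    True where the virtual object covers the pixel"""
    out = current_rgb.copy()
    # the virtual object is drawn only where it is present and not farther than the real scene
    show_virtual = virtual_mask & (virtual_depth <= current_depth)
    out[show_virtual] = virtual_rgb[show_virtual]
    return out
```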
In one possible embodiment, the displaying of the virtual object image and the current image in superposition according to the target depth map of the current image includes:
segmenting the target depth map of the current image to obtain a foreground depth map and a background depth map of the current image, wherein the background depth map is the depth map of the background area contained in the target depth map of the current image, and the foreground depth map is the depth map of the foreground area contained in the target depth map of the current image; fusing L background depth maps according to the L poses respectively corresponding to the L background depth maps to obtain a fused three-dimensional scene, wherein the L background depth maps comprise background depth maps of pre-stored images and the background depth map of the current image, the L poses comprise the poses of the pre-stored images and of the current image, and L is an integer greater than 1;
carrying out back projection on the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map; splicing the fused background depth map with the foreground depth map of the current image to obtain an updated depth map; and displaying the virtual object image and the current image in an overlapping manner according to the updated depth map.
It should be noted that the foreground area is the area where the object of interest is located, such as a human, an automobile, an animal, a plant, or other salient object; the background region is a region in the image other than the foreground region.
Specifically, when the object of interest is a person, the target depth map of the current image is segmented according to the portrait mask to obtain the foreground depth map and the background depth map of the current image.
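Splitting and later re-splicing the depth map with the portrait mask can be sketched as follows; marking invalid depths with 0 and using 255 for portrait pixels are assumptions consistent with the mask convention used later in this application.

```python
import numpy as np

def split_depth_by_mask(depth, portrait_mask):
    """depth        : (H, W) target depth map of the current image
       portrait_mask: (H, W) mask, 255 for portrait (foreground) pixels, 0 otherwise"""
    fg = portrait_mask == 255
    foreground_depth = np.where(fg, depth, 0.0)   # depth of the foreground (portrait) area
    background_depth = np.where(fg, 0.0, depth)   # depth of the background area
    return foreground_depth, background_depth

def stitch_depth(fused_background_depth, foreground_depth, portrait_mask):
    # splice the fused background depth map with the foreground depth map of the current image
    fg = portrait_mask == 255
    return np.where(fg, foreground_depth, fused_background_depth)
```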
In one possible embodiment, the displaying of the virtual object image and the current image in superposition according to the target depth map of the current image includes:
segmenting the optimized depth map to obtain a foreground depth map and a background depth map of the current image, wherein the background depth map is a depth map containing a background region in the optimized depth map, the foreground depth map is a depth map containing a foreground region in the optimized depth map, and the optimized depth map is obtained by performing edge optimization on a target depth map of the current image; fusing the L background depth maps according to L positions corresponding to the L background depth maps respectively to obtain a fused three-dimensional scene; the L pieces of background depth maps comprise a background depth map of a prestored image and a background depth map of a current image, and the L positions and postures comprise positions and postures of the prestored image and the current image; l is an integer greater than 1; carrying out back projection on the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map; splicing the fused background depth map and the foreground depth map of the current image to obtain an updated depth map; and overlapping and displaying the virtual object image and the current image according to the updated depth map.
It should be noted that the foreground area is the area where the object of interest is located, such as a human, an automobile, an animal, a plant, or other salient object; the background region is a region in the image other than the foreground region.
Specifically, portrait segmentation is performed on the optimized depth map of the current image according to the portrait mask to obtain the foreground depth map and the background depth map of the current image.
Optionally, the pose may be acquired according to a corresponding image, or may be an SLAM pose, a pose obtained by a deep learning method, or a pose obtained by another method.
Optionally, the specific fusion mode used for the fusion may be a Truncated Signed Distance Function (TSDF) fusion mode, and may also be a surfel fusion mode.
In an alternative embodiment, the target depth map of the current image is edge-optimized to obtain an optimized depth map. The method comprises the following steps:
extracting edge structure information of the current image and edge structure information of the target depth map of the current image; calculating a difference value between the edge structure information of the target depth map of the current image and the edge structure information of the current image, and modifying the edge positions of the target depth map of the current image based on the difference value to obtain the edges of the optimized depth map; and obtaining, according to the edges of the optimized depth map, a depth map whose edges are sharp and consistent with the edges of the current image, this depth map being the optimized depth map.
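A naive sketch of this edge optimization is given below, assuming Canny edges as the edge structure information and a simple local correction that replaces depth values around misaligned edges with a neighborhood median; both choices are illustrative and not prescribed by the method.

```python
import cv2
import numpy as np

def optimize_depth_edges(image_gray, depth, win=5):
    """image_gray: (H, W) uint8 grayscale current image
       depth     : (H, W) float32 target depth map of the current image
       win       : 3 or 5 (float32 medianBlur only supports these sizes)"""
    # edge structure information of the current image and of its target depth map
    img_edges = cv2.Canny(image_gray, 50, 150) > 0
    depth_u8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    depth_edges = cv2.Canny(depth_u8, 50, 150) > 0
    # difference between the two edge structures: depth edges with no image edge nearby
    img_edges_dilated = cv2.dilate(img_edges.astype(np.uint8), np.ones((win, win), np.uint8)) > 0
    misaligned = depth_edges & ~img_edges_dilated
    # naive correction: around misaligned depth edges, replace depth with a local median
    median = cv2.medianBlur(depth.astype(np.float32), win)
    region = cv2.dilate(misaligned.astype(np.uint8), np.ones((win, win), np.uint8)) > 0
    return np.where(region, median, depth)
```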
Displaying the virtual object image and the current image in an overlapping manner according to the updated depth map, specifically comprising:
judging, according to the updated depth map, the relation between the depth value corresponding to each pixel of the current image and the depth value of the virtual object image: if the depth value corresponding to any pixel point A in the virtual object is greater than the depth value corresponding to the pixel point B at the same position in the current image, the color of the current image is displayed; otherwise, the color of the virtual object is displayed. After all pixel points are traversed in this way, the occlusion effect is displayed on the terminal device.
It can be seen that in the embodiment of the application, a higher-precision target depth map is obtained by performing depth estimation on the current image together with the first depth map and the second depth map of the current image, which solves the problems of inter-frame flicker and instability in the subsequent virtual-real occlusion effect. Introducing the depth map acquired by the TOF camera during depth estimation further improves the precision of the target depth map of the current image and thus further alleviates these problems. Performing edge optimization on the target depth map of the current image to obtain the optimized depth map, and then fusing multiple frames of depth maps, yields a depth map with sharper portrait edges, which helps to improve the virtual-real occlusion effect.
Referring to fig. 5, fig. 5 is a schematic flowchart of another image processing method according to an embodiment of the present disclosure. As shown in fig. 5, the method includes:
S501, receiving a depth estimation model request message sent by the terminal device, wherein the request message carries a current image acquired by the terminal device and the position of the terminal device when the current image is acquired.
The position is a coordinate in a world coordinate system, which may be a UTM coordinate system, a GPS coordinate system, or other world coordinate systems.
S502, obtaining a depth estimation model of the current image from a plurality of depth estimation models stored in a server according to the current image and the position of the terminal equipment.
In one possible embodiment, obtaining the depth estimation model of the current image from a plurality of depth estimation models stored in a server according to the current image and the position of the terminal device includes:
acquiring a plurality of frames of first images according to the position of the terminal equipment, wherein the plurality of frames of first images are images in a preset range taking the position of the terminal equipment as the center in a basic map; acquiring a target image from the multiple frames of first images, wherein the target image is the image with the highest similarity with the current image in the multiple frames of first images; and determining the depth estimation model corresponding to the target image as the depth estimation model of the current image.
Specifically, in order to improve the accuracy of depth estimation, in the server, a depth estimation model is obtained by training separately for each position; after receiving a depth estimation model acquisition request of a terminal device, a server acquires a plurality of frames of first images according to the position of the terminal device, wherein the plurality of frames of first images are images in a preset range with the position of the terminal device as the center in a basic map, and a target image is acquired from the plurality of frames of first images, and the target image is the image with the highest similarity with a current image in the plurality of frames of first images; obtaining the pose of the current image according to the pose of the target image; and determining a depth estimation model corresponding to the position from the server according to the position in the pose of the current image, wherein the depth estimation model is the depth estimation model of the current image.
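For example, the selection of the depth estimation model on the server side could be sketched as follows; the data layout of the stored entries and the similarity function are assumptions.

```python
import numpy as np

def select_depth_model(current_image, device_position, stored_entries, similarity, radius=50.0):
    """stored_entries : list of dicts with keys 'image', 'position', 'model' (assumed layout)
       device_position: position of the terminal device in the world coordinate system
       similarity     : callable returning a similarity score between two images (assumed)"""
    # first images: images of the base map within a preset range around the device position
    nearby = [e for e in stored_entries
              if np.linalg.norm(np.asarray(e['position']) - np.asarray(device_position)) <= radius]
    if not nearby:
        return None
    # target image: the nearby image most similar to the current image
    target = max(nearby, key=lambda e: similarity(current_image, e['image']))
    # the depth estimation model associated with the target image is used for the current image
    return target['model']
```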
In a possible embodiment, the method of this embodiment further includes:
respectively training on the plurality of frames of first images to obtain a depth estimation model of each frame of first image in the plurality of frames of first images;
wherein training on each frame of first image in the plurality of frames of first images to obtain the depth estimation model of each first image comprises:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps; calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function; adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the first image of each frame;
wherein the loss function is determined based on an error between the predicted depth map and the true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
It should be noted here that the above is only one training process; in practical application, iteration is carried out for multiple times according to the mode until the calculated loss value is converged; and determining the convolutional neural network model when the loss value is converged as the depth estimation model of the current image. The above specific training process can be referred to the related description in S302, and will not be described here.
S503, sending a response message responding to the depth estimation model request message to the terminal equipment, wherein the response message carries the depth estimation model of the current image.
In an optional embodiment, a local map acquisition request sent by a terminal device is received, where the local map acquisition request carries a current image; a local map is acquired from a basic map stored in the server according to the pose of the current image; and a response message responding to the local map acquisition request is sent to the terminal device, where the response message carries the local map.
Specifically, after receiving a local map request message, the server calculates position and angle information of the current image in the global map according to a VPS (visual positioning system); the position and angle information is collectively referred to as pose information, and the pose is a pose in a world coordinate system. A local map is acquired from the basic map according to the pose information; the local map is an area of the basic map centered on the position and within a certain range around it (for example, a radius of 50 meters). The basic map comprises 2D feature points of objects, the feature descriptors corresponding to the feature points, and 3D point cloud information, so the local map also comprises the 2D feature points of objects, the corresponding feature descriptors, and 3D point cloud information.
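Cropping the local map from the basic map around the localized position can be sketched as follows, assuming the basic map stores its 3D points, 2D feature points and descriptors in parallel arrays.

```python
import numpy as np

def crop_local_map(base_points_3d, base_keypoints_2d, base_descriptors, pose_position, radius=50.0):
    """base_points_3d   : (N, 3) 3D point cloud of the basic map
       base_keypoints_2d: (N, 2) 2D feature points associated with the 3D points
       base_descriptors : (N, D) feature descriptors
       pose_position    : (3,)   position of the current image in the world coordinate system"""
    # keep everything within the given radius (e.g. 50 m) of the localized position
    dist = np.linalg.norm(base_points_3d - np.asarray(pose_position), axis=1)
    keep = dist <= radius
    return base_points_3d[keep], base_keypoints_2d[keep], base_descriptors[keep]
```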
In an optional embodiment, the method of the present application further comprises:
acquiring an initial depth map of the current image according to the current image and a pre-stored image, and acquiring a fifth depth map according to the current image and a 3D point corresponding to the local map; and optimizing the initial depth map and the fifth depth map according to the pose of the current image to obtain a second depth map.
Optionally, the pre-stored image is uploaded by the terminal device, and the timestamp of the pre-stored image is located before the timestamp of the current image.
Specifically, the first 2D feature points of the current image are matched with the 2D feature points of the pre-stored image to obtain sixth 2D feature points of the current image; noise points in the sixth 2D feature points of the current image are removed to obtain seventh 2D feature points of the current image; each of the seventh 2D feature points of the current image is triangulated to obtain the initial 3D point of that feature point in space; and the initial depth map of the current image comprises the initial 3D points in space of the seventh 2D feature points of the current image.
In one possible embodiment, the fifth depth map is obtained from the 3D points corresponding to the current image and the local map, including:
Acquiring M maps from a multi-frame basic map, wherein the similarity between each map in the M maps and a current image is greater than a first preset threshold; m is an integer greater than 0; matching the 2D characteristic points of the M maps with the first 2D characteristic point of the current image to obtain a plurality of characteristic point matching pairs; each feature point matching pair in the multiple feature point matching pairs comprises a fourth 2D feature point and a fifth 2D feature point, the fourth 2D feature point and the fifth 2D feature point are mutually matched feature points, the fourth 2D feature point is a first 2D feature point of the current image, and the fifth 2D feature point is a 2D feature point in the M maps; acquiring a 3D point corresponding to each fourth 2D feature point in a plurality of feature point matching pairs according to the corresponding relation between each fifth 2D feature point and the 3D point in the M maps; and acquiring a fifth depth map according to the 3D point corresponding to the local map and the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs, wherein the fifth depth map comprises the 3D point which is matched with the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs in the 3D points corresponding to the local map.
Because the basic map in the server is acquired offline, some places in it may differ from the current actual environment. For example, a large billboard in a shopping mall may exist during offline acquisition and then be removed by the time the user acquires the current image, so the map sent by the server contains 3D point cloud information inconsistent with the current environment. In addition, the image received by the server may have undergone privacy processing, which can also lead to 3D point cloud information in the delivered map that is inconsistent with the current environment.
For these reasons, the server updates the delivered local map to obtain the second depth map.
Specifically, the server performs feature extraction on the current image to obtain the first 2D feature points of the current image; the feature extraction method includes, but is not limited to, the SIFT method, the ORB method, the SURF method, and the SuperPoint method. The server matches the 2D feature points of the current image with the 2D feature points of the pre-stored image to obtain sixth 2D feature points of the current image. Optionally, the pre-stored image is one or more images uploaded by the terminal device whose timestamps precede, and are close to, the timestamp of the current image. The server then removes noise points from the sixth 2D feature points of the current image; specifically, a verification value of each sixth 2D feature point of the current image is calculated through the homography matrix, the fundamental matrix and the essential matrix, and if the verification value of a sixth 2D feature point is lower than a second preset threshold, that sixth 2D feature point is determined to be a noise point and deleted, so that the seventh 2D feature points of the current image are obtained.
The server triangulates the seventh 2D feature points of the current image to obtain the initial positions in space of the seventh 2D feature points of the current image, and thereby obtains the initial depth map corresponding to the seventh 2D feature points of the current image.
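An illustrative sketch of the noise removal and triangulation steps is given below, using a RANSAC estimate of the fundamental matrix as the geometric check and OpenCV triangulation; the thresholds and the use of projection matrices are assumptions.

```python
import cv2
import numpy as np

def filter_and_triangulate(pts_cur, pts_prev, P_cur, P_prev, ransac_thresh=3.0):
    """pts_cur, pts_prev: (N, 2) matched 2D feature points in the current and pre-stored image
       P_cur, P_prev    : (3, 4) projection matrices (intrinsics x pose) of the two images"""
    pts_cur = np.asarray(pts_cur, dtype=np.float64)
    pts_prev = np.asarray(pts_prev, dtype=np.float64)
    # geometric verification: matches inconsistent with the epipolar geometry are treated as noise
    F, inlier_mask = cv2.findFundamentalMat(pts_cur, pts_prev, cv2.FM_RANSAC, ransac_thresh)
    inliers = inlier_mask.ravel() == 1
    pts_cur, pts_prev = pts_cur[inliers], pts_prev[inliers]
    # triangulate the remaining (seventh) 2D feature points to get their initial 3D positions
    pts4d = cv2.triangulatePoints(P_cur, P_prev, pts_cur.T, pts_prev.T)   # (4, M), homogeneous
    pts3d = (pts4d[:3] / pts4d[3]).T                                      # (M, 3)
    return pts_cur, pts3d
```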
The method comprises the steps that a server obtains M images from a multi-frame basic map in an image retrieval mode, wherein the similarity between each map in the M maps and a current image is greater than a first preset threshold value, and the image retrieval method comprises but is not limited to a bag-of-words tree method or a NetVlad method based on deep learning; matching a first 2D feature point in the current image with 2D feature points in M images to obtain a plurality of matched feature point pairs, wherein each matched feature point pair in the plurality of matched feature point pairs comprises a fourth 2D feature point and a fifth 2D feature point, the fourth 2D feature point and the fifth 2D feature point are mutually matched feature points, the fourth 2D feature point is a 2D feature point in the current image, and the fifth 2D feature point is a feature point in the M images; each 2D feature point in the M images corresponds to one 3D point, so that the corresponding relation between the fifth 2D feature point and the 3D point in the M images and the plurality of matched feature point pairs determine the 3D point corresponding to the fourth 2D feature point of the current image; and acquiring a fifth depth map according to the 3D point corresponding to the local map and the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs, wherein the fifth depth map comprises the 3D point which is matched with the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs in the 3D points corresponding to the local map.
Combining the initial depth map and the fifth depth map with the initial pose information of the current image and of each historical image, bundle adjustment (BA) is used to adjust and optimize the position information of the 3D points in the depth maps; the reprojection error of each optimized 3D point is calculated, and 3D points whose reprojection error exceeds the error threshold are deleted. The steps of BA optimization and point deletion are repeated several times to finally obtain an optimized high-precision depth map, which is the second depth map, so that the basic map in the server is updated.
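The alternation of BA optimization and reprojection-error-based point deletion can be sketched as follows; here bundle_adjust is a placeholder for an external BA solver, and the single-view error computation is a simplification.

```python
import numpy as np

def reprojection_errors(points_3d, observations, K, R, t):
    """points_3d   : (N, 3) 3D points in world coordinates
       observations: (N, 2) observed 2D feature points in one image
       K, R, t     : intrinsics (3, 3), rotation (3, 3) and translation (3,) of that image"""
    cam = R @ points_3d.T + t.reshape(3, 1)   # world -> camera
    proj = K @ cam                            # camera -> homogeneous pixel coordinates
    proj = (proj[:2] / proj[2]).T
    return np.linalg.norm(proj - observations, axis=1)

def refine_depth_map(points_3d, observations, K, R, t, bundle_adjust, err_thresh=2.0, rounds=3):
    # bundle_adjust is a placeholder assumed to return re-optimized 3D point positions
    for _ in range(rounds):
        points_3d = bundle_adjust(points_3d, observations, K, R, t)
        errors = reprojection_errors(points_3d, observations, K, R, t)
        keep = errors <= err_thresh           # delete points whose error exceeds the threshold
        points_3d, observations = points_3d[keep], observations[keep]
    return points_3d
```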
By the method, the basic map stored in the server can be continuously updated by the image uploaded by the terminal equipment, so that the basic map is consistent with the image content acquired by the terminal equipment, and high-precision 3D point cloud information is provided during depth estimation. In addition, the content in the basic map is consistent with the content in the image acquired by the terminal equipment in the above mode, and the more images are uploaded by the terminal equipment, the more thorough the update of the map is, so that the more accurate the depth estimation result is.
The cloud local map updating algorithm solves the problem that, when the offline map no longer matches the current scene, directly using offline map points in depth map estimation introduces errors; it also alleviates the scale ambiguity and inter-frame instability of monocular depth estimation, and thereby solves the problems of virtual-real occlusion flicker and inconsistent occlusion between frames.
In a specific example, step S10: obtaining the color image collected by the terminal device and information measured by an Inertial Measurement Unit (IMU); initializing the SLAM system according to the color image and the information measured by the IMU; after the SLAM system is initialized successfully, the terminal device calculates its pose in a local coordinate system in real time according to the SLAM system; after acquiring the current image, the terminal device performs face detection on the current image and applies privacy processing to the face area in the current image, for example, filling the face area with black; the current image after privacy processing is then transmitted to the server; the server calls a VPS algorithm to localize the current image, and if the localization succeeds, the pose of the current image in the basic map stored by the server is obtained; the server acquires a local map from the basic map according to the pose of the current image in the basic map, where the local map is an area of the basic map centered on the position of the current image and with a radius of 150 m; the 3D point cloud, 2D feature points and feature descriptors of the local map are transmitted to the terminal device; the counter of currently valid uploaded images is set to 1, and the process proceeds to step S20; if the localization fails, the counter of currently valid uploaded images is set to 0, a failure message is returned, and the process re-enters step S10;
step S20:
Step S20.1: if the counter of currently valid uploaded images is 0, the process directly enters step S20.2; otherwise, the server extracts features of the current image to obtain the first 2D feature points of the current image; the feature extraction method includes, but is not limited to, the SIFT method, the ORB method, the SURF method, and the SuperPoint method. The server matches the first 2D feature points of the current image with the 2D feature points of the pre-stored image to obtain sixth 2D feature points of the current image. Optionally, the pre-stored image is one or more images uploaded by the terminal device whose timestamps precede, and are close to, the timestamp of the current image. The server then removes noise points from the sixth 2D feature points of the current image; specifically, the verification value of each sixth 2D feature point of the current image is calculated through the homography matrix, the fundamental matrix and the essential matrix, and if the verification value of a sixth 2D feature point is lower than the second preset threshold, that sixth 2D feature point is determined to be a noise point and deleted, so that the seventh 2D feature points of the current image are obtained. The server triangulates the seventh 2D feature points of the current image to obtain their initial positions in space, and thereby obtains the initial depth map corresponding to the seventh 2D feature points of the current image.
Step S20.2: the server acquires the closest 10 images from the local map obtained in step S10 according to the NetVlad algorithm. Extracting features of a current image uploaded by the terminal equipment according to the 2D feature point category of the local map; the 2D feature points of the local map in this example include, but are not limited to, SIFT feature points, SURF feature points, ORB feature points, superpoint feature points, D2Net feature points, aslpeak feature points, R2D2 feature points, and the like. Obtaining a 2D-2D feature point matching relation by using a violence matching method to obtain a first 2D feature point in the current image and a 2D feature point in the 10 images; because the 2D characteristic points in the local map all have 3D points which are in one-to-one correspondence with the 2D characteristic points, the 3D points which are in matching relation with the first 2D characteristic point of the current image can be screened out from the 3D points of the local image according to the obtained 2D-2D characteristic point matching relation;
Step S20.3: the initial depth map and the depth map formed by the 3D points of the local map that match the 2D feature points of the current image are combined with the initial pose information of the current image and of each historical image, and BA is used to adjust and optimize the position information of the 3D points; the reprojection errors of the optimized 3D points are calculated, and 3D points whose reprojection error exceeds the error threshold are deleted; the steps of BA optimization and deletion are repeated several times to finally obtain an optimized high-precision depth map, thereby updating the basic map in the server;
step S20.3: and setting a counter of the currently effective uploaded image to be 1.
And step S30, performing depth estimation according to the current image, the optimized high-precision depth map and a depth map obtained by 3D points corresponding to 2D feature points matched with the 2D feature points in the current image in the local map to obtain a target depth map of the current image. The specific implementation process can be referred to the related description of S302, and will not be described here.
And step S40, obtaining a portrait segmentation result picture from the current image through a portrait segmentation network. The portrait segmentation network comprises a feature extraction network and a softmax classifier, and the structure of the feature extraction network can be FCN, ParseNet, DeepLab v1, DeepLab v2, DeepLab v3, RefineNet, SegNet, PSPNet, ENet, ICNet, BiSeNet and other network structures. The feature extraction network is used for extracting features of a current image, then bilinear up-sampling is carried out on the features of the current image to obtain a feature map with the same size as an input size, and finally a label of each pixel is obtained through a softmax classifier, so that a portrait segmentation result map, also called a portrait mask, is obtained. For example, a current image is input to BiSeNet to obtain a feature map, and each pixel of the feature map is classified by a Softmax classifier, wherein the pixel of a portrait region is 255, and the pixel of a non-portrait region is 0.
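The upsampling-and-softmax step that turns the feature map into a portrait mask can be sketched as follows; the backbone is treated as a black box, and the 255/0 convention follows the example above.

```python
import torch
import torch.nn.functional as F

def portrait_mask_from_features(feature_map, input_size):
    """feature_map: (B, C, h, w) output of the feature extraction network, C = number of classes
       input_size : (H, W) size of the current image"""
    # bilinear upsampling to the input resolution
    logits = F.interpolate(feature_map, size=input_size, mode='bilinear', align_corners=False)
    # per-pixel label via softmax / argmax; class 1 is assumed to be "portrait"
    probs = torch.softmax(logits, dim=1)
    labels = probs.argmax(dim=1)                   # (B, H, W)
    mask = (labels == 1).to(torch.uint8) * 255     # portrait pixels -> 255, others -> 0
    return mask
```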
It should be noted that the segmentation of the current image in the present application is not limited to human image segmentation, but may also be object segmentation, such as object segmentation of automobiles, airplanes, kittens, and the like, and the segmentation method may refer to human image segmentation, and is not described herein.
Step S50, the terminal device carries out portrait segmentation on the target depth map of the current image according to the portrait mask to obtain a foreground depth map and a background depth map of the current image, wherein the background depth map of the current image is a depth map containing a background area in the target depth map of the current image, and the foreground depth map of the current image is a depth map containing a foreground area in the target depth map of the current image; performing Truncated Signed Distance Function (TSDF) fusion on the L background depth maps according to L positions corresponding to the L background depth maps respectively to obtain a fused three-dimensional scene; the L pieces of background depth maps comprise a background depth map of a prestored image and a background depth map of a current image, and the L positions and postures comprise positions and postures of the prestored image and the current image; l is an integer greater than 1; carrying out back projection on the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map; splicing the fused background depth map and the foreground depth map of the current image to obtain an updated depth map
Step S60, sending the depth map obtained in step S50 and the current image together into a renderer, determining, in the renderer, a relationship between a depth value of each pixel of the current image and a depth map of the virtual object, displaying a color of the current image if the depth value of the virtual object is greater than the depth map, otherwise displaying the color of the virtual object, and displaying a shielding effect on the terminal device after traversing each pixel point one by one.
As shown in fig. 6, by accurately estimating the depth map of the tree, the virtual panda can be seen through the gaps in the real tree and can also be occluded by structures such as walls. The algorithm also supports occlusion between people and virtual objects: as can be seen from the last figure, the algorithm accurately estimates the depth map of the whole scene, so the virtual panda can appear between a person's arms. The overall sense of immersion is strong, giving an excellent user experience.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 7, the terminal device 100 includes:
the 2D-3D matching module 102 is configured to match a first 2D feature point of the current image with a pre-stored 2D feature point to obtain a second 2D feature point of the current image, and obtain a 3D point corresponding to the second 2D feature point in the current image according to the second 2D feature point in the current image and a relationship between the pre-stored 2D feature point and the 3D point; the 3D points corresponding to the second 2D feature points form a first depth map;
the depth estimation module 104 is configured to perform feature extraction on a current image to obtain T first feature maps, where resolution of each first feature map in the T first feature maps is different; performing feature extraction on the first depth map to obtain T second feature maps, wherein the resolution of each second feature map in the T second feature maps is different; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; and performing upsampling and fusion processing on the T third feature maps to obtain a target depth map of the current image.
It should be noted that, the specific process of performing upsampling and fusion processing on T third feature maps to obtain the target depth map of the current image may refer to the related description of step S302, and will not be described here.
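A sketch of the multi-scale combination performed by the depth estimation module is given below: feature maps of the current image and of the first depth map at the same resolution are superposed, and the results are upsampled and fused from coarse to fine. The assumption that all scales share the same channel count, and the layer choices, are illustrative only.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Combines T image feature maps and T depth feature maps of matching resolutions."""
    def __init__(self, channels, num_scales):
        super().__init__()
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_scales - 1)])
        self.head = nn.Conv2d(channels, 1, 3, padding=1)   # final 1-channel depth prediction

    def forward(self, image_feats, depth_feats):
        # image_feats, depth_feats: lists of T tensors ordered from finest to coarsest (assumed)
        # superpose (add) the first and second feature maps that share the same resolution
        third_feats = [f_img + f_dep for f_img, f_dep in zip(image_feats, depth_feats)]
        x = third_feats[-1]                                # start from the coarsest scale
        for i in range(len(third_feats) - 2, -1, -1):
            # upsample and fuse with the next finer third feature map
            x = F.interpolate(x, size=third_feats[i].shape[-2:], mode='bilinear',
                              align_corners=False)
            x = self.fuse[i](x + third_feats[i])
        return self.head(x)                                # target depth map of the current image
```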
And the portrait segmentation module 108 is configured to obtain a portrait segmentation result map from the current image through a portrait segmentation network. The portrait segmentation network comprises a feature extraction network and a softmax classifier. The feature extraction network is used for extracting features of a current image, then bilinear up-sampling is carried out on the features of the current image to obtain a feature map with the same size as an input size, and finally a label of each pixel is obtained through a softmax classifier, so that a portrait segmentation result map, also called a portrait mask, is obtained. For example, a current image is input to BiSeNet to obtain a feature map, and each pixel of the feature map is classified by a Softmax classifier, wherein the pixel of a portrait region is 255, and the pixel of a non-portrait region is 0.
The depth map edge optimization module 105 is configured to perform edge structure extraction on the current image and the target depth map thereof, respectively, to obtain edge structure information of the current image and edge structure information of the target depth map; calculating a difference value from the edge structure of the target depth map to the edge structure of the current image by taking the edge structure information of the current image as reference, and then modifying the edge position of the target depth map according to the difference value so as to optimize the edge of the depth map and obtain the optimized depth map, wherein the depth map is a sharp depth map corresponding to the edge of the current image;
a virtual and real occlusion application module 107, configured to perform portrait segmentation on the optimized depth map according to the portrait mask, so as to obtain a depth map including a portrait area and a depth map of a non-portrait area; then, overlapping and displaying the current image and the virtual object image according to the relation between the depth values in the depth map containing the portrait area and the depth map containing the non-portrait area and the depth values of the virtual object; if the depth value corresponding to any pixel point A in the virtual object is larger than the depth value corresponding to a pixel point B in the current image, wherein the position of the pixel point B is the same as that of any pixel point A in the virtual object, displaying the color of the current image; otherwise, displaying the color of the virtual object image; after traversing all the pixel points according to the method, the shielding effect is displayed on the terminal equipment. Due to the fact that the portrait segmentation is carried out on the optimized depth map, when the current image comprises the portrait, the shielding effect between the person and the virtual object can be displayed on the terminal device.
Referring to fig. 8, fig. 8 is a schematic structural diagram of another terminal device provided in the embodiment of the present application. As shown in fig. 8, the terminal device 100 includes:
the 2D-3D matching module 102 is configured to match a first 2D feature point of the current image with a pre-stored 2D feature point to obtain a second 2D feature point of the current image; acquiring a 3D point corresponding to a second 2D feature point in the current image according to the second 2D feature point in the current image and a pre-stored relationship between the 2D feature point and the 3D point; the 3D points corresponding to the second 2D feature points form a first depth map;
the depth estimation module 104 is configured to perform feature extraction on a current image to obtain T first feature maps, where resolution of each first feature map in the T first feature maps is different; performing feature extraction on the first depth map to obtain T second feature maps, wherein the resolution of each second feature map in the T second feature maps is different; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; and performing upsampling and fusion processing on the T third feature maps to obtain a target depth map of the current image.
It should be noted that, the specific process of performing upsampling and fusion processing on T third feature maps to obtain the target depth map of the current image may refer to the related description of step S302, and will not be described here.
And the portrait segmentation module 108 is configured to obtain a portrait segmentation result map from the current image through a portrait segmentation network. The portrait segmentation network comprises a feature extraction network and a softmax classifier. The feature extraction network is used for extracting features of a current image, then bilinear up-sampling is carried out on the features of the current image to obtain a feature map with the same size as an input size, and finally a label of each pixel is obtained through a softmax classifier, so that a portrait segmentation result map, also called a portrait mask, is obtained. For example, a current image is input to BiSeNet to obtain a feature map, and each pixel of the feature map is classified by a Softmax classifier, wherein the pixel of a portrait region is 255, and the pixel of a non-portrait region is 0.
The depth map edge optimization module 105 is configured to perform edge structure extraction on the current image and the target depth map thereof, respectively, to obtain edge structure information of the current image and edge structure information of the target depth map; calculating a difference value from the edge structure of the target depth map to the edge structure of the current image by taking the edge structure information of the current image as reference, and then modifying the edge position of the target depth map according to the difference value so as to optimize the edge of the depth map and obtain the optimized depth map, wherein the depth map is a sharp depth map corresponding to the edge of the current image;
the multi-view fusion module 106 is configured to perform portrait segmentation on the optimized depth map according to the portrait mask to obtain a foreground depth map and a background depth map, where the background depth map is a depth map in which the optimized depth map includes a non-human region, and the foreground depth map is a depth map in which the optimized depth map includes a portrait region; fusing the L background depth maps according to L poses respectively corresponding to the L background depth maps to obtain a fused three-dimensional scene; the L pieces of background depth maps comprise a background depth map of a prestored image and a background depth map of a current image, and the L positions and postures comprise positions and postures of the prestored image and the current image; l is an integer greater than 1; carrying out back projection on the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map; splicing the fused background depth map and the foreground depth map of the current image to obtain an updated depth map;
the virtual-real shielding application module 107 is configured to determine, according to the updated depth map, a relationship between a depth value corresponding to each pixel of the current image and a depth value of the virtual object, and if a depth value corresponding to any pixel a in the virtual object is greater than a depth value corresponding to a pixel B in the current image, which is at the same position as any pixel a in the virtual object, display a color of the current image; otherwise, displaying the color of the virtual object; after traversing all the pixel points according to the method, the shielding effect is displayed on the terminal equipment.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a system according to an embodiment of the present disclosure. As shown in fig. 9, the system includes a terminal device 100 and a server 200, wherein the terminal device 100 includes: the system comprises a 2D-3D matching module 102, a depth estimation module 104, a depth map edge optimization module 105, a portrait segmentation module 108 and a virtual and real occlusion application module 107; the server 200 comprises a VPS positioning and map issuing module 101 and a local map updating module 103;
the VPS positioning and map issuing module 101 is configured to, after receiving a current image, obtain position and angle information of the current image in a global map according to VPS calculation, where the position and angle information are collectively referred to as pose information, and the pose information is based on a pose in a world coordinate system; acquiring a local map from a basic map according to the pose information, for example, taking the position in the pose information as a center in the local map and a region within a certain range (for example, with a radius of 50 meters) around the position in the pose information; the basic map comprises 2D feature points of the object and feature descriptors corresponding to the 2D feature points and 3D point cloud information, so that the local map also comprises the 2D feature points of the object and the feature descriptors corresponding to the 2D feature points and the 3D point cloud information;
the 2D-3D matching module 102 is used for extracting features of the current image to obtain a first 2D feature point of the current image; matching the first 2D feature point of the current image according to the 2D feature point of the local map to obtain a third 2D feature point in the local map; the local map is obtained by the server according to the current image; acquiring a 3D point corresponding to a third 2D feature point in the local map according to the corresponding relation between the second 2D feature point in the local map and the 2D feature point and the 3D point in the local map; the first depth map of the current image comprises 3D points corresponding to the third 2D feature points in the local map;
the depth estimation module 104 is configured to perform feature extraction on the current image to obtain T first feature maps, where resolution of each first feature map in the T first feature maps is different; performing feature extraction on the first depth map to obtain T second feature maps, wherein the resolution of each second feature map in the T second feature maps is different; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; carrying out up-sampling and fusion processing on the T third feature maps to obtain a target depth map of the current image;
in an optional embodiment, the local map updating module 103 is configured to match the first 2D feature point of the current image with a pre-stored 2D feature point of the image to obtain a sixth 2D feature point of the current image; removing noise points in the sixth 2D feature point of the current image to obtain a seventh 2D feature point of the current image; performing triangulation on each 2D feature point in the seventh 2D feature point of the current image to obtain an initial 3D point of the seventh 2D feature point of the current image in space; the initial depth map of the current image comprises an initial 3D point cloud of the seventh 2D feature point of the current image in space; acquiring M maps from a multi-frame basic map, wherein the similarity between each map in the M maps and the current image is greater than a first preset threshold; matching the 2D feature points of the M maps with the first 2D feature point of the current image to obtain a plurality of feature point matching pairs; each feature point matching pair in the multiple feature point matching pairs comprises a fourth 2D feature point and a fifth 2D feature point, the fourth 2D feature point and the fifth 2D feature point are mutually matched feature points, the fourth 2D feature point is a first 2D feature point of the current image, and the fifth 2D feature point is a 2D feature point in the M maps; acquiring a 3D point corresponding to each fourth 2D feature point in the plurality of feature point matching pairs according to the corresponding relation between the fifth 2D feature point and the 3D point in the M maps; acquiring a fifth depth map according to the 3D point corresponding to the local map and the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs, wherein the fifth depth map comprises the 3D point which is matched with the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs in the 3D points corresponding to the local map; combining the initial depth map and the fifth depth map with the initial pose information of the current image and of each image in the pre-stored images, using BA to adjust the position information of the 3D points in the optimized depth map, calculating the reprojection error of each 3D point in the optimized point cloud, and deleting the 3D points with the reprojection error exceeding the error threshold; repeating the steps of BA optimization and point cloud deletion for multiple times to finally obtain an optimized high-precision depth map, wherein the optimized high-precision depth map is the second depth map, so that the basic map in the server is updated;
the depth estimation module 104 is configured to perform feature extraction on the current image to obtain T first feature maps, where resolution of each first feature map in the T first feature maps is different; performing feature extraction on the third depth map to obtain T second feature maps, wherein the resolution of each second feature map in the T second feature maps is different, and the third depth map is obtained by splicing the first depth map and the second depth map; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; carrying out up-sampling and fusion processing on the T third feature maps to obtain a target depth map of the current image;
it should be noted that, the specific process of performing upsampling and fusion processing on T third feature maps to obtain the target depth map of the current image may refer to the related description of step S302, and will not be described here.
And the portrait segmentation module 108 is configured to obtain a portrait segmentation result map from the current image through a portrait segmentation network. The portrait segmentation network comprises a feature extraction network and a softmax classifier. The feature extraction network is used for extracting features of a current image, then bilinear up-sampling is carried out on the features of the current image to obtain a feature map with the same size as an input size, and finally a label of each pixel is obtained through a softmax classifier, so that a portrait segmentation result map, also called a portrait mask, is obtained. For example, a current image is input to BiSeNet to obtain a feature map, and each pixel of the feature map is classified by a Softmax classifier, wherein the pixel of a portrait region is 255, and the pixel of a non-portrait region is 0.
The depth map edge optimization module 105 is configured to perform edge structure extraction on the current image and the target depth map thereof, respectively, to obtain edge structure information of the current image and edge structure information of the target depth map; calculating a difference value from the edge structure of the target depth map to the edge structure of the current image by taking the edge structure information of the current image as reference, and then modifying the edge position of the target depth map according to the difference value so as to optimize the edge of the depth map and obtain the optimized depth map, wherein the depth map is a sharp depth map corresponding to the edge of the current image;
a virtual and real occlusion application module 107, configured to perform portrait segmentation on the optimized depth map according to the portrait mask, so as to obtain a depth map including a portrait area and a depth map of a non-portrait area; then, overlapping and displaying the current image and the virtual object image according to the relation between the depth values in the depth map containing the portrait area and the depth map containing the non-portrait area and the depth values of the virtual object; if the depth value corresponding to any pixel point A in the virtual object image is larger than the depth value corresponding to a pixel point B in the current image, wherein the position of the pixel point B is the same as that of any pixel point A in the virtual object image, displaying the color of the current image; otherwise, displaying the color of the virtual object image; after traversing all the pixel points according to the method, the shielding effect is displayed on the terminal equipment. Due to the fact that the portrait segmentation is carried out on the optimized depth map, when the current image comprises the portrait, the shielding effect between the person and the virtual object can be displayed on the terminal device.
Referring to fig. 10, fig. 10 is a schematic structural diagram of another system provided in the embodiment of the present application. As shown in fig. 10, the system includes a terminal device 100 and a server 200, wherein the terminal device 100 includes: the system comprises a 2D-3D matching module 102, a depth estimation module 104, a depth map edge optimization module 105, a portrait segmentation module 108 and a virtual and real occlusion application module 107; the server 200 comprises a VPS positioning and map issuing module 101 and a local map updating module 103;
the VPS positioning and map issuing module 101 is configured to, after receiving a current image, obtain position and angle information of the current image in a global map according to VPS calculation, where the position and angle information are collectively referred to as pose information, and the pose information is based on a pose in a world coordinate system; acquiring a local map from a basic map according to the pose information, for example, taking the position in the pose information as a center in the local map and a region within a certain range (for example, with a radius of 50 meters) around the position in the pose information; the basic map comprises 2D feature points of the object and feature descriptors corresponding to the 2D feature points and 3D point cloud information, so that the local map also comprises the 2D feature points of the object and the feature descriptors corresponding to the 2D feature points and the 3D point cloud information;
the 2D-3D matching module 102 is used for extracting features of the current image to obtain a first 2D feature point of the current image; matching the first 2D feature point of the current image according to the 2D feature point of the local map to obtain a third 2D feature point in the local map; the local map is obtained by the server according to the current image; acquiring a 3D point corresponding to a third 2D feature point in the local map according to the corresponding relation between the second 2D feature point in the local map and the 2D feature point and the 3D point in the local map; the first depth map of the current image comprises 3D points corresponding to the third 2D feature points in the local map;
the depth estimation module 104 is configured to perform feature extraction on the current image to obtain T first feature maps, where resolution of each first feature map in the T first feature maps is different; performing feature extraction on the first depth map to obtain T second feature maps, wherein the resolution of each second feature map in the T second feature maps is different; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; carrying out up-sampling and fusion processing on the T third feature maps to obtain a target depth map of the current image;
in an optional embodiment, the local map updating module 103 is configured to match the first 2D feature point of the current image with a pre-stored 2D feature point of the image to obtain a sixth 2D feature point of the current image; removing noise points in the sixth 2D feature point of the current image to obtain a seventh 2D feature point of the current image; performing triangulation on each 2D feature point in the seventh 2D feature point of the current image to obtain an initial 3D point of the seventh 2D feature point of the current image in space; the initial depth map of the current image comprises an initial 3D point cloud of the seventh 2D feature point of the current image in space; acquiring M maps from a multi-frame basic map, wherein the similarity between each map in the M maps and the current image is greater than a first preset threshold; matching the 2D feature points of the M maps with the first 2D feature point of the current image to obtain a plurality of feature point matching pairs; each feature point matching pair in the multiple feature point matching pairs comprises a fourth 2D feature point and a fifth 2D feature point, the fourth 2D feature point and the fifth 2D feature point are mutually matched feature points, the fourth 2D feature point is a first 2D feature point of the current image, and the fifth 2D feature point is a 2D feature point in the M maps; acquiring a 3D point corresponding to each fourth 2D feature point in the plurality of feature point matching pairs according to the corresponding relation between the fifth 2D feature point and the 3D point in the M maps; acquiring a fifth depth map according to the 3D point corresponding to the local map and the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs, wherein the fifth depth map comprises the 3D point which is matched with the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs in the 3D points corresponding to the local map; combining the initial depth map and the fifth depth map with the initial pose information of the current image and of each image in the pre-stored images, using BA to adjust the position information of the 3D points in the optimized depth map, calculating the reprojection error of each 3D point in the optimized point cloud, and deleting the 3D points with the reprojection error exceeding the error threshold; repeating the steps of BA optimization and point cloud deletion for multiple times to finally obtain an optimized high-precision depth map, wherein the optimized high-precision depth map is the second depth map, so that the basic map in the server is updated;
the depth estimation module 104 is configured to perform feature extraction on the current image to obtain T first feature maps, where resolution of each first feature map in the T first feature maps is different; performing feature extraction on the third depth map to obtain T second feature maps, wherein the resolution of each second feature map in the T second feature maps is different, and the third depth map is obtained by splicing the first depth map and the second depth map; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; carrying out up-sampling and fusion processing on the T third feature maps to obtain a target depth map of the current image;
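As a rough illustration of the two-branch, multi-resolution extraction and same-resolution superposition performed by the depth estimation module 104, the following PyTorch-style sketch uses a simple strided-convolution pyramid for each branch and channel-wise concatenation as the superposition; the concrete network architecture and fusion operator are assumptions of the sketch, not details fixed by this embodiment.

```python
import torch
import torch.nn as nn

class TwoBranchPyramid(nn.Module):
    """Extract T feature maps at different resolutions from the current image
    and from the (stitched) depth map, then superpose the maps of equal
    resolution by channel-wise concatenation to form the T third feature maps."""

    def __init__(self, T=4, channels=16):
        super().__init__()
        self.image_stages = nn.ModuleList()
        self.depth_stages = nn.ModuleList()
        in_img, in_dep = 3, 1
        for _ in range(T):
            self.image_stages.append(nn.Conv2d(in_img, channels, 3, stride=2, padding=1))
            self.depth_stages.append(nn.Conv2d(in_dep, channels, 3, stride=2, padding=1))
            in_img = in_dep = channels

    def forward(self, image, depth):
        thirds = []
        x, d = image, depth
        for img_conv, dep_conv in zip(self.image_stages, self.depth_stages):
            x = torch.relu(img_conv(x))   # first feature map (image branch)
            d = torch.relu(dep_conv(d))   # second feature map (depth branch)
            thirds.append(torch.cat([x, d], dim=1))  # same-resolution superposition
        return thirds  # T third feature maps, from the highest to the lowest resolution
```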
it should be noted that, the specific process of performing upsampling and fusion processing on T third feature maps to obtain the target depth map of the current image may refer to the related description of step S302, and will not be described here.
The portrait segmentation module 108 is configured to obtain a portrait segmentation result map from the current image through a portrait segmentation network. The portrait segmentation network comprises a feature extraction network and a softmax classifier. The feature extraction network extracts features of the current image, the features are then bilinearly upsampled to obtain a feature map with the same size as the input image, and finally the label of each pixel is obtained through the softmax classifier, so as to obtain the portrait segmentation result map, also called a portrait mask. For example, the current image is input to BiSeNet to obtain a feature map, and each pixel of the feature map is classified by the softmax classifier, where pixels in the portrait region are set to 255 and pixels in the non-portrait region are set to 0.
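The final classification step of the portrait segmentation network can be sketched as follows; this is a minimal PyTorch-style example that assumes a generic two-class backbone producing raw logits (BiSeNet itself is not reproduced here) and assumes channel 1 is the portrait class.

```python
import torch
import torch.nn.functional as F

def portrait_mask(logits, input_size):
    """Turn two-class segmentation logits of shape (1, 2, h, w) into a
    portrait mask with portrait pixels set to 255 and others to 0."""
    # Bilinear upsampling back to the resolution of the input image.
    logits = F.interpolate(logits, size=input_size, mode="bilinear", align_corners=False)
    # Per-pixel class probabilities and labels via the softmax classifier.
    probs = F.softmax(logits, dim=1)
    labels = probs.argmax(dim=1)             # 1 = portrait, 0 = background (assumed)
    return (labels * 255).to(torch.uint8)    # portrait region 255, non-portrait region 0
```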
The depth map edge optimization module 105 is configured to perform edge structure extraction on the current image and on its target depth map respectively to obtain edge structure information of the current image and edge structure information of the target depth map; and to calculate, with the edge structure information of the current image as the reference, the difference from the edge structure of the target depth map to the edge structure of the current image, and then modify the edge positions of the target depth map according to the difference, so as to optimize the edges of the depth map and obtain an optimized depth map whose edges are sharp and aligned with the edges of the current image;
the multi-view fusion module 106 is configured to perform portrait segmentation on the optimized depth map according to the portrait mask to obtain a foreground depth map and a background depth map, where the background depth map is the part of the optimized depth map covering the non-portrait region and the foreground depth map is the part of the optimized depth map covering the portrait region; fuse L background depth maps according to the L poses respectively corresponding to the L background depth maps to obtain a fused three-dimensional scene, where the L background depth maps comprise the background depth maps of pre-stored images and the background depth map of the current image, the L poses comprise the poses of the pre-stored images and of the current image, and L is an integer greater than 1; back-project the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map; and splice the fused background depth map and the foreground depth map of the current image to obtain an updated depth map;
the virtual-real shielding application module 107 is configured to determine, according to the updated depth map, the relationship between the depth value corresponding to each pixel of the current image and the depth value of the virtual object: if the depth value corresponding to any pixel A of the virtual object is greater than the depth value corresponding to the pixel B at the same position in the current image, the color of the current image is displayed at that position; otherwise, the color of the virtual object is displayed. After all pixels are traversed in this way, the shielding effect is displayed on the terminal device.
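The per-pixel depth comparison just described can be sketched in a few lines of NumPy; the array shapes and the presence of a validity mask for the rendered virtual object are assumptions made for the example.

```python
import numpy as np

def composite(current_rgb, current_depth, virtual_rgb, virtual_depth, virtual_mask):
    """Per-pixel virtual/real occlusion: where the virtual object lies behind
    the real surface, keep the camera image; otherwise draw the virtual object.

    current_rgb/virtual_rgb: (H, W, 3) color images
    current_depth/virtual_depth: (H, W) depth maps
    virtual_mask: (H, W) boolean mask of pixels covered by the virtual object
    """
    out = current_rgb.copy()
    occluded = virtual_depth > current_depth      # virtual pixel is farther away
    draw = virtual_mask & ~occluded               # visible virtual pixels
    out[draw] = virtual_rgb[draw]
    return out
```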
It should be noted here that, in order to improve the accuracy of the target depth map of the current image, for the depth estimation module 104 in fig. 7-10, a depth map acquired by a TOF camera is introduced; the depth estimation module 104 is configured to perform feature extraction on the current image to obtain T first feature maps, and perform feature extraction on the third depth map to obtain T second feature maps; performing feature extraction on the reference depth map to obtain T fourth feature maps, wherein the resolution of each first feature map in the T first feature maps is different, the resolution of each second feature map in the T second feature maps is different, and the resolution of each fourth feature map in the T fourth feature maps is different; the reference depth map is obtained according to a depth map acquired by a time of flight (TOF) camera, and T is an integer greater than 1; superposing the first feature map, the second feature map and the fourth feature map with the same resolution in the T first feature maps, the T second feature maps and the T fourth feature maps to obtain T fifth feature maps; carrying out upsampling and fusion processing on the T fifth feature maps to obtain a target depth map of the current image;
the third depth map may be the first depth map, or may be obtained by stitching the first depth map and the second depth map.
Optionally, the reference depth map is a depth map acquired by a TOF camera.
In order to reduce the power consumption of the terminal equipment, the TOF camera acquires a depth map according to a frame rate lower than a preset frame rate, and the resolution of the depth map is lower than a preset resolution; the terminal device 100 projects the depth map acquired by the TOF camera into a three-dimensional space according to the pose of the current image to obtain a fourth depth map; and projecting the fourth depth map onto a reference image according to the pose of the reference image to obtain the reference depth map, wherein the reference image is an image adjacent to the current image in the acquisition time.
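The projection of the low-frame-rate, low-resolution TOF depth map into the reference view can be sketched as below. The sketch assumes a shared pinhole intrinsic matrix K for both views, 4x4 camera-to-world poses, and nearest-pixel splatting without z-buffering; these are simplifications of the example, not details fixed by the embodiment.

```python
import numpy as np

def reproject_tof_depth(tof_depth, K, T_world_cur, T_world_ref, out_shape):
    """Lift the TOF depth map into 3D using the pose of the current image
    (fourth depth map), then project the 3D points into the reference view
    to obtain the reference depth map."""
    h, w = tof_depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = tof_depth.reshape(-1)
    pix = np.stack([u.reshape(-1) * z, v.reshape(-1) * z, z])   # homogeneous pixels
    cam = np.linalg.inv(K) @ pix                                # back-project to camera space
    world = T_world_cur[:3, :3] @ cam + T_world_cur[:3, 3:4]    # into world space
    T_ref_world = np.linalg.inv(T_world_ref)                    # world -> reference camera
    ref_cam = T_ref_world[:3, :3] @ world + T_ref_world[:3, 3:4]
    uvw = K @ ref_cam
    ref_depth = np.zeros(out_shape, dtype=np.float32)
    valid = (z > 0) & (uvw[2] > 0)
    ru = np.round(uvw[0, valid] / uvw[2, valid]).astype(int)
    rv = np.round(uvw[1, valid] / uvw[2, valid]).astype(int)
    inside = (ru >= 0) & (ru < out_shape[1]) & (rv >= 0) & (rv < out_shape[0])
    ref_depth[rv[inside], ru[inside]] = ref_cam[2, valid][inside]  # later writes overwrite
    return ref_depth
```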
Optionally, the TOF camera may be a camera of the terminal device 100, or may be a camera of another terminal device; after the TOF cameras of the other terminal devices collect the depth maps, the other terminal devices send the collected depth maps to the terminal device 100. In this way, even when the terminal device 100 does not include a TOF camera, a depth map acquired by a TOF camera can still be introduced, thereby improving the accuracy of the target depth map of the current image.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a terminal device provided in the embodiment of the present application. As shown in fig. 11, the terminal device 1100 includes:
an obtaining unit 1101, configured to obtain a current image and a virtual object image, and obtain a first depth map and a second depth map of the current image according to the current image, where the second depth map is obtained from a server;
the estimating unit 1102 is configured to perform feature extraction according to the current image, the first depth map and the second depth map of the current image, and obtain a target depth map of the current image according to a result of the feature extraction.
In a possible embodiment, the acquiring unit 1101 is further configured to acquire a virtual object image;
the terminal device 1100 further includes:
and an overlay display unit 1103, configured to display the virtual object image and the current image in an overlay manner according to the target depth map of the current image.
In a possible embodiment, in terms of acquiring the first depth map of the current image from the current image, the acquisition unit 1101 is specifically configured to:
performing feature extraction on the current image to obtain a first 2D feature point of the current image; matching the first 2D characteristic point of the current image with a pre-stored 2D characteristic point to obtain a second 2D characteristic point in the current image; acquiring a 3D point corresponding to a second 2D feature point in the current image according to the second 2D feature point in the current image and the pre-stored corresponding relation between the 2D feature point and the 3D point; wherein the first depth map of the current image comprises 3D points corresponding to the second 2D feature points in the current image.
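A sparse first depth map built from such 2D-3D correspondences might look like the following sketch; the data layout (index pairs, a 4x4 world-to-camera pose) is an assumption made for illustration only.

```python
import numpy as np

def sparse_first_depth_map(matches, feature_2d, stored_3d, T_cur_world, image_shape):
    """Write, at every matched 2D feature location of the current image, the
    depth of the corresponding pre-stored 3D point in the current camera frame.

    matches: list of (current_idx, stored_idx) index pairs
    feature_2d: (N, 2) pixel coordinates of the current-image feature points
    stored_3d: (M, 3) 3D points associated with the pre-stored 2D feature points
    T_cur_world: 4x4 transform from world coordinates into the current camera
    """
    depth = np.zeros(image_shape, dtype=np.float32)  # 0 marks pixels with no depth
    for cur_idx, stored_idx in matches:
        Xw = np.append(stored_3d[stored_idx], 1.0)
        Xc = T_cur_world @ Xw                        # world -> current camera frame
        u, v = np.round(feature_2d[cur_idx]).astype(int)
        if 0 <= v < image_shape[0] and 0 <= u < image_shape[1] and Xc[2] > 0:
            depth[v, u] = Xc[2]                      # depth along the optical axis
    return depth
```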
In a possible embodiment, in terms of acquiring the first depth map of the current image from the current image, the acquisition unit 1101 is specifically configured to:
performing feature extraction on the current image to obtain a first 2D feature point of the current image; matching the first 2D feature point of the current image with the 2D feature points of the local map to obtain a third 2D feature point in the local map, the local map being obtained by the server according to the current image; and acquiring a 3D point corresponding to the third 2D feature point in the local map according to the third 2D feature point in the local map and the correspondence between the 2D feature points and the 3D points in the local map; wherein the first depth map of the current image comprises the 3D point corresponding to the third 2D feature point in the local map.
In a possible embodiment, in terms of matching the first 2D feature point of the current image according to the 2D feature point of the local map to obtain a third 2D feature point in the local map, the obtaining unit 1101 is specifically configured to:
acquiring a target map from the local map according to the first pose, wherein the position of the target map in the local map is associated with the position indicated by the angle information in the first pose; the first pose is obtained by converting the pose obtained by the terminal equipment according to the current image into a pose under a world coordinate system, and the 2D feature points in the target map are matched with the first 2D feature points of the current image to obtain third 2D feature points of the target map, wherein the third 2D feature points of the local map comprise the third 2D feature points of the target map.
In a possible embodiment, the estimating unit 1102 is specifically configured to:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing feature extraction on the third depth map to obtain T second feature maps; the resolution ratio of each first feature map in the T first feature maps is different, and the resolution ratio of each second feature map in the T second feature maps is different; t is an integer greater than 1; superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams; carrying out up-sampling and fusion processing on the T third feature maps to obtain a target depth map of the current image; the third depth map is the first depth map or is obtained by splicing the first depth map and the second depth map.
In a possible embodiment, the estimating unit 1102 is specifically configured to:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing multi-scale feature extraction on the third depth map to obtain T second feature maps; performing multi-scale feature extraction on the reference depth map to obtain T fourth feature maps, wherein the resolution of each first feature map in the T first feature maps is different, the resolution of each second feature map in the T second feature maps is different, and the resolution of each fourth feature map in the T fourth feature maps is different; the reference depth map is obtained according to a depth map acquired by a TOF camera, and T is an integer greater than 1; superposing the first feature map, the second feature map and the fourth feature map with the same resolution in the T first feature maps, the T second feature maps and the T fourth feature maps to obtain T fifth feature maps; carrying out upsampling and fusion processing on the T fifth feature maps to obtain a target depth map of the current image; the third depth map is obtained by splicing the first depth map and the second depth map, or the third depth map is the first depth map.
In a possible embodiment, the reference depth map is acquired from an image acquired by a TOF camera, and specifically includes:
projecting the depth map acquired by the TOF camera into a three-dimensional space according to the pose of the current image to obtain a fourth depth map; back projecting the fourth depth map onto the reference image according to the pose of the reference image to obtain a reference depth map; the reference image is an image adjacent to the current image in acquisition time; the resolution ratio of the depth map acquired by the TOF camera is lower than the preset resolution ratio, and the frame rate of the TOF camera acquiring the depth map is lower than the preset frame rate.
In a possible embodiment, the upsampling and merging process includes:
S1: upsample a feature map P'_j to obtain a feature map P''_j, where the resolution of the feature map P''_j is the same as the resolution of the (j+1)-th feature map P_{j+1} in the processing object, the width of the feature map P_{j+1} is (j+1) times the width of the feature map with the minimum resolution in the processing object, j is an integer greater than 0 and smaller than T, and T is the number of feature maps in the processing object;

S2: fuse the feature map P''_j with the feature map P_{j+1} to obtain a third feature map P'_{j+1};

S3: let j = j+1, and repeat S1 to S3 until j = T-1;

wherein, when j = 1, the third feature map P'_j is the feature map with the minimum resolution in the processing object, and when j = T-1, the third feature map P'_{j+1} is the result of the upsampling and fusion processing.
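The coarse-to-fine loop of steps S1 to S3 can be expressed compactly as the sketch below. It assumes the T third feature maps are ordered from the lowest to the highest resolution, that they share the same channel count, and that element-wise addition is used as the fusion operator, which the text above leaves unspecified.

```python
import torch
import torch.nn.functional as F

def upsample_and_fuse(feature_maps):
    """Steps S1-S3: starting from the lowest-resolution map, repeatedly
    upsample the running result to the resolution of the next map and fuse
    the two, until the highest-resolution map has been consumed."""
    fused = feature_maps[0]                          # P'_1: lowest-resolution map
    for next_map in feature_maps[1:]:                # P_{j+1}, j = 1 .. T-1
        fused = F.interpolate(fused, size=next_map.shape[-2:],
                              mode="bilinear", align_corners=False)   # S1: upsample
        fused = fused + next_map                                      # S2: fuse
    return fused                                     # P'_T: result of the processing
```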
In a possible embodiment, the estimating unit 1102 is specifically configured to:
and inputting the current image and the third depth map into a depth estimation model of the current image for feature extraction, and obtaining a target depth map of the current image according to a feature extraction result, wherein the depth estimation model is realized on the basis of a convolutional neural network.
In one possible embodiment, the terminal device 1100 further includes:
a sending unit 1104, configured to send a depth estimation model acquisition request to a server, where the depth estimation model acquisition request carries a current image and a location of the terminal device;
a receiving unit 1105, configured to receive a response message sent by the server and corresponding to the depth estimation model obtaining request, where the response message carries a depth estimation model of the current image, and the depth estimation model of the current image is obtained by the server according to the current image and a position of the terminal device in a world coordinate system.
In one possible embodiment, the terminal device 1100 further includes:
a training unit 1106, configured to train the initial convolutional neural network model to obtain a depth estimation model;
wherein, the training unit 1106 is specifically configured to:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps; calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function; adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the current image; wherein the loss function is determined based on an error between the predicted depth map and the true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
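A loss of the kind described, combining a depth term, a gradient term, and a normal-vector term, might be sketched as follows in PyTorch. The L1 distances, the finite-difference normal estimate, and the weights w_grad and w_normal are assumptions of this sketch rather than values taken from the embodiment.

```python
import torch
import torch.nn.functional as F

def spatial_gradient(d):
    """Finite-difference gradients of a depth tensor of shape (B, 1, H, W)."""
    dx = d[..., :, 1:] - d[..., :, :-1]
    dy = d[..., 1:, :] - d[..., :-1, :]
    return dx, dy

def surface_normal(d):
    """Unit surface normals (-dz/dx, -dz/dy, 1) derived from depth gradients."""
    dx, dy = spatial_gradient(d)
    dx = F.pad(dx, (0, 1, 0, 0))     # pad back to (H, W)
    dy = F.pad(dy, (0, 0, 0, 1))
    n = torch.cat([-dx, -dy, torch.ones_like(d)], dim=1)
    return F.normalize(n, dim=1)

def depth_loss(pred, gt, w_grad=1.0, w_normal=1.0):
    """Combine the depth error, the gradient error, and the normal error."""
    l_depth = F.l1_loss(pred, gt)
    pdx, pdy = spatial_gradient(pred)
    gdx, gdy = spatial_gradient(gt)
    l_grad = F.l1_loss(pdx, gdx) + F.l1_loss(pdy, gdy)
    l_normal = (1 - (surface_normal(pred) * surface_normal(gt)).sum(dim=1)).mean()
    return l_depth + w_grad * l_grad + w_normal * l_normal
```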
In one possible embodiment, the overlay display unit 1103 is specifically configured to:
performing edge optimization on a target depth map of a current image to obtain an optimized depth map; and displaying the virtual object image and the current image in an overlapping manner according to the optimized depth map.
In one possible embodiment, the overlay display unit 1103 is specifically configured to:
segmenting the optimized depth map to obtain a foreground depth map and a background depth map of the current image, wherein the background depth map is a depth map containing a background region in the optimized depth map, the foreground depth map is a depth map containing a foreground region in the optimized depth map, and the optimized depth map is obtained by performing edge optimization on a target depth map of the current image; fusing the L background depth maps according to L positions corresponding to the L background depth maps respectively to obtain a fused three-dimensional scene; the L pieces of background depth maps comprise a background depth map of a prestored image and a background depth map of a current image, and the L positions and postures comprise positions and postures of the prestored image and the current image; l is an integer greater than 1; carrying out back projection on the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map; splicing the fused background depth map and the foreground depth map of the current image to obtain an updated depth map; and overlapping and displaying the virtual object image and the current image according to the updated depth map.
The foregoing units (the obtaining unit 1101, the estimating unit 1102, the overlay display unit 1103, the sending unit 1104, the receiving unit 1105, and the training unit 1106) are configured to execute the relevant steps of the foregoing method. For example, the obtaining unit 1101, the estimating unit 1102, the sending unit 1104, and the receiving unit 1105 are configured to execute the relevant content of steps S301 and S302, and the overlay display unit 1103 is configured to execute the relevant content of step S303.
In the present embodiment, the terminal device 1100 is presented in the form of units. A "unit" may refer to an application-specific integrated circuit (ASIC), a processor and a memory that execute one or more software or firmware programs, an integrated logic circuit, and/or another device that can provide the described functionality. Further, the obtaining unit 1101, the estimating unit 1102, the overlay display unit 1103, and the training unit 1106 may be implemented by the processor 1301 of the terminal device shown in fig. 13.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 12, the server 1200 includes:
a receiving unit 1201, configured to receive a depth estimation model acquisition request sent by a terminal device, where the depth estimation model acquisition request carries a current image acquired by the terminal device and a position of the terminal device;
an obtaining unit 1202, configured to obtain a depth estimation model of a current image from a plurality of depth estimation models stored in a server according to a position of the current image;
a sending unit 1203, configured to send a response message in response to the depth estimation model obtaining request to the terminal device, where the response message carries the depth estimation model of the current image.
In a possible embodiment, the obtaining unit 1202 is specifically configured to:
acquiring a plurality of frames of first images according to the position of the terminal equipment, wherein the plurality of frames of first images are images in a preset range taking the position of the terminal equipment as the center in a basic map, and acquiring a target image from the plurality of frames of first images, wherein the target image is the image with the highest similarity with the current image in the plurality of frames of first images; and determining the depth estimation model corresponding to the target image as the depth estimation model of the current image.
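Selecting the depth estimation model of the most similar first image could be done with a simple descriptor comparison such as the following sketch; the use of a global image descriptor and cosine similarity is an assumption, since the embodiment does not prescribe a particular similarity measure.

```python
import numpy as np

def select_depth_model(current_desc, candidates):
    """Return the depth estimation model bound to the most similar first image.

    current_desc: global descriptor of the current image (1-D array)
    candidates: list of (descriptor, model) pairs for the first images located
                within the preset range around the position of the terminal device
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best_desc, best_model = max(candidates, key=lambda c: cosine(current_desc, c[0]))
    return best_model
```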
In one possible embodiment, the server 1200 further includes:
a training unit 1204, configured to train, for the plurality of frames of first images, a depth estimation model of each frame of first image in the plurality of frames of first images,
wherein each frame of first image in the plurality of frames of first images is trained according to the following steps to obtain the depth estimation model of that first image:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps; calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function; adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the first image of each frame;
wherein the loss function is determined based on an error between the predicted depth map and the true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
In a possible embodiment, the obtaining unit 1202 is further configured to obtain an initial depth map of the current image according to the current image and a pre-stored image; obtaining a fifth depth map according to the current image and the 3D point corresponding to the local map;
the server 1200 further includes:
and an optimizing unit 1205, configured to optimize the initial depth map and the fifth depth map according to the pose of the current image to obtain a second depth map.
In a possible embodiment, in terms of obtaining an initial depth map of the current image from the current image and the pre-stored image, the obtaining unit 1202 is specifically configured to:
matching the first 2D characteristic point of the current image with a pre-stored 2D characteristic point of the image to obtain a sixth 2D characteristic point of the current image; removing noise points in the sixth 2D characteristic point of the current image to obtain a seventh 2D characteristic point of the current image; performing triangularization calculation on each 2D feature point in the seventh 2D feature point of the current image to obtain an initial 3D point of the seventh 2D feature point of the current image in space; the initial depth map of the current image includes an initial 3D point in space of a seventh 2D feature point of the current image.
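The triangularization of the matched feature points can be sketched with OpenCV as shown below; the 3x4 [R|t] pose convention and the use of cv2.triangulatePoints with a single intrinsic matrix K are assumptions made for the example.

```python
import cv2
import numpy as np

def triangulate_initial_points(K, pose_cur, pose_prev, pts_cur, pts_prev):
    """Triangulate matched 2D feature points of the current image and a
    pre-stored image into initial 3D points in space.

    pose_cur / pose_prev: 3x4 [R | t] world-to-camera matrices
    pts_cur / pts_prev:   (N, 2) matched pixel coordinates
    """
    P_cur = K @ pose_cur
    P_prev = K @ pose_prev
    pts_h = cv2.triangulatePoints(P_prev, P_cur,
                                  pts_prev.T.astype(np.float64),
                                  pts_cur.T.astype(np.float64))
    return (pts_h[:3] / pts_h[3]).T    # de-homogenise to (N, 3) initial 3D points
```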
In a possible embodiment, in terms of obtaining the fifth depth map according to the 3D points corresponding to the current image and the local map, the obtaining unit 1202 is specifically configured to:
acquiring M maps from a multi-frame basic map, wherein the similarity between each map in the M maps and a current image is greater than a first preset threshold; m is an integer greater than 0; matching the 2D characteristic points of the M maps with the first 2D characteristic point of the current image to obtain a plurality of characteristic point matching pairs; each feature point matching pair in the multiple feature point matching pairs comprises a fourth 2D feature point and a fifth 2D feature point, the fourth 2D feature point and the fifth 2D feature point are mutually matched feature points, the fourth 2D feature point is a first 2D feature point of the current image, and the fifth 2D feature point is a 2D feature point in the M maps; acquiring a 3D point corresponding to each fourth 2D feature point in a plurality of feature point matching pairs according to the corresponding relation between each fifth 2D feature point and the 3D point in the M maps; and acquiring a fifth depth map according to the 3D point corresponding to the local map and the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs, wherein the fifth depth map comprises the 3D point which is matched with the 3D point corresponding to the fourth 2D feature point in the plurality of feature point matching pairs in the 3D points corresponding to the local map.
The foregoing units (the receiving unit 1201, the obtaining unit 1202, the sending unit 1203, the training unit 1204, and the optimizing unit 1205) are configured to execute the relevant steps of the foregoing method. For example, the receiving unit 1201 is configured to execute the relevant content of step S501, the obtaining unit 1202, the training unit 1204, and the optimizing unit 1205 are configured to execute the relevant content of step S502, and the sending unit 1203 is configured to execute the relevant content of step S503.
In the present embodiment, the server 1200 is presented in the form of units. A "unit" may refer to an application-specific integrated circuit (ASIC), a processor and a memory that execute one or more software or firmware programs, an integrated logic circuit, and/or another device that can provide the described functionality. Further, the obtaining unit 1202, the training unit 1204, and the optimizing unit 1205 may be implemented by the processor 1401 of the server shown in fig. 14.
The terminal device 1300 may be implemented with the structure shown in fig. 13. The terminal device 1300 includes at least one processor 1301, at least one memory 1302, at least one communication interface 1303 and at least one display 1304. The processor 1301, the memory 1302, the display 1304 and the communication interface 1303 are connected through a communication bus and communicate with one another.
The processor 1301 may be a general purpose Central Processing Unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to control the execution of programs according to the above schemes.
The communication interface 1303 is used for communicating with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 1302 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including a compact disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, or the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
The memory 1302 is used for storing application program codes for executing the above schemes, and the processor 1301 controls the execution. The processor 1301 is configured to execute application program code stored in the memory 1302.
The memory 1302 stores code that performs one of the image processing methods provided above, such as: acquiring a current image and a virtual object image, and acquiring a first depth map and a second depth map of the current image according to the current image, wherein the second depth map is acquired from a server; performing feature extraction according to the current image, the first depth map and the second depth map of the current image, and obtaining a target depth map of the current image according to a feature extraction result; and displaying the virtual object image and the current image in an overlapping manner according to the target depth map of the current image.
The server 1400 may be implemented with the structure shown in fig. 14. The server 1400 includes at least one processor 1401, at least one memory 1402, and at least one communication interface 1403. The processor 1401, the memory 1402 and the communication interface 1403 are connected through a communication bus and communicate with one another.
Processor 1401 may be a general purpose Central Processing Unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the above schemes.
The communication interface 1403 is used for communicating with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 1402 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including a compact disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, or the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
The memory 1402 is used for storing application program codes for executing the above schemes, and is controlled by the processor 1401. The processor 1401 is configured to execute application program code stored in the memory 1402.
The code stored in the memory 1402 may perform one of the image processing methods provided above, such as: receiving a depth estimation model acquisition request sent by terminal equipment, wherein the depth estimation model acquisition request carries a current image acquired by the terminal equipment and the position of the terminal equipment; acquiring a depth estimation model of the current image from a plurality of depth estimation models stored in a server according to the position of the current image; and sending a response message responding to the depth estimation model acquisition request to the terminal equipment, wherein the response message carries the depth estimation model of the current image.
The present application further provides a computer storage medium, wherein the computer storage medium may store a program, and the program includes some or all of the steps of any one of the image processing methods described in the above method embodiments when executed.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solutions of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and other media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (40)

1. A method of processing an image, the method comprising:
acquiring a current image, and acquiring a first depth map and a second depth map according to the current image, wherein the second depth map is acquired from a server;
and extracting features according to the current image, the first depth map and the second depth map, and obtaining a target depth map of the current image according to the result of feature extraction.
2. The method of claim 1, further comprising:
acquiring a virtual object image;
and displaying the virtual object image and the current image in an overlapping manner according to the target depth map of the current image.
3. The method according to claim 1 or 2, wherein the obtaining the first depth map from the current image comprises:
performing feature extraction on the current image to obtain a first 2D feature point of the current image;
matching the first 2D characteristic point of the current image with a pre-stored 2D characteristic point to obtain a second 2D characteristic point in the current image;
acquiring a 3D point corresponding to a second 2D feature point in the current image according to the second 2D feature point in the current image and the corresponding relation between the pre-stored 2D feature point and the 3D point;
wherein the first depth map comprises 3D points corresponding to a second 2D feature point in the current image.
4. The method according to claim 1 or 2, wherein the obtaining a first depth map of the current image from the current image comprises:
performing feature extraction on the current image to obtain a first 2D feature point of the current image;
matching the first 2D feature point of the current image according to the 2D feature point of the local map acquired from the server to obtain a third 2D feature point in the local map;
acquiring a 3D point corresponding to the third 2D feature point in the local map according to the third 2D feature point in the local map and a correspondence between 2D feature points and 3D points in the local map;
wherein the first depth map of the current image comprises 3D points corresponding to a third 2D feature point in the local map.
5. The method according to claim 4, wherein the matching the first 2D feature point of the current image according to the 2D feature point of the local map to obtain a third 2D feature point in the local map comprises:
acquiring a target map from the local map according to the first pose, wherein the position of the target map in the local map is associated with the position indicated by the angle information in the first pose; the first pose is a pose obtained by the terminal equipment according to the current image and converted into a pose in a world coordinate system,
and matching the 2D feature points in the target map with the first 2D feature points of the current image to obtain third 2D feature points of the target map, wherein the third 2D feature points of the local map comprise the third 2D feature points of the target map.
6. The method according to any one of claims 1 to 5, wherein the extracting features from the current image, the first depth map and the second depth map, and obtaining the target depth map of the current image according to the result of feature extraction comprises:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing multi-scale feature extraction on the third depth map to obtain T second feature maps; the resolution of each first feature map in the T first feature maps is different, and the resolution of each second feature map in the T second feature maps is different; t is an integer greater than 1;
superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams;
carrying out upsampling and fusion processing on the T third feature maps to obtain a target depth map of the current image; wherein the third depth map is obtained by stitching the first depth map and the second depth map.
7. The method according to any one of claims 1 to 5, wherein the extracting features from the current image, the first depth map and the second depth map, and obtaining the target depth map of the current image according to the result of feature extraction comprises:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing multi-scale feature extraction on the third depth map to obtain T second feature maps; performing multi-scale feature extraction on the reference depth map to obtain T fourth feature maps, wherein the resolution of each first feature map in the T first feature maps is different, the resolution of each second feature map in the T second feature maps is different, and the resolution of each fourth feature map in the T fourth feature maps is different; the reference depth map is obtained according to a depth map collected by a time of flight (TOF) camera, and T is an integer greater than 1;
superposing the first feature map, the second feature map and the fourth feature map with the same resolution in the T first feature maps, the T second feature maps and the T fourth feature maps to obtain T fifth feature maps;
performing upsampling and fusion processing on the T fifth feature maps to obtain a target depth map of the current image; wherein the third depth map is obtained by stitching the first depth map and the second depth map.
8. The method according to claim 7, wherein the reference depth map is obtained from images acquired by a time of flight TOF camera, in particular comprising:
projecting the depth map acquired by the TOF camera into a three-dimensional space according to the pose of the current image to obtain a fourth depth map;
back projecting the fourth depth map onto the reference image according to the pose of the reference image to obtain the reference depth map; the reference image is an image adjacent to the current image in acquisition time;
and the resolution ratio of the depth map acquired by the TOF camera is lower than the preset resolution ratio, and the frame rate of the TOF camera acquiring the depth map is lower than the preset frame rate.
9. The method according to any of claims 6-8, wherein the upsampling and fusion process comprises:
upsampling a feature map P'_j to obtain a feature map P''_j, wherein the resolution of the feature map P''_j is the same as the resolution of a (j+1)-th feature map P_{j+1} in a processing object; the width of the (j+1)-th feature map is (j+1) times the width of the feature map with the minimum resolution in the processing object, and j is greater than or equal to 1 and less than or equal to T-1;

fusing the feature map P''_j with the feature map P_{j+1} to obtain a feature map P'_{j+1};

letting j = j+1, and repeating the foregoing steps until j = T-1, wherein T is the number of feature maps in the processing object;

wherein, when j = 1, the feature map P'_j is the feature map with the minimum resolution in the processing object, and when j = T-1, the feature map P'_{j+1} is the result of the upsampling and fusion processing.
10. The method according to any one of claims 6 to 9, wherein the extracting features from the current image, the first depth map and the second depth map, and obtaining the target depth map of the current image according to the result of feature extraction comprises:
inputting the current image and the third depth map into a depth estimation model of the current image for feature extraction, and obtaining a target depth map of the current image according to a feature extraction result;
wherein the depth estimation model is implemented based on a convolutional neural network.
11. The method of claim 10, further comprising:
sending a depth estimation model acquisition request to the server, wherein the depth estimation model acquisition request carries the current image and the position of the terminal equipment;
and receiving a response message which is sent by the server and responds to the depth estimation model acquisition request, wherein the response message carries the depth estimation model of the current image, and the depth estimation model of the current image is acquired by the server according to the current image and the position of the terminal equipment in a world coordinate system.
12. The method of claim 10, further comprising:
training an initial convolutional neural network model to obtain the depth estimation model;
wherein the training the initial convolutional neural network to obtain the depth estimation model comprises:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into the initial convolutional neural network for processing to obtain a plurality of predicted depth maps;
calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function;
adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the current image;
wherein the loss function is determined based on an error between a predicted depth map and a true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
13. The method of any one of claims 2-12, wherein said displaying a virtual object and said current image superimposed according to a target depth map of said current image comprises:
segmenting the optimized depth map to obtain a foreground depth map and a background depth map of the current image, wherein the background depth map is a depth map containing a background area in the optimized depth map, and the foreground depth map is a depth map containing a foreground area in the optimized depth map; the optimized depth map is obtained by performing edge optimization on the target depth map of the current image;
fusing the L background depth maps according to L positions corresponding to the L background depth maps respectively to obtain a fused three-dimensional scene; the L pieces of background depth maps comprise a background depth map of a pre-stored image and a background depth map of a current image, and the L positions and postures comprise the positions and postures of the pre-stored image and the current image; l is an integer greater than 1;
carrying out back projection on the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map;
splicing the fused background depth map and the foreground depth map of the current image to obtain an updated depth map;
and displaying the virtual object and the current image in an overlapping manner according to the updated depth map.
14. A method of processing an image, comprising:
receiving a depth estimation model request message sent by a terminal device, wherein the request message carries a current image acquired by the terminal device and the position of the terminal device;
acquiring a depth estimation model of the current image from a plurality of depth estimation models stored in a server according to the current image and the position of the terminal equipment;
and sending a response message responding to the depth estimation model request message to the terminal equipment, wherein the response message carries the depth estimation model of the current image.
15. The method of claim 14, wherein obtaining the depth estimation model for the current image from a plurality of depth estimation models stored in a server according to the current image and the location of the terminal device comprises:
acquiring multiple frames of first images according to the position of the terminal equipment, wherein the multiple frames of first images are images in a preset range taking the position of the terminal equipment as the center in a basic map;
acquiring a target image from the multiple frames of first images, wherein the target image is the image with the highest similarity between the multiple frames of first images and the current image;
and determining the depth estimation model corresponding to the target image as the depth estimation model of the current image.
16. The method of claim 15, further comprising:
respectively training the plurality of frames of first images to obtain a depth estimation model of each frame of first image in the plurality of frames of first images,
training each frame of first image in the multiple frames of first images according to the following steps to obtain a depth estimation model of each first image:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps;
calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function;
adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the first image of each frame;
wherein the loss function is determined based on an error between a predicted depth map and a true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
17. The method according to any one of claims 14-16, further comprising:
acquiring an initial depth map of the current image according to the current image and a pre-stored image;
obtaining a fifth depth map according to the current image and the 3D point corresponding to the local map;
and optimizing the initial depth map and the fifth depth map according to the pose of the current image to obtain a second depth map.
18. The method according to claim 17, wherein the obtaining a fifth depth map according to the current image and the 3D points corresponding to the local map comprises:
Acquiring M maps from a multi-frame basic map, wherein the similarity between each map in the M maps and the current image is greater than a first preset threshold value; m is an integer greater than 0;
matching the 2D characteristic points of the M maps with the first 2D characteristic point of the current image to obtain a plurality of characteristic point matching pairs; each of the plurality of feature point matching pairs comprises a fourth 2D feature point and a fifth 2D feature point, the fourth 2D feature point and the fifth 2D feature point are mutually matched feature points, the fourth 2D feature point is a first 2D feature point of the current image, and the fifth 2D feature point is a 2D feature point in the M maps;
acquiring a 3D point corresponding to each fourth 2D feature point in the multiple feature point matching pairs according to the corresponding relation between each fifth 2D feature point and the 3D point in the M maps;
and acquiring the fifth depth map according to the 3D points corresponding to the local map and the 3D points corresponding to the fourth 2D feature points in the plurality of feature point matching pairs, wherein the fifth depth map comprises the 3D points which are matched with the 3D points corresponding to the fourth 2D feature points in the plurality of feature point matching pairs in the 3D points corresponding to the local map.
19. A terminal device, comprising:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a current image and a virtual object image and acquiring a first depth map and a second depth map of the current image according to the current image, and the second depth map is acquired from a server;
and the estimation unit is used for extracting features according to the current image, the first depth map and the second depth map and obtaining a target depth map of the current image according to a feature extraction result.
20. The terminal device of claim 19,
the acquisition unit is also used for acquiring a virtual object image;
the terminal equipment also comprises
And the superposition display unit is used for displaying the virtual object image and the current image in a superposition manner according to the target depth map of the current image.
21. The terminal device according to claim 19 or 20, wherein, in the aspect of obtaining the first depth map of the current image from the current image, the obtaining unit is specifically configured to:
performing feature extraction on the current image to obtain a first 2D feature point of the current image;
matching the first 2D characteristic point of the current image with a pre-stored 2D characteristic point to obtain a second 2D characteristic point in the current image,
acquiring a 3D point corresponding to the second 2D feature point in the current image according to the second 2D feature point in the current image and the corresponding relation between the pre-stored 2D feature point and the 3D point;
wherein the first depth map of the current image comprises 3D points corresponding to the second 2D feature points in the current image.
22. The terminal device according to claim 19, wherein, in the aspect of acquiring the first depth map of the current image from the current image, the acquiring unit is specifically configured to:
performing feature extraction on the current image to obtain a first 2D feature point of the current image;
matching the first 2D feature point of the current image according to the 2D feature point of the local map acquired from the server to obtain a third 2D feature point in the local map;
acquiring a 3D point corresponding to the third 2D feature point in the local map according to the third 2D feature point in the local map and a correspondence between 2D feature points and 3D points in the local map;
wherein the first depth map of the current image comprises 3D points corresponding to a third 2D feature point in the local map.
23. The terminal device of claim 22, wherein in terms of obtaining a third 2D feature point in the local map by matching the first 2D feature point of the current image with the 2D feature point of the local map, the obtaining unit is specifically configured to:
acquiring a target map from the local map according to the first pose, wherein the position of the target map in the local map is associated with the position indicated by the angle information in the first pose; the first pose is a pose obtained by the terminal equipment according to the current image and converted into a pose in a world coordinate system,
and matching the 2D feature points in the target map with the first 2D feature points of the current image to obtain third 2D feature points of the target map, wherein the third 2D feature points of the local map comprise the third 2D feature points of the target map.
24. The terminal device according to any of claims 19-23, wherein the estimating unit is specifically configured to:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing multi-scale feature extraction on the third depth map to obtain T second feature maps; the resolution of each first feature map in the T first feature maps is different, and the resolution of each second feature map in the T second feature maps is different; t is an integer greater than 1;
superposing the first characteristic diagram and the second characteristic diagram with the same resolution in the T first characteristic diagrams and the T second characteristic diagrams to obtain T third characteristic diagrams;
carrying out upsampling and fusion processing on the T third feature maps to obtain a target depth map of the current image; wherein the third depth map is obtained by stitching the first depth map and the second depth map.
25. The terminal device according to any of claims 19-23, wherein the estimating unit is specifically configured to:
performing multi-scale feature extraction on the current image to obtain T first feature maps, and performing multi-scale feature extraction on the third depth map to obtain T second feature maps; performing feature extraction on the reference depth map to obtain T fourth feature maps, wherein the resolution of each first feature map in the T first feature maps is different, the resolution of each second feature map in the T second feature maps is different, and the resolution of each fourth feature map in the T fourth feature maps is different; the reference depth map is obtained according to a depth map collected by a time of flight (TOF) camera, and T is an integer greater than 1;
superposing the first feature map, the second feature map and the fourth feature map with the same resolution in the T first feature maps, the T second feature maps and the T fourth feature maps to obtain T fifth feature maps;
carrying out upsampling and fusion processing on the T fifth feature maps to obtain a target depth map of the current image; wherein the third depth map is obtained by stitching the first depth map and the second depth map.
26. The terminal device according to claim 25, wherein the reference depth map is obtained according to the depth map acquired by the time-of-flight (TOF) camera, and the obtaining specifically comprises:
projecting the depth map acquired by the TOF camera into a three-dimensional space according to the pose of the current image to obtain a fourth depth map;
back projecting the fourth depth map onto the reference image according to the pose of the reference image to obtain the reference depth map; the reference image is an image adjacent to the current image in acquisition time;
wherein the resolution of the depth map acquired by the TOF camera is lower than a preset resolution, and the frame rate at which the TOF camera acquires the depth map is lower than a preset frame rate.
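The projection and back-projection in claim 26 can be illustrated with the NumPy sketch below, which assumes a pinhole camera with intrinsics K and 4x4 camera-to-world pose matrices; these conventions are assumptions of the sketch, not limitations of the claim.

```python
import numpy as np

def reproject_tof_depth(tof_depth, K, pose_cur, pose_ref, out_hw):
    """Lift the TOF depth map into 3D with the current-image pose, then
    back-project it into the reference view to obtain the reference depth map."""
    h, w = tof_depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = tof_depth.ravel()
    valid = z > 0
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])[:, valid]
    cam_pts = np.linalg.inv(K) @ pix * z[valid]                  # current camera frame
    world = pose_cur @ np.vstack([cam_pts, np.ones(cam_pts.shape[1])])

    ref_cam = np.linalg.inv(pose_ref) @ world                    # reference camera frame
    proj = K @ ref_cam[:3]
    z_ref = proj[2]
    uv = (proj[:2] / z_ref).round().astype(int)

    ref_depth = np.zeros(out_hw)
    inside = (uv[0] >= 0) & (uv[0] < out_hw[1]) & (uv[1] >= 0) & (uv[1] < out_hw[0]) & (z_ref > 0)
    ref_depth[uv[1, inside], uv[0, inside]] = z_ref[inside]      # colliding pixels are not z-buffered here
    return ref_depth
```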
27. The terminal device according to any of claims 24-26, wherein the upsampling and merging process comprises:
upsampling a feature map P'_j to obtain a feature map P''_j, wherein the feature map P''_j has the same resolution as the (j+1)-th feature map P_(j+1) in the processing object; the width of the (j+1)-th feature map is (j+1) times the width of the feature map with the minimum resolution in the processing object, and j is greater than or equal to 1 and less than or equal to T-1;
fusing the feature map P''_j with the feature map P_(j+1) to obtain a feature map P'_(j+1);
setting j = j+1 and repeating the above steps until j = T-1, wherein T is the number of feature maps in the processing object;
wherein, when j = 1, the feature map P'_j is the feature map with the minimum resolution in the processing object, and when j = T-1, the feature map P'_(j+1) is the result of the upsampling and fusion processing.
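Read procedurally, the loop in claim 27 amounts to the following few lines of PyTorch; bilinear interpolation and element-wise addition are assumed as the concrete upsampling and fusion operators, and all feature maps are assumed to share a channel count so that the addition is well defined.

```python
import torch
import torch.nn.functional as F

def upsample_and_fuse(feature_maps):
    """feature_maps: list P_1..P_T ordered from lowest to highest resolution.
    Implements: P''_j = upsample(P'_j) to the size of P_(j+1),
                P'_(j+1) = fuse(P''_j, P_(j+1)), for j = 1..T-1."""
    fused = feature_maps[0]                         # P'_1: lowest-resolution map
    for nxt in feature_maps[1:]:
        up = F.interpolate(fused, size=nxt.shape[-2:], mode="bilinear",
                           align_corners=False)     # P''_j
        fused = up + nxt                            # fusion; addition assumed
    return fused                                    # result of the upsampling and fusion
```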
28. The terminal device according to any of claims 24-27, wherein the estimating unit is specifically configured to:
inputting the current image and the third depth map into a depth estimation model of the current image for feature extraction, and obtaining a target depth map of the current image according to a feature extraction result;
wherein the depth estimation model is implemented based on a convolutional neural network.
29. The terminal device of claim 28, wherein the terminal device further comprises:
a sending unit, configured to send a depth estimation model acquisition request to the server, where the depth estimation model acquisition request carries the current image and the location of the terminal device;
a receiving unit, configured to receive a response message sent by the server and corresponding to the depth estimation model acquisition request, where the response message carries the depth estimation model of the current image, and the depth estimation model of the current image is acquired by the server according to the current image and a position of the terminal device in a world coordinate system.
30. The terminal device of claim 28, wherein the terminal device further comprises:
the training unit is used for training an initial convolutional neural network model to obtain the depth estimation model;
wherein the training unit is specifically configured to:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps;
calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function;
adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the current image;
wherein the loss function is determined based on an error between a predicted depth map and a true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
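A minimal sketch of a loss with the three error terms named in claim 30 (and claim 34) is given below in PyTorch; the L1 distance, the finite-difference normals and the unit weights are assumptions, since the claim only fixes which errors enter the loss.

```python
import torch
import torch.nn.functional as F

def depth_gradients(d):
    """Finite-difference gradients of a depth map batch shaped (B, 1, H, W)."""
    gx = d[..., :, 1:] - d[..., :, :-1]
    gy = d[..., 1:, :] - d[..., :-1, :]
    return gx, gy

def normals_from_depth(d):
    """Pseudo surface normals built from the depth gradients."""
    gx, gy = depth_gradients(d)
    gx, gy = gx[..., :-1, :], gy[..., :, :-1]        # crop to a common size
    n = torch.cat([-gx, -gy, torch.ones_like(gx)], dim=1)
    return F.normalize(n, dim=1)

def depth_loss(pred, gt, w_grad=1.0, w_normal=1.0):
    loss_d = F.l1_loss(pred, gt)                                   # depth error
    pgx, pgy = depth_gradients(pred)
    ggx, ggy = depth_gradients(gt)
    loss_g = F.l1_loss(pgx, ggx) + F.l1_loss(pgy, ggy)             # gradient error
    loss_n = (1 - (normals_from_depth(pred) *
                   normals_from_depth(gt)).sum(dim=1)).mean()      # normal error
    return loss_d + w_grad * loss_g + w_normal * loss_n
```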
31. The terminal device according to any of claims 19-29, wherein the overlay display unit is specifically configured to:
segmenting the optimized depth map to obtain a foreground depth map and a background depth map of the current image, wherein the background depth map is a depth map of a background area contained in the optimized depth map of the current image, and the foreground depth map is a depth map of a foreground area contained in the optimized depth map of the current image; the optimized depth map is obtained by performing edge optimization on the target depth map of the current image;
fusing L background depth maps according to L poses respectively corresponding to the L background depth maps to obtain a fused three-dimensional scene; the L background depth maps comprise a background depth map of a pre-stored image and the background depth map of the current image, and the L poses comprise the pose of the pre-stored image and the pose of the current image; L is an integer greater than 1;
carrying out back projection on the fused three-dimensional scene according to the pose of the current image to obtain a fused background depth map;
splicing the fused background depth map and the foreground depth map of the current image to obtain an updated depth map;
and displaying the virtual object and the current image in an overlapping manner according to the updated depth map.
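As an illustration of the final overlay step, a simple per-pixel occlusion composite consistent with the updated depth map of claim 31 could look as follows; the RGBA rendering of the virtual object and the uint8 image format are assumptions of this sketch.

```python
import numpy as np

def composite_with_occlusion(image, virtual_rgba, virtual_depth, scene_depth):
    """Overlay the rendered virtual object on the current image, hiding pixels
    where the real scene (updated depth map) is closer to the camera."""
    alpha = virtual_rgba[..., 3:4] / 255.0
    visible = (virtual_depth < scene_depth) & (virtual_depth > 0)
    alpha = alpha * visible[..., None]
    return (alpha * virtual_rgba[..., :3] + (1 - alpha) * image).astype(np.uint8)
```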
32. A server, comprising:
a receiving unit, configured to receive a depth estimation model request message sent by a terminal device, wherein the request message carries a current image acquired by the terminal device and the position of the terminal device;
an obtaining unit, configured to obtain a depth estimation model of a current image from a plurality of depth estimation models stored in a server according to the current image and a location of the terminal device;
a sending unit, configured to send, to the terminal device, a response message in response to the depth estimation model request message, where the response message carries the depth estimation model of the current image.
33. The server according to claim 32, wherein the obtaining unit is specifically configured to:
acquiring multiple frames of first images according to the position of the terminal device, wherein the multiple frames of first images are images in a preset range, in a basic map, centered on the position of the terminal device;
acquiring a target image from the multiple frames of first images, wherein the target image is the image, among the multiple frames of first images, with the highest similarity to the current image;
and determining the depth estimation model corresponding to the target image as the depth estimation model of the current image.
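Claim 33's two-stage lookup (spatial gating by the terminal position, then image similarity) might be prototyped as in the sketch below; the search radius and the cosine similarity over global image descriptors are placeholders for whatever retrieval metric the server actually uses.

```python
import numpy as np

def select_depth_model(query_desc, device_xy, base_map, radius=50.0):
    """base_map: list of dicts with NumPy arrays 'xy' and 'desc' (global image
    descriptor) plus 'model' (the per-image depth estimation model).  Returns
    the model of the first image most similar to the current image within the radius."""
    candidates = [e for e in base_map
                  if np.linalg.norm(e["xy"] - device_xy) < radius]
    if not candidates:
        return None
    sims = [float(np.dot(e["desc"], query_desc) /
                  (np.linalg.norm(e["desc"]) * np.linalg.norm(query_desc) + 1e-9))
            for e in candidates]
    return candidates[int(np.argmax(sims))]["model"]
```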
34. The server of claim 33, further comprising:
a training unit, configured to respectively perform training for the multiple frames of first images to obtain a depth estimation model of each frame of first image in the multiple frames of first images,
wherein the depth estimation model of each frame of first image in the multiple frames of first images is obtained by training according to the following steps:
inputting a plurality of image samples and a plurality of depth map samples corresponding to the image samples into an initial convolutional neural network for processing to obtain a plurality of predicted depth maps;
calculating to obtain a loss value according to the multiple predicted depth maps, the real depth maps corresponding to the multiple image samples and a loss function;
adjusting parameters in the initial convolutional neural network according to the loss value to obtain a depth estimation model of the first image of each frame;
wherein the loss function is determined based on an error between a predicted depth map and a true depth map, an error between a gradient of the predicted depth map and a gradient of the true depth map, and an error between a normal vector of the predicted depth map and a normal vector of the true depth map.
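A minimal training loop matching the steps of claim 34 is sketched below; it reuses the depth_loss function from the sketch following claim 30, and the Adam optimizer, learning rate and epoch count are assumptions.

```python
import torch

def train_depth_model(model, loader, epochs=10, lr=1e-4):
    """loader yields (image_sample, depth_map_sample, gt_depth) batches; the
    predicted depth is compared against the ground truth with depth_loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, depth_sample, gt in loader:
            pred = model(image, depth_sample)      # predicted depth map
            loss = depth_loss(pred, gt)            # depth + gradient + normal errors
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```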
35. The server according to any one of claims 32-34,
the obtaining unit is further configured to acquire an initial depth map of the current image according to the current image and a pre-stored image, and acquire a fifth depth map according to the current image and the 3D points corresponding to the local map;
the server further comprises:
and the optimization unit is used for optimizing the initial depth map and the fifth depth map according to the pose of the current image to obtain a second depth map.
36. The server according to claim 35, wherein, in terms of acquiring the fifth depth map according to the current image and the 3D points corresponding to the local map, the obtaining unit is specifically configured to:
acquiring M maps from a multi-frame basic map, wherein the similarity between each map in the M maps and the current image is greater than a first preset threshold value; m is an integer greater than 0;
matching the 2D characteristic points of the M maps with the first 2D characteristic point of the current image to obtain a plurality of characteristic point matching pairs; each of the plurality of feature point matching pairs comprises a fourth 2D feature point and a fifth 2D feature point, the fourth 2D feature point and the fifth 2D feature point are mutually matched feature points, the fourth 2D feature point is a first 2D feature point of the current image, and the fifth 2D feature point is a 2D feature point in the M maps;
acquiring a 3D point corresponding to each fourth 2D feature point in the multiple feature point matching pairs according to the corresponding relation between each fifth 2D feature point and the 3D point in the M maps;
and acquiring the fifth depth map according to the 3D points corresponding to the local map and the 3D points corresponding to the fourth 2D feature points in the plurality of feature point matching pairs, wherein the fifth depth map comprises the 3D points which are matched with the 3D points corresponding to the fourth 2D feature points in the plurality of feature point matching pairs in the 3D points corresponding to the local map.
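To make the construction of the fifth depth map more concrete, the sketch below projects the 3D points recovered for the matched fourth 2D feature points into the current view and writes their depths into an otherwise empty depth image; the pinhole model and the camera-to-world pose convention are assumptions.

```python
import numpy as np

def sparse_depth_from_matches(points_3d, K, pose_cur, hw):
    """points_3d: (N, 3) world-frame points associated with the matched fourth
    2D feature points.  Projects them with the current-image pose (4x4
    camera-to-world) and writes their depths into an empty depth map of size hw."""
    n = points_3d.shape[0]
    cam = np.linalg.inv(pose_cur) @ np.vstack([points_3d.T, np.ones(n)])
    z = cam[2]
    uv = (K @ cam[:3])[:2] / z
    u, v = uv.round().astype(int)
    depth = np.zeros(hw)
    ok = (z > 0) & (u >= 0) & (u < hw[1]) & (v >= 0) & (v < hw[0])
    depth[v[ok], u[ok]] = z[ok]
    return depth
```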
37. A terminal device, comprising a memory and one or more processors, wherein the memory stores one or more programs, and the one or more processors, when executing the one or more programs, cause the terminal device to implement the method of any of claims 1-13.
38. A server, comprising a memory and one or more processors, wherein the memory stores one or more programs, and the one or more processors, when executing the one or more programs, cause the server to implement the method of any of claims 14-18.
39. A computer storage medium comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-18.
40. A computer program product, which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 18.
CN202010950951.0A 2020-09-10 2020-09-10 Image processing method and related equipment Pending CN114170290A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010950951.0A CN114170290A (en) 2020-09-10 2020-09-10 Image processing method and related equipment
PCT/CN2021/113635 WO2022052782A1 (en) 2020-09-10 2021-08-19 Image processing method and related device
CN202180062229.6A CN116097307A (en) 2020-09-10 2021-08-19 Image processing method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010950951.0A CN114170290A (en) 2020-09-10 2020-09-10 Image processing method and related equipment

Publications (1)

Publication Number Publication Date
CN114170290A true CN114170290A (en) 2022-03-11

Family

ID=80475882

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010950951.0A Pending CN114170290A (en) 2020-09-10 2020-09-10 Image processing method and related equipment
CN202180062229.6A Pending CN116097307A (en) 2020-09-10 2021-08-19 Image processing method and related equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202180062229.6A Pending CN116097307A (en) 2020-09-10 2021-08-19 Image processing method and related equipment

Country Status (2)

Country Link
CN (2) CN114170290A (en)
WO (1) WO2022052782A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115436488B (en) * 2022-08-31 2023-12-15 南京智慧基础设施技术研究院有限公司 Self-guiding self-adapting mobile detection system and method based on vision and voiceprint fusion

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107483845B (en) * 2017-07-31 2019-09-06 Oppo广东移动通信有限公司 Photographic method and its device
CN107590484A (en) * 2017-09-29 2018-01-16 百度在线网络技术(北京)有限公司 Method and apparatus for information to be presented
WO2019222467A1 (en) * 2018-05-17 2019-11-21 Niantic, Inc. Self-supervised training of a depth estimation system
CN110895822B (en) * 2018-09-13 2023-09-01 虹软科技股份有限公司 Method of operating a depth data processing system
CN110599533B (en) * 2019-09-20 2023-06-27 湖南大学 Quick monocular depth estimation method suitable for embedded platform
CN110889890B (en) * 2019-11-29 2023-07-28 深圳市商汤科技有限公司 Image processing method and device, processor, electronic equipment and storage medium
CN111612831A (en) * 2020-05-22 2020-09-01 创新奇智(北京)科技有限公司 Depth estimation method and device, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578432A (en) * 2022-09-30 2023-01-06 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115620181A (en) * 2022-12-05 2023-01-17 海豚乐智科技(成都)有限责任公司 Aerial image real-time splicing method based on mercator coordinate slices
CN115620181B (en) * 2022-12-05 2023-03-31 海豚乐智科技(成都)有限责任公司 Aerial image real-time splicing method based on mercator coordinate slices

Also Published As

Publication number Publication date
CN116097307A (en) 2023-05-09
WO2022052782A1 (en) 2022-03-17

Similar Documents

Publication Publication Date Title
US20210350168A1 (en) Image segmentation method and image processing apparatus
US11232286B2 (en) Method and apparatus for generating face rotation image
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
CN108898676B (en) Method and system for detecting collision and shielding between virtual and real objects
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
WO2022165809A1 (en) Method and apparatus for training deep learning model
CN112446380A (en) Image processing method and device
CN114170290A (en) Image processing method and related equipment
WO2022100419A1 (en) Image processing method and related device
CN113724379B (en) Three-dimensional reconstruction method and device for fusing image and laser point cloud
CN113673545A (en) Optical flow estimation method, related device, equipment and computer readable storage medium
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN113886510A (en) Terminal interaction method, device, equipment and storage medium
CN114494395A (en) Depth map generation method, device and equipment based on plane prior and storage medium
CN104463962A (en) Three-dimensional scene reconstruction method based on GPS information video
WO2022165722A1 (en) Monocular depth estimation method, apparatus and device
WO2021057091A1 (en) Viewpoint image processing method and related device
CN113284055A (en) Image processing method and device
CN112115786A (en) Monocular vision odometer method based on attention U-net
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
CN114119757A (en) Image processing method, apparatus, device, medium, and computer program product
CN115729250A (en) Flight control method, device and equipment of unmanned aerial vehicle and storage medium
CN116091871B (en) Physical countermeasure sample generation method and device for target detection model
CN116310408B (en) Method and device for establishing data association between event camera and frame camera
US20230177722A1 (en) Apparatus and method with object posture estimating

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination