CN113344997B - Method and system for rapidly acquiring high-definition foreground image only containing target object - Google Patents

Method and system for rapidly acquiring high-definition foreground image only containing target object

Info

Publication number
CN113344997B
CN113344997B (application CN202110655267.4A; published as CN113344997A)
Authority
CN
China
Prior art keywords
foreground
target
depth
depth map
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110655267.4A
Other languages
Chinese (zh)
Other versions
CN113344997A (en)
Inventor
Chen Guanyu (陈冠宇)
Wang Lei (王磊)
Wang Fei (王飞)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fangtian Shenghua Beijing Digital Technology Co ltd
Original Assignee
Fangtian Shenghua Beijing Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fangtian Shenghua Beijing Digital Technology Co ltd filed Critical Fangtian Shenghua Beijing Digital Technology Co ltd
Priority to CN202110655267.4A
Publication of CN113344997A
Application granted
Publication of CN113344997B
Legal status: Active
Anticipated expiration

Classifications

    • G06T 7/50: Image analysis; depth or shape recovery
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06T 5/40: Image enhancement or restoration by the use of histogram techniques
    • G06T 5/50: Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T 7/13: Segmentation; edge detection
    • H04N 23/80: Camera processing pipelines; components thereof
    • G06T 2207/10028: Range image; depth image; 3D point clouds
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A depth map with an extremely low loss rate is obtained from an optimized deep neural network model, and a high-definition foreground image containing only the target object is extracted from the depth map through depth-layer interception, contour edge calculation and target contour extraction. If the foreground image containing only the target object is composited with a background image, a high-definition picture with all non-target foreground removed is obtained. This solves the problem that, when photographing at a scenic spot, a popular check-in location or any other photo site, non-target objects such as passers-by and other tourists enter the frame, so that a high-definition picture containing only the specified target object cannot be obtained in a short time.

Description

Method and system for rapidly acquiring high-definition foreground image only containing target object
Technical Field
The invention relates to the technical field of real-time image processing, in particular to a method and a system for rapidly acquiring a high-definition foreground image containing only a target object.
Background
When photographs are taken at places such as scenic spots or popular check-in locations, the presence of large numbers of other visitors makes it impossible to obtain a target photo containing only the complete information of the specified subject, so visitors cannot capture themselves against the scenery as intended. The common remedy is to remove non-target objects such as passers-by with image-editing software afterwards, which greatly increases time and labor costs.
Disclosure of Invention
The invention provides a method and a system for rapidly acquiring a high-definition foreground image containing only a target object. A depth map with an extremely low loss rate is obtained from an optimized deep neural network model, and depth-layer interception, contour edge calculation and target contour extraction are performed on the depth map to obtain the high-definition foreground image containing only the target object. The target object may include not only a tourist but also the articles the tourist carries, the tourist's shadow and similar information, all preserved intact, so that subsequent processing such as compositing the foreground with a background does not introduce distortion.
The technical scheme of the invention is as follows:
a method for rapidly acquiring a high-definition foreground image only containing a target object is characterized by comprising the following steps:
S1, acquiring a target processing picture and an auxiliary processing picture which are shot at the same time and exhibit parallax;
S2, directly inputting the target processing picture and the auxiliary processing picture into an optimized deep neural network model, and acquiring parallax information of the two pictures so as to obtain a depth map of the target processing picture;
S3, intercepting a foreground layer depth map from the depth map according to the maximum and minimum depth-of-field levels occupied by the target object;
S4, performing contour edge calculation on the foreground layer depth map to obtain all foreground contours in it, yielding a foreground contour depth map;
S5, performing target contour extraction on the foreground contour depth map to obtain the target foreground contour of the target object; the set of pixels of the target processing picture corresponding to the pixels contained in the target foreground contour is the high-definition foreground image containing only the target object.
Preferably, the target object includes a designated subject and all objects in direct contact with that subject.
Preferably, in S3, the depth-of-field range occupied by the target object is intercepted from the depth map according to the minimum and maximum distances between the target object and the camera lens, so as to obtain the foreground layer depth map.
Preferably, in S4, the contour edge calculation includes identifying all feature point sets in the foreground layer depth map and computing the edge of each feature point set to obtain the contours of all feature point sets on the foreground layer depth map, each feature point set being one foreground contour.
Preferably, in S5, the target contour extraction includes extracting the foreground contour of the target object from the contours of all feature point sets of the foreground contour depth map according to the area occupied by the target object and/or the depth of field of the region where the target object is located.
Preferably, in S1, the target processing picture and the auxiliary processing picture include binocular pictures shot by a binocular camera, or such binocular pictures after preprocessing, or pictures shot by a plurality of cameras with parallax, or such pictures after preprocessing.
Preferably, the preprocessing includes image rectification, and the image rectification includes contour detection rectification and/or rotation angle rectification and/or rectification by connecting lines between corresponding similar regions for image matching and/or gray-level rectification and/or binarization rectification and/or histogram equalization rectification.
Preferably, the optimized deep neural network model is obtained through multiple rounds of training and testing, comprising the following steps:
S2.1, splicing the data of the target processing picture and the auxiliary processing picture into different parts of the same image to obtain a feature extraction original image;
S2.2, performing two-dimensional convolution and pooling operations on the feature extraction original image several times to obtain a first feature data set; in the first feature data set the data of the target processing picture and the auxiliary processing picture are not yet correlated but merely spliced together;
S2.3, extracting from the first feature data set, through residual network operations and spatial pyramid pooling operations, second feature data sets at a plurality of resolution levels from high to low, each resolution level corresponding to one second feature data set;
S2.4, symmetrically fusing and normalizing the data information belonging to the target processing picture and to the auxiliary processing picture in each second feature data set to obtain a group of third feature data sets;
S2.5, performing three-dimensional convolution on each third feature data set to obtain a group of initial depth maps;
S2.6, comparing each initial depth map with a real calibrated depth map and computing the loss function of the initial depth maps;
S2.7, extracting features from a large number of original picture groups shot at different moments, repeating steps S2.2 to S2.6, and continuously optimizing the network weights through back propagation to make the loss function value L as small as possible, thereby obtaining the optimized deep neural network model.
Preferably, in S2.6, the loss function value L = Σ Ak·Lk (k = 1, 2, 3, 4, …), where Lk denotes the loss of the initial depth map at each resolution: L1 is the loss of the initial depth map at the highest resolution, L2, L3, … are the losses of the initial depth maps at successively lower resolutions, and Ak is a loss coefficient, a fixed constant, with Ak > Ak+1.
A system for rapidly acquiring a high-definition foreground image containing only a target object comprises a depth map acquisition module, a foreground layer depth map acquisition module, a foreground contour depth map acquisition module and a target foreground image acquisition module. A deep neural network model is arranged in the depth map acquisition module. The depth map acquisition module processes the input target processing picture and auxiliary processing picture through the deep neural network model to obtain a depth map, and the depth map is then processed by the foreground layer depth map acquisition module, the foreground contour depth map acquisition module and the target foreground image acquisition module in sequence to obtain the high-definition target foreground image.
Preferably, the foreground layer depth map acquisition module processes the input depth map according to the maximum and minimum depth-of-field levels occupied by the target object, intercepting the set of depth points between the two levels to obtain a foreground layer depth map; the foreground contour depth map acquisition module performs contour edge calculation on the input foreground layer depth map, calibrating and dividing all foreground contours in it to obtain a foreground contour depth map; the target foreground image acquisition module performs target contour extraction on the input foreground contour depth map to obtain the set of pixels of the target processing picture corresponding to the pixels contained in the target foreground contour, namely the high-definition target foreground image.
Preferably, the depth map acquisition module further comprises a neural network model training submodule. The neural network model training submodule comprises a training set input submodule, a feature extraction submodule, a feature fusion submodule, a depth calculation submodule, a depth information comparison submodule and a loss function adjustment submodule. The training set input submodule splices the data of the target processing picture and the auxiliary processing picture into different parts of the same image to obtain a feature extraction original image. The feature extraction submodule sends the feature extraction original image through convolution and pooling layers for two-dimensional convolution and pooling operations to obtain a first feature data set, and extracts from the first feature data set, through residual network operations and spatial pyramid pooling operations, second feature data sets at a plurality of resolution levels from high to low, each resolution level corresponding to one second feature data set. The feature fusion submodule fuses and normalizes the data information belonging to the target processing picture in each second feature data set with the data information belonging to the auxiliary processing picture, associating the two pictures' data within each second feature data set to obtain a group of third feature data sets. The depth calculation submodule performs three-dimensional convolution on each third feature data set to obtain a group of initial depth maps. The depth information comparison submodule compares each initial depth map with a real calibrated depth map to obtain a loss function. The loss function adjustment submodule continuously optimizes the network weights through back propagation according to the loss functions computed over a large number of original picture groups, making the loss function value L as small as possible and thereby obtaining the deep neural network model.
Compared with the prior art, the invention has the following advantages. In the method for rapidly acquiring a high-definition foreground image containing only a target object, a depth map with an extremely low loss rate is obtained from the optimized deep neural network model, and depth-layer interception, contour edge calculation and target contour extraction are performed on the depth map to obtain the high-definition foreground image containing only the target object. The target object may include not only a tourist but also the articles the tourist carries, the tourist's shadow and similar information, all preserved intact, so that subsequent processing such as compositing the foreground with a background does not introduce distortion. If the foreground image containing only the target object is composited with a background image, a high-definition picture with all non-target foreground removed is obtained. This solves the problem that, when photographing at a scenic spot, a popular check-in location or any other photo site, non-target objects such as passers-by and other tourists enter the frame, so that a high-definition picture containing only the specified target object cannot be obtained in a short time.
Drawings
FIG. 1 is a flowchart of a method for rapidly acquiring a high-definition foreground image containing only a target object according to the present invention;
FIG. 2 is a flowchart of the operation of the optimized deep neural network model of the method for rapidly obtaining a high-definition foreground map containing only a target object according to the present invention;
FIG. 3 is a schematic diagram illustrating an exemplary feature extraction and feature fusion process of the method for rapidly obtaining a high-definition foreground image containing only a target object according to the present invention;
FIG. 4 is a schematic diagram illustrating an example of a depth calculation process of the method for rapidly obtaining a high-definition foreground image containing only a target object according to the present invention;
FIG. 5 is a block diagram of a system for rapidly acquiring a high-definition foreground image containing only a target object according to the present invention.
Detailed Description
To facilitate an understanding of the invention, the invention is described in more detail below with reference to the accompanying figures 1-4 and the specific examples.
Example 1
This example discloses a method for rapidly acquiring a high-definition foreground image containing only a target object; its flowchart is shown in FIG. 1, and it comprises the following steps:
S1, acquiring a target processing picture and an auxiliary processing picture which are shot at the same time and exhibit parallax; here the two pictures are shot in real time by a binocular camera. Two pictures with parallax shot at the same moment by a plurality of cameras may equally serve as the target processing picture and the auxiliary processing picture, and the pictures shot by the binocular camera or by the plurality of parallax cameras may first be preprocessed; in every case the pictures must contain the target object. The preprocessing includes image rectification, and the image rectification includes contour detection rectification and/or rotation angle rectification and/or rectification by connecting lines between corresponding similar regions for image matching and/or gray-level rectification and/or binarization rectification and/or histogram equalization rectification.
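As an illustration of this preprocessing, the sketch below rectifies a binocular pair with OpenCV and applies histogram equalization. It is a minimal sketch under assumptions the patent does not state: the calibration inputs (K1, D1, K2, D2, R, T) and all names are hypothetical, and only two of the listed rectification variants are shown.

```python
import cv2

def preprocess_stereo_pair(img_l, img_r, K1, D1, K2, D2, R, T):
    """Rectify a binocular pair (hypothetical calibration) and apply
    histogram-equalization rectification, one possible reading of S1."""
    size = (img_l.shape[1], img_l.shape[0])  # (width, height)
    # Compute rectification transforms so that epipolar lines become horizontal.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
    map1l, map2l = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
    map1r, map2r = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, map1l, map2l, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map1r, map2r, cv2.INTER_LINEAR)

    def equalize(img):
        # Histogram equalization on the luminance channel only.
        ycc = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
        ycc[:, :, 0] = cv2.equalizeHist(ycc[:, :, 0])
        return cv2.cvtColor(ycc, cv2.COLOR_YCrCb2BGR)

    return equalize(rect_l), equalize(rect_r)
```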
The target object includes a designated subject and all objects in direct contact with that subject. It may be a tourist being photographed at a scenic spot, or the tourist together with everything in contact with the tourist, such as accessories and shadows.
S2, inputting the target processing picture and the auxiliary processing picture into the optimized deep neural network model, and acquiring parallax information of the two pictures so as to obtain a depth map of the target processing picture; each pixel value of the depth map represents the distance of a point in the scene from the camera.
S3, determining the maximum and minimum depth of field occupied by the target object according to the minimum and maximum distances between the target object and the camera lens, thereby obtaining a maximum depth-of-field level and a minimum depth-of-field level, and intercepting from the depth map the set of pixels between the two levels to obtain a foreground layer depth map. At this point the target object may include the designated tourist A and A's shadow, and even a companion B in direct contact with A. The maximum depth-of-field level is the distance level of the pixel belonging to the target object that is farthest from the camera, and the minimum depth-of-field level is that of the pixel belonging to the target object that is closest to the camera. The foreground layer depth map includes all depth points between the maximum and minimum depth-of-field levels.
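Read as code, S3 is a per-pixel band-pass on depth between the two levels. A minimal sketch, assuming a metric depth map; the function and parameter names are hypothetical:

```python
import numpy as np

def intercept_foreground_layer(depth_map, d_min, d_max):
    """S3: keep only the depth points lying between the minimum and maximum
    depth-of-field levels occupied by the target object."""
    mask = (depth_map >= d_min) & (depth_map <= d_max)
    foreground_layer = np.where(mask, depth_map, 0.0)
    return foreground_layer, mask
```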
S4, performing contour edge calculation on the foreground layer depth map to obtain all foreground contours in it, yielding a foreground contour depth map. The contour edge calculation includes identifying all feature point sets in the foreground layer depth map and computing the edge of each set, giving the contours of all feature point sets on the foreground layer depth map. A feature point set is the set of pixels left in the foreground layer depth map by every tourist and/or passer-by and/or briefly present non-background object within the shooting range.
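One way to realize the contour edge calculation is contour tracing on the binarized foreground layer; the patent does not name a specific operator, so the OpenCV call below (assuming OpenCV 4.x) is an illustrative choice:

```python
import cv2
import numpy as np

def compute_foreground_contours(foreground_layer):
    """S4: identify each feature point set and compute its edge; every
    external contour found corresponds to one foreground contour."""
    binary = (foreground_layer > 0).astype(np.uint8) * 255
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return contours  # one entry per foreground contour
```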
S5, extracting the target foreground contour of the target object from the foreground contour depth map according to the area occupied by the target object and/or the depth of field of the region where the target object is located. The set of pixels of the target processing picture corresponding to the pixels contained in the target foreground contour is the high-definition foreground image containing only the target object, and this high-definition target foreground image completely excludes the foreground of all non-target objects. The target foreground contour contains only the contour of the target object. When no other person is in contact with tourist A, the target object may be the contour of A together with the accessories in contact with A (a satchel, a mobile phone, and the like) and A's shadow. The target object may also include companion B: if B is in contact with the designated tourist A, A and B are both treated as the target object at the same time, and only one round of target contour extraction is needed. Target contour extraction consists of extracting the contour of the designated target object from the contours of all feature point sets of the foreground contour depth map according to the area occupied by the designated target object and/or the depth of field of the central region where it is located; during extraction, only the target foreground is retained, for example the designated tourist together with the shadows and accessories in contact with the tourist. If companion B is not in contact with the designated tourist A, a high-definition target foreground image containing only the target object can still be obtained by performing target contour extraction twice or by changing the extraction method. This solves the problem that, when photographing at a scenic spot, a popular check-in location or any other photo site, non-target objects such as passers-by and other tourists enter the frame, so that a high-definition picture containing only the designated target object cannot be obtained in a short time. The target foreground contour obtained through contour edge calculation and target contour extraction includes not only the contour of the designated target object but also the contours of all foreground objects in contact with it, such as the shadow of the photographic subject and/or the various articles worn on the body, the shadow in particular, which ensures to the greatest extent that the picture obtained by compositing the foreground image containing only the target foreground with a background image is not distorted.
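A hedged sketch of the extraction and of the subsequent foreground-background compositing: the area threshold, depth tolerance and selection rule below are illustrative assumptions, not the patent's prescribed algorithm.

```python
import cv2
import numpy as np

def extract_target_foreground(image, depth_map, contours,
                              min_area=5000.0, center_depth=None, tol=0.5):
    """S5: select the contour(s) matching the target by occupied area and/or
    by the depth of the region they cover, then cut the corresponding pixels
    out of the original target processing picture."""
    mask = np.zeros(depth_map.shape, dtype=np.uint8)
    for c in contours:
        if cv2.contourArea(c) < min_area:
            continue  # too small to be the designated target
        if center_depth is not None:
            m = np.zeros_like(mask)
            cv2.drawContours(m, [c], -1, 255, thickness=cv2.FILLED)
            region = depth_map[m > 0]
            if region.size == 0 or abs(float(np.median(region)) - center_depth) > tol:
                continue  # wrong depth-of-field region
        cv2.drawContours(mask, [c], -1, 255, thickness=cv2.FILLED)
    foreground = cv2.bitwise_and(image, image, mask=mask)
    return foreground, mask

def composite_over_background(image, mask, background):
    """Composite the extracted target foreground over a clean background
    plate, yielding a picture with all non-target foreground removed."""
    m3 = cv2.merge([mask, mask, mask]) > 0
    return np.where(m3, image, background)
```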
Preferably, the optimized deep neural network model in step S2 is obtained through multiple rounds of training and testing; its flowchart is shown in FIG. 2, and it includes the following steps:
S2.1, splicing the data of the target processing picture and the auxiliary processing picture into different parts of the same image to obtain a feature extraction original image;
S2.2, as shown in FIG. 3, sending the feature extraction original image through convolution and pooling layers for two-dimensional convolution and pooling operations to obtain a first feature data set; in the first feature data set the data of the target processing picture and the auxiliary processing picture are not yet correlated but merely spliced together;
S2.3, performing residual network operations on the first feature data set through a residual network layer and spatial pyramid pooling operations through a spatial pyramid pooling layer, extracting a group of second feature data sets at four resolutions from high to low, with one second feature data set at each resolution.
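The PyTorch sketch below illustrates only the shape of this stage. The channel counts, strides and the splice-then-split layout are assumptions for illustration, the average-pooling pyramid stands in for the spatial pyramid pooling layer, and nothing here is the patent's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Sketch of S2.1-S2.3: splice the two pictures into one image, run shared
    2D convolution/pooling, then residual blocks, and return features at four
    resolutions from high to low."""
    def __init__(self, ch=32):
        super().__init__()
        self.stem = nn.Sequential(                      # S2.2: 2D conv + pooling
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2))
        self.stages = nn.ModuleList([self._res_block(ch) for _ in range(4)])

    @staticmethod
    def _res_block(ch):
        return nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, target_img, aux_img):
        x = torch.cat([target_img, aux_img], dim=3)     # S2.1: splice side by side
        x = self.stem(x)
        feats = []
        for stage in self.stages:                       # S2.3: residual blocks
            x = F.relu(x + stage(x))
            feats.append(x)                             # one second feature set
            x = F.avg_pool2d(x, 2)                      # next, lower resolution
        return feats                                    # four scales, high to low
```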
S2.4, symmetrically fusing and normalizing, within each second feature data set, the data information belonging to the target processing picture with the data information belonging to the auxiliary processing picture, to obtain a group of third feature data sets;
S2.5, as shown in FIG. 4, performing three-dimensional convolution on each third feature data set, i.e. performing the depth calculation, to obtain a group of initial depth maps;
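A hedged sketch of what symmetric fusion followed by 3D convolution can look like. The concatenation-based cost volume below is an assumption modeled on published stereo-matching designs such as PSMNet, not necessarily the patent's exact construction; the per-picture features can be recovered from the spliced feature map of the previous sketch, e.g. feat_t, feat_a = feat.chunk(2, dim=3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_cost_volume(feat_t, feat_a, max_disp):
    """Sketch of S2.4: associate target-picture and auxiliary-picture features
    by stacking them at every candidate disparity (a third feature data set)."""
    b, c, h, w = feat_t.shape
    cost = feat_t.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        cost[:, :c, d, :, d:] = feat_t[:, :, :, d:]
        cost[:, c:, d, :, d:] = feat_a[:, :, :, :w - d]
    return cost

class DepthHead(nn.Module):
    """Sketch of S2.5: 3D convolutions over the fused volume, then a
    soft-argmin over disparities gives an initial depth (disparity) map."""
    def __init__(self, ch):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, 1, 3, padding=1))

    def forward(self, cost):
        logits = self.conv3d(cost).squeeze(1)           # (B, D, H, W) matching cost
        prob = F.softmax(-logits, dim=1)                # low cost -> high probability
        disp = torch.arange(prob.shape[1], device=prob.device, dtype=prob.dtype)
        return (prob * disp.view(1, -1, 1, 1)).sum(dim=1)  # expected disparity
```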
S2.6, comparing each initial depth map with a real calibrated depth map and computing the loss function of the initial depth maps: the loss function L = A1·L1 + A2·L2 + A3·L3 + A4·L4, where L1 denotes the loss of the initial depth map at the highest resolution, L2, L3 and L4 denote the losses of the initial depth maps at successively lower resolutions, and each Ak is a loss coefficient, a fixed constant, with A1 > A2 > A3 > A4. The real calibrated depth map may be a depth information map computed from the camera lens and information about the camera's position, or depth information entered manually after the real site has been calibrated in advance.
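A minimal sketch of this multi-scale loss. The coefficient values and the smooth-L1 per-scale term are illustrative assumptions; the patent fixes only the ordering A1 > A2 > A3 > A4:

```python
import torch.nn.functional as F

def multiscale_depth_loss(initial_depths, gt_depth, coeffs=(1.0, 0.7, 0.5, 0.3)):
    """S2.6: L = A1*L1 + A2*L2 + A3*L3 + A4*L4, comparing each initial depth
    map with the real calibrated depth map (gt_depth, shape (B, 1, H, W))."""
    loss = 0.0
    for pred, a_k in zip(initial_depths, coeffs):       # high -> low resolution
        if pred.dim() == 3:                             # (B, H, W) -> (B, 1, H, W)
            pred = pred.unsqueeze(1)
        gt = F.interpolate(gt_depth, size=pred.shape[-2:], mode='nearest')
        loss = loss + a_k * F.smooth_l1_loss(pred, gt)
    return loss
```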
S2.7, extracting features from a large number of original picture groups shot at different moments, repeating steps S2.2 to S2.6, and continuously optimizing the network weights through back propagation according to each obtained loss function; optimizing the network weights includes adjusting the two-dimensional convolution parameters of S2.2 and the three-dimensional convolution parameters of S2.5, so that the loss function value L is minimized and the optimized deep neural network model is obtained.
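Put together, S2.7 is an ordinary supervised training loop. The optimizer choice and learning rate are assumptions, and model and loader are hypothetical stand-ins wiring the sketches above to a dataset of picture groups:

```python
import torch

# model: combines the FeaturePyramid / DepthHead sketches and returns a list
# of initial depth maps at four resolutions (hypothetical wiring).
# loader: yields (target_img, aux_img, gt_depth) groups shot at different moments.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for target_img, aux_img, gt_depth in loader:
    initial_depths = model(target_img, aux_img)              # S2.2 - S2.5
    loss = multiscale_depth_loss(initial_depths, gt_depth)   # S2.6
    optimizer.zero_grad()
    loss.backward()                                          # S2.7: back propagation
    optimizer.step()                                         # adjust 2D/3D conv weights
```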
Example 2
A modularized block diagram of a system for rapidly acquiring a high-definition foreground image containing only a target object is shown in FIG. 5. The system comprises a depth map acquisition module, a foreground layer depth map acquisition module, a foreground contour depth map acquisition module and a target foreground image acquisition module; a deep neural network model is arranged in the depth map acquisition module; the depth map acquisition module processes the input target processing picture and auxiliary processing picture through the deep neural network model to acquire their parallax information and thereby the depth map of the target processing picture.
The depth map is then processed by the foreground layer depth map acquisition module, the foreground contour depth map acquisition module and the target foreground image acquisition module in sequence to obtain the high-definition target foreground image.
Preferably, the foreground layer depth map acquisition module determines the maximum and minimum depth-of-field levels occupied by the target object according to the minimum and maximum distances between the target object and the camera, and intercepts the foreground layer depth map from the depth map; the foreground layer depth map includes all depth points between the maximum and minimum depth-of-field levels. The maximum depth-of-field level is the distance level of the pixel belonging to the target object that is farthest from the camera, and the minimum depth-of-field level is that of the pixel belonging to the target object that is closest to the camera.
The foreground contour depth map acquisition module performs contour edge calculation on the input foreground layer depth map, calibrating and dividing all foreground contours in it to obtain a foreground contour depth map. The target foreground image acquisition module performs target contour extraction on the input foreground contour depth map to obtain the set of pixels of the target processing picture corresponding to the pixels contained in the target foreground contour, namely the high-definition target foreground image; the high-definition target foreground image completely excludes the foreground of non-target objects.
Preferably, the depth map acquisition module further comprises a neural network model training submodule. The neural network model training submodule comprises a training set input submodule, a feature extraction submodule, a feature fusion submodule, a depth calculation submodule, a depth information comparison submodule and a loss function adjustment submodule. The training set input submodule splices the data of the target processing picture and the auxiliary processing picture into different parts of the same image to obtain a feature extraction original image. The feature extraction submodule sends the feature extraction original image through convolution and pooling layers for two-dimensional convolution and pooling operations to obtain a first feature data set, and extracts from the first feature data set, through residual network operations and spatial pyramid pooling operations, second feature data sets at a plurality of resolution levels from high to low, each resolution level corresponding to one second feature data set. The feature fusion submodule fuses and normalizes the data information belonging to the target processing picture in each second feature data set with the data information belonging to the auxiliary processing picture, associating the two pictures' data within each second feature data set to obtain a group of third feature data sets. The depth calculation submodule performs three-dimensional convolution on each third feature data set to obtain a group of initial depth maps. The depth information comparison submodule compares each initial depth map with a real calibrated depth map to obtain a loss function. The loss function adjustment submodule continuously optimizes the network weights through back propagation according to the loss functions computed over a large number of original picture groups, making the loss function value L as small as possible and thereby obtaining the deep neural network model.
It should be noted that the above-described embodiments are intended to help those skilled in the art understand the invention more fully and do not limit it in any way. Although the invention has been described in detail with reference to the drawings and examples, those skilled in the art will appreciate that it may still be modified or equivalently substituted; all technical solutions and modifications that do not depart from the spirit and scope of the invention are covered by the protection scope of this patent.

Claims (6)

1. A method for rapidly acquiring a high-definition foreground image only containing a target object is characterized by comprising the following steps:
S1, acquiring a target processing picture and an auxiliary processing picture which are shot at the same time and exhibit parallax;
S2, directly inputting the target processing picture and the auxiliary processing picture into an optimized deep neural network model, and acquiring parallax information of the two pictures so as to obtain a depth map of the target processing picture;
wherein the optimized deep neural network model is obtained through multiple rounds of training and testing, comprising the following steps:
S2.1, splicing the data of the target processing picture and the auxiliary processing picture into different parts of the same image to obtain a feature extraction original image;
S2.2, performing two-dimensional convolution and pooling operations on the feature extraction original image several times to obtain a first feature data set; in the first feature data set the data of the target processing picture and the auxiliary processing picture are not correlated but merely spliced together;
S2.3, extracting from the first feature data set, through residual network operations and spatial pyramid pooling operations, second feature data sets at a plurality of resolution levels from high to low, each resolution level corresponding to one second feature data set;
S2.4, symmetrically fusing and normalizing the data information belonging to the target processing picture and to the auxiliary processing picture in each second feature data set to obtain a group of third feature data sets;
S2.5, performing three-dimensional convolution on each third feature data set to obtain a group of initial depth maps;
S2.6, comparing each initial depth map with a real calibrated depth map and computing the loss function of the initial depth maps;
S2.7, extracting a large number of different features from the original pictures, repeating steps S2.2 to S2.6, and continuously optimizing the network weights through back propagation until the loss function value L meets the requirement, thereby obtaining the optimized deep neural network model;
S3, intercepting a foreground layer depth map from the depth map according to the maximum and minimum depth-of-field levels occupied by the target object;
S4, performing contour edge calculation on the foreground layer depth map to obtain all foreground contours in it, yielding a foreground contour depth map;
S5, performing target contour extraction on the foreground contour depth map to obtain the target foreground contour of a target object, wherein the set of pixels of the target processing picture corresponding to the pixels contained in the target foreground contour is the high-definition foreground image containing only the target object;
the target object includes a designated subject and all objects in direct contact with that subject.
2. The method for rapidly acquiring a high-definition foreground image containing only a target object according to claim 1, wherein, in S3, the depth-of-field range occupied by the target object is intercepted from the depth map according to the minimum and maximum distances between the target object and the camera lens, so as to obtain the foreground layer depth map.
3. The method for rapidly acquiring a high-definition foreground image containing only a target object according to claim 1, wherein, in S4, the contour edge calculation includes identifying all feature point sets in the foreground layer depth map and computing the edge of each feature point set to obtain the contours of all feature point sets on the foreground layer depth map, each feature point set being one foreground contour; and/or, in S5, the target contour extraction includes extracting the foreground contour of the target object from the contours of all feature point sets of the foreground contour depth map according to the area occupied by the target object and/or the depth of field of the region where the target object is located.
4. The method for rapidly acquiring a high-definition foreground image containing only a target object according to claim 1, wherein, in S1, the target processing picture and the auxiliary processing picture include binocular pictures shot by a binocular camera, or such binocular pictures after preprocessing, or pictures shot by a plurality of cameras with parallax, or such pictures after preprocessing.
5. The method for rapidly acquiring a high-definition foreground image containing only a target object according to claim 1, wherein, in S2.6, the loss function L = Σ Ak·Lk (k = 1, 2, 3, 4, …), where Lk denotes the loss of the initial depth map at each resolution: L1 is the loss of the initial depth map at the highest resolution, L2, L3, … are the losses of the initial depth maps at successively lower resolutions, and Ak is a loss coefficient, a fixed constant, with Ak > Ak+1.
6. A system for rapidly acquiring a high-definition foreground image containing only a target object, comprising a depth map acquisition module, a foreground layer depth map acquisition module, a foreground contour depth map acquisition module and a target foreground image acquisition module; a deep neural network model is arranged in the depth map acquisition module; the depth map acquisition module processes the input target processing picture and auxiliary processing picture through the deep neural network model to obtain a depth map, and the depth map is processed by the foreground layer depth map acquisition module, the foreground contour depth map acquisition module and the target foreground image acquisition module in sequence to obtain the high-definition target foreground image;
the foreground layer depth map acquisition module processes the input depth map according to the maximum and minimum depth-of-field levels occupied by the target object, intercepting the set of depth points between the two levels to obtain a foreground layer depth map; the foreground contour depth map acquisition module performs contour edge calculation on the input foreground layer depth map, calibrating and dividing all foreground contours in it to obtain a foreground contour depth map; the target foreground image acquisition module performs target contour extraction on the input foreground contour depth map to obtain the set of pixels of the target processing picture corresponding to the pixels contained in the target foreground contour, namely the high-definition target foreground image; the target object includes a designated subject and all objects in direct contact with that subject; the depth map acquisition module further comprises a neural network model training submodule; the neural network model training submodule comprises a training set input submodule, a feature extraction submodule, a feature fusion submodule, a depth calculation submodule, a depth information comparison submodule and a loss function adjustment submodule; the training set input submodule splices the data of the target processing picture and the auxiliary processing picture into different parts of the same image to obtain a feature extraction original image; the feature extraction submodule sends the feature extraction original image through convolution and pooling layers for two-dimensional convolution and pooling operations to obtain a first feature data set, and extracts from the first feature data set, through residual network operations and spatial pyramid pooling operations, second feature data sets at a plurality of resolution levels from high to low, each resolution level corresponding to one second feature data set; the feature fusion submodule fuses and normalizes the data information belonging to the target processing picture in each second feature data set with the data information belonging to the auxiliary processing picture, associating the two pictures' data within each second feature data set to obtain a group of third feature data sets; the depth calculation submodule performs three-dimensional convolution on each third feature data set to obtain a group of initial depth maps; the depth information comparison submodule compares each initial depth map with a real calibrated depth map to obtain a loss function; the loss function adjustment submodule continuously optimizes the network weights through back propagation according to the loss functions computed over a large number of original picture groups until the loss function value L meets the requirement, thereby obtaining the optimized deep neural network model.
CN202110655267.4A 2021-06-11 2021-06-11 Method and system for rapidly acquiring high-definition foreground image only containing target object Active CN113344997B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110655267.4A (published as CN113344997B) | 2021-06-11 | 2021-06-11 | Method and system for rapidly acquiring high-definition foreground image only containing target object

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110655267.4A (published as CN113344997B) | 2021-06-11 | 2021-06-11 | Method and system for rapidly acquiring high-definition foreground image only containing target object

Publications (2)

Publication Number | Publication Date
CN113344997A (en) | 2021-09-03
CN113344997B (en), granted | 2022-07-26

Family

ID=77477071

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110655267.4A (Active; published as CN113344997B) | Method and system for rapidly acquiring high-definition foreground image only containing target object | 2021-06-11 | 2021-06-11

Country Status (1)

Country Link
CN (1) CN113344997B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509343A (en) * 2011-09-30 2012-06-20 北京航空航天大学 Binocular image and object contour-based virtual and actual sheltering treatment method
CN105894998A (en) * 2014-11-30 2016-08-24 黄石木信息科技有限公司 Making method of three-dimensional virtual scene tour guidance system
CN106021330A (en) * 2016-05-06 2016-10-12 浙江工业大学 A three-dimensional model retrieval method used for mixed contour line views
CN109035319A (en) * 2018-07-27 2018-12-18 深圳市商汤科技有限公司 Monocular image depth estimation method and device, equipment, program and storage medium
CN111161291A (en) * 2019-12-31 2020-05-15 广西科技大学 Contour detection method based on target depth of field information

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5780865B2 (en) * 2011-07-14 2015-09-16 キヤノン株式会社 Image processing apparatus, imaging system, and image processing system
RU2014109439A 2014-03-12 2015-09-20 LSI Corporation Image processor comprising a gesture recognition system with hand-position matching based on contour features
HUP1400600A2 (en) * 2014-12-17 2016-06-28 Pi Holding Zrt Method to replace image segment content
CN105225230B (en) * 2015-09-11 2018-07-13 浙江宇视科技有限公司 A kind of method and device of identification foreground target object
CN107481261B (en) * 2017-07-31 2020-06-16 中国科学院长春光学精密机械与物理研究所 Color video matting method based on depth foreground tracking
CN107395982A (en) * 2017-08-22 2017-11-24 北京小米移动软件有限公司 Photographic method and device
CN109934834A (en) * 2017-12-19 2019-06-25 北京京东尚科信息技术有限公司 Image outline extracting method and system
CN110009555B (en) * 2018-01-05 2020-08-14 Oppo广东移动通信有限公司 Image blurring method and device, storage medium and electronic equipment
CN108510535B (en) * 2018-03-14 2020-04-24 大连理工大学 High-quality depth estimation method based on depth prediction and enhancer network
CN109598754B (en) * 2018-09-29 2020-03-17 天津大学 Binocular depth estimation method based on depth convolution network
WO2020087485A1 (en) * 2018-11-02 2020-05-07 Oppo广东移动通信有限公司 Method for acquiring depth image, device for acquiring depth image, and electronic device
US10839543B2 (en) * 2019-02-26 2020-11-17 Baidu Usa Llc Systems and methods for depth estimation using convolutional spatial propagation networks
CN109829850B (en) * 2019-03-06 2023-04-28 百度在线网络技术(北京)有限公司 Image processing method, device, equipment and computer readable medium
CN110189339A (en) * 2019-06-03 2019-08-30 重庆大学 The active profile of depth map auxiliary scratches drawing method and system
CN110414674B (en) * 2019-07-31 2021-09-10 浙江科技学院 Monocular depth estimation method based on residual error network and local refinement
CN110517306B (en) * 2019-08-30 2023-07-28 的卢技术有限公司 Binocular depth vision estimation method and system based on deep learning
CN110738697B (en) * 2019-10-10 2023-04-07 福州大学 Monocular depth estimation method based on deep learning
CN111105451B (en) * 2019-10-31 2022-08-05 武汉大学 Driving scene binocular depth estimation method for overcoming occlusion effect
CN112699844B (en) * 2020-04-23 2023-06-20 华南理工大学 Image super-resolution method based on multi-scale residual hierarchy close-coupled network
CN111797841B (en) * 2020-05-10 2024-03-22 浙江工业大学 Visual saliency detection method based on depth residual error network
CN112070782B (en) * 2020-08-31 2024-01-09 腾讯科技(深圳)有限公司 Method, device, computer readable medium and electronic equipment for identifying scene contour
CN112070054B (en) * 2020-09-17 2022-07-29 福州大学 Vehicle-mounted laser point cloud marking classification method based on graph structure and attention mechanism
CN112672048A (en) * 2020-12-21 2021-04-16 山西方天圣华数字科技有限公司 Image processing method based on binocular image and neural network algorithm
CN112802079A (en) * 2021-01-19 2021-05-14 奥比中光科技集团股份有限公司 Disparity map acquisition method, device, terminal and storage medium
CN112767467B (en) * 2021-01-25 2022-11-11 郑健青 Double-image depth estimation method based on self-supervision deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于条件生成对抗网络的手势图像背景去除方法";王庆飞等;《计算机应用研究》;20201231;第37卷;第401-403页 *

Also Published As

Publication number Publication date
CN113344997A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
JP6929047B2 (en) Image processing equipment, information processing methods and programs
CN110222787B (en) Multi-scale target detection method and device, computer equipment and storage medium
CN111107337B (en) Depth information complementing method and device, monitoring system and storage medium
CN106462956A (en) Local adaptive histogram equalization
WO2018171008A1 (en) Specular highlight area restoration method based on light field image
AU2020203790B2 (en) Transformed multi-source content aware fill
CN112085802A (en) Method for acquiring three-dimensional finger vein image based on binocular camera
KR20150031085A (en) 3D face-modeling device, system and method using Multiple cameras
CN112969023A (en) Image capturing method, apparatus, storage medium, and computer program product
CN115115611A (en) Vehicle damage identification method and device, electronic equipment and storage medium
CN111310567A (en) Face recognition method and device under multi-person scene
CN113225484B (en) Method and device for rapidly acquiring high-definition picture shielding non-target foreground
CN113096016A (en) Low-altitude aerial image splicing method and system
CN113344997B (en) Method and system for rapidly acquiring high-definition foreground image only containing target object
CN111105370A (en) Image processing method, image processing apparatus, electronic device, and readable storage medium
CN116051736A (en) Three-dimensional reconstruction method, device, edge equipment and storage medium
CN113538315B (en) Image processing method and device
CN111630569B (en) Binocular matching method, visual imaging device and device with storage function
CN112288669A (en) Point cloud map acquisition method based on light field imaging
CN116452776B (en) Low-carbon substation scene reconstruction method based on vision synchronous positioning and mapping system
CN112085653B (en) Parallax image splicing method based on depth of field compensation
CN111010558B (en) Stumpage depth map generation method based on short video image
CN111080689B (en) Method and device for determining face depth map
CN113487492A (en) Parallax value correction method, parallax value correction device, electronic apparatus, and storage medium
CN116664386A (en) Image processing method, device, mobile terminal and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211215

Address after: 101125 404-c37, zone B, No. a 560, Luyuan South Street, Tongzhou District, Beijing

Applicant after: Fangtian Shenghua (Beijing) Digital Technology Co.,Ltd.

Address before: 030000 floor 20, Hongfu complex building, Xiaodian District, Taiyuan City, Shanxi Province

Applicant before: Shanxi Fangtian Shenghua Digital Technology Co.,Ltd.

GR01 Patent grant