WO2021169404A1 - Depth image generation method, device, and storage medium - Google Patents

Depth image generation method, device, and storage medium

Info

Publication number
WO2021169404A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
depth
image
feature map
scale
Prior art date
Application number
PCT/CN2020/127891
Other languages
English (en)
French (fr)
Inventor
张润泽
易鸿伟
陈颖
徐尚
戴宇荣
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2021169404A1
Priority to US17/714,654 (published as US20220230338A1)

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 7/00 Image analysis
            • G06T 7/50 Depth or shape recovery
              • G06T 7/55 Depth or shape recovery from multiple images
          • G06T 2207/00 Indexing scheme for image analysis or image enhancement
            • G06T 2207/10 Image acquisition modality
              • G06T 2207/10004 Still image; Photographic image
              • G06T 2207/10024 Color image
              • G06T 2207/10028 Range image; Depth image; 3D point clouds
            • G06T 2207/20 Special algorithmic details
              • G06T 2207/20076 Probabilistic image processing
              • G06T 2207/20081 Training; Learning
              • G06T 2207/20084 Artificial neural networks [ANN]
              • G06T 2207/20212 Image combination
                • G06T 2207/20221 Image fusion; Image merging
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/25 Fusion techniques
                • G06F 18/253 Fusion techniques of extracted features
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/40 Extraction of image or video features
              • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
                • G06V 10/443 Local feature extraction by matching or filtering
                  • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
                    • G06V 10/451 Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
                      • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
            • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V 10/806 Fusion of extracted features
              • G06V 10/82 Arrangements using neural networks
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
              • G06N 3/08 Learning methods

Definitions

  • the embodiments of the present application relate to the field of computer technology, and in particular, to a method, device, and storage medium for generating a depth image.
  • Three-dimensional models can be applied to a variety of scenarios, such as constructing a three-dimensional model of a building or a three-dimensional model of a human body.
  • When constructing a three-dimensional model of an object, a depth image of the object needs to be generated first, so how to generate a depth image has become an urgent problem to be solved.
  • the embodiments of the present application provide a depth image generation method, device, and storage medium, which can improve the accuracy of the depth image.
  • the technical solution is as follows:
  • In one aspect, a depth image generation method is provided. The method includes:
  • acquiring multiple target images, where the multiple target images are respectively obtained by shooting a target object according to different perspectives;
  • performing multi-level convolution processing on the multiple target images through multiple convolutional layers in a convolution model to obtain feature map sets respectively output by the multiple convolutional layers, where each feature map set includes feature maps corresponding to the multiple target images;
  • performing perspective aggregation on the multiple feature maps in each feature map set to obtain an aggregated feature corresponding to each feature map set; and
  • performing fusion processing on the multiple aggregated features obtained, to obtain a depth image.
  • In another aspect, a depth image generation device is provided. The device includes:
  • an image acquisition module, configured to acquire multiple target images, where the multiple target images are respectively obtained by shooting a target object according to different perspectives;
  • a convolution processing module, configured to perform multi-level convolution processing on the multiple target images through multiple convolutional layers in a convolution model to obtain feature map sets respectively output by the multiple convolutional layers, where each feature map set includes feature maps corresponding to the multiple target images;
  • a perspective aggregation module, configured to perform perspective aggregation on the multiple feature maps in each feature map set to obtain an aggregated feature corresponding to each feature map set; and
  • a feature fusion module, configured to perform fusion processing on the multiple aggregated features obtained, to obtain a depth image.
  • In another aspect, a computer device is provided. The computer device includes a processor and a memory, the memory stores at least one piece of program code, and the at least one piece of program code is loaded and executed by the processor to implement the depth image generation method described in the above aspect.
  • In another aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores at least one piece of program code, and the at least one piece of program code is loaded and executed by a processor to implement the depth image generation method described in the above aspect.
  • With the method, device, and storage medium provided by the embodiments of the present application, multiple target images are acquired, where the multiple target images are obtained by shooting a target object according to different perspectives; multi-level convolution processing is performed on the multiple target images through multiple convolutional layers in a convolution model to obtain feature map sets respectively output by the multiple convolutional layers; the multiple feature maps in each feature map set are aggregated by perspective to obtain an aggregated feature corresponding to each feature map set; and the multiple aggregated features obtained are fused to obtain a depth image.
  • The multiple target images are obtained by shooting the target object according to different perspectives, so that the multiple target images include information about the target object from different angles, which enriches the amount of information in the acquired target images. The multi-level convolution processing performed by the multiple convolutional layers yields multiple different feature map sets, which enriches the information carried by the feature maps, and the feature maps output by the multiple convolutional layers are fused, which enriches the information contained in the obtained depth image, thereby improving the accuracy of the depth image.
  • In addition, the multiple feature maps in each feature map set are aggregated by perspective, so that feature maps converted to the same perspective can subsequently be fused, which improves the accuracy of the resulting aggregated features and therefore the accuracy of the obtained depth image.
  • Moreover, the probability map corresponding to each aggregated feature is incorporated in the fusion, so that when the multiple aggregated features are fused, the influence of the probability at each pixel position is taken into consideration, which improves the accuracy of the obtained fourth aggregated feature and thereby the accuracy of the obtained depth image.
  • FIG. 1 is a schematic structural diagram of an implementation environment provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a depth image generation method provided by an embodiment of the present application
  • FIG. 3 is a flowchart of a method for generating a depth image provided by an embodiment of the present application
  • FIG. 4 is a flowchart of fusion processing for a second feature volume provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of fusion processing for a second feature volume provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a depth image generation method provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a depth image generation model provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of a depth image generation method provided by an embodiment of the present application.
  • FIG. 9 is a flowchart of a depth image fusion provided by an embodiment of the present application.
  • FIG. 10 is a flow chart of generating a three-dimensional model provided by an embodiment of the present application.
  • FIG. 11 is a flow chart of generating a three-dimensional model provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a depth image generating apparatus provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a depth image generation device provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • The terms "first", "second", and the like used in this application may be used herein to describe various concepts, but unless otherwise specified, these concepts are not limited by these terms. These terms are only used to distinguish one concept from another.
  • For example, without departing from the scope of this application, a first feature map may be referred to as a second feature map, and similarly, a second feature map may be referred to as a first feature map.
  • As used in this application, "plurality" (or "multiple") includes two or more, "each" refers to every one of the corresponding multiple items, and "any" refers to any one of the multiple items.
  • For example, if multiple elements include 3 elements, "each" refers to every one of these 3 elements, and "any" refers to any one of these 3 elements, which may be the first, the second, or the third.
  • The depth image generation method provided in the embodiments of the present application can be applied to a computer device.
  • Optionally, the computer device is a terminal, such as a mobile phone, a computer, a tablet computer, or another type of terminal.
  • The terminal shoots a target object through a camera to obtain multiple target images, performs multi-level convolution processing on the multiple target images through multiple convolutional layers in a convolution model to obtain the feature map sets output by the multiple convolutional layers, performs perspective aggregation on the multiple feature maps in each feature map set to obtain an aggregated feature corresponding to each feature map set, and performs fusion processing on the multiple aggregated features obtained to obtain a depth image.
  • Optionally, the computer device includes a server and a terminal.
  • FIG. 1 is a schematic structural diagram of an implementation environment provided by an embodiment of the present application. As shown in FIG. 1, the implementation environment includes a terminal 101 and a server 102.
  • the terminal 101 establishes a communication connection with the server 102, and interacts through the established communication connection.
  • The terminal 101 is any of various types of terminals, such as a mobile phone, a computer, or a tablet computer.
  • the server 102 is a server, or a server cluster composed of several servers, or a cloud computing server center.
  • the terminal 101 shoots the target object through the camera, acquires multiple target images, and sends the acquired multiple target images to the server 102.
  • The server 102 performs multi-level convolution processing on the multiple target images through multiple convolutional layers in a convolution model to obtain the feature map sets respectively output by the multiple convolutional layers, performs perspective aggregation on the multiple feature maps in each feature map set to obtain an aggregated feature corresponding to each feature map set, and performs fusion processing on the multiple aggregated features to obtain a depth image.
  • the server 102 can also send the depth image to the terminal 101.
  • the method provided in the embodiment of the present application can be used in various scenarios for constructing a three-dimensional model.
  • For example, when a user photographs a building through the camera of a terminal, the terminal uses the depth image generation method provided in the embodiments of this application to photograph the building from different perspectives, processes the acquired multiple target images to obtain a depth image, and then processes the depth image to obtain a three-dimensional model of the building, so that the three-dimensional model of the building can subsequently be used for surveying and mapping.
  • For another example, the user photographs the interior of a house through the camera of the terminal.
  • The terminal uses the depth image generation method provided in the embodiments of this application to photograph the interior of the house from different perspectives, processes the acquired multiple target images to obtain a depth image, and then processes the depth image to obtain a three-dimensional model of the house. The user can simulate a home layout in the three-dimensional model of the house to realize a dynamic display of the home design, so that the user can intuitively view the displayed home design.
  • FIG. 2 is a flowchart of a depth image generation method provided by an embodiment of the present application, which is applied to a computer device. As shown in FIG. 2, the method includes:
  • the computer device acquires multiple target images.
  • the multiple target images are respectively obtained by shooting the target objects according to different perspectives.
  • The same target object is photographed from different perspectives, and the display states of the target object in the obtained multiple target images are different. Therefore, the multiple target images can be processed according to the differences between them to generate a depth image of the target object, so that a three-dimensional model of the target object can subsequently be obtained from the depth image.
  • the angle of view is jointly determined by the shooting parameters of the camera and the relative position between the camera and the target object.
  • The shooting parameters may include a focal length, pixels, and so on. For example, keeping the relative position between the camera and the object unchanged and shooting the object with different shooting parameters produces images with different perspectives; keeping the shooting parameters unchanged and shooting the object from different relative positions also produces images with different perspectives; and shooting the object from different relative positions with different shooting parameters likewise produces images with different perspectives.
  • The display state of the target object differs between the target images, where the display state may include the size at which the target object is displayed, the display position of the target object in the image, and the angle at which the target object is displayed.
  • For example, three images are obtained by shooting a target object from different perspectives: the first image displays the target object in its upper-left corner and shows the left side of the target object; the second image displays the target object in its middle area and shows the front of the target object; and the third image displays the target object in its lower-right corner and shows the right side of the target object. In addition, the size at which the target object is displayed decreases from the first image to the third.
  • different target images correspond to different viewing angles. Therefore, the same target object is included in different target images, and the display state of the target object is different.
  • the multiple target images may be directly obtained by photographing the target object according to different perspectives, or may be obtained after processing the photographed images.
  • Optionally, the target object is photographed from multiple different perspectives to directly obtain the multiple target images; or the target object is photographed from multiple different perspectives to obtain multiple original images, the scales of the multiple original images are adjusted, and the adjusted images are used as the multiple target images, where the scales of the multiple target images are the same.
  • The scale adjustment of the multiple original images may be: reducing the multiple original images to obtain multiple target images of a smaller scale, or enlarging the multiple original images to obtain multiple target images of a larger scale. Since the scales of the multiple original images obtained by shooting are the same, the scales of the multiple target images obtained after scaling are also equal.
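  • A minimal illustrative sketch of this scale adjustment is given below; it assumes the original images are read and resized with OpenCV, and the function name make_target_images and the scale factor are placeholders rather than details from the patent.

```python
# Illustrative sketch (not part of the patent): adjust the scales of multiple
# original images, captured from different perspectives, so that the resulting
# target images all share the same scale.
import cv2  # assumed available

def make_target_images(original_paths, scale=0.5):
    targets = []
    for path in original_paths:
        img = cv2.imread(path)  # H x W x 3
        h, w = img.shape[:2]
        # All originals share the same size, so one scale factor keeps every
        # target image at the same (smaller or larger) scale.
        targets.append(cv2.resize(img, (int(w * scale), int(h * scale))))
    return targets
```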
  • the computer device performs multi-level convolution processing on multiple target images through multiple convolution layers in the convolution model to obtain feature map sets respectively output by the multiple convolution layers.
  • the convolution model is used to obtain the feature map of the image.
  • Optionally, the convolution model is a two-dimensional convolutional network model, such as VGG (Visual Geometry Group network model), ResNet (a convolutional network model), or the like.
  • the convolution model includes multiple convolution layers, and each convolution layer is used to perform convolution processing on the input image and output a feature map of the image.
  • the feature map set includes feature maps corresponding to multiple target images, and the feature maps are used to represent the features included in the corresponding target images, such as color features, texture features, shape features, or spatial features.
  • Each convolutional layer can output a feature map set. The feature maps included in each feature map set correspond to the multiple target images one-to-one, and the number of feature maps included in each feature map set is equal to the number of the multiple target images. Since the multiple convolutional layers all perform convolution processing on the multiple target images, multiple feature map sets can be obtained, and the number of feature map sets is equal to the number of convolutional layers.
  • Optionally, this step 202 may include: performing convolution processing on the multiple target images through the first convolutional layer in the convolution model to obtain the feature map set output by the first convolutional layer; and performing convolution processing on each feature map in the feature map set output by the previous convolutional layer through the next convolutional layer in the convolution model to obtain the feature map set output by the next convolutional layer, until the feature map sets output by the multiple convolutional layers are obtained.
  • For example, the convolution model includes 4 convolutional layers. The multiple target images are input into the first convolutional layer of the convolution model, and convolution processing is performed on the multiple target images through the first convolutional layer to obtain the first feature map set output by the first convolutional layer, where the first feature map set includes the first feature maps corresponding to the multiple target images. The first feature map set is input into the second convolutional layer, and convolution processing is performed on each first feature map in the first feature map set through the second convolutional layer to obtain the second feature map set output by the second convolutional layer, where the second feature map set includes the second feature maps corresponding to the multiple target images. The second feature map set is input into the third convolutional layer, and convolution processing is performed on each second feature map in the second feature map set through the third convolutional layer to obtain the third feature map set output by the third convolutional layer, where the third feature map set includes the third feature maps corresponding to the multiple target images. The third feature map set is input into the fourth convolutional layer, and convolution processing is performed on each third feature map in the third feature map set through the fourth convolutional layer to obtain the fourth feature map set output by the fourth convolutional layer, where the fourth feature map set includes the fourth feature maps corresponding to the multiple target images. In this way, the feature map sets respectively output by the 4 convolutional layers are obtained.
  • Optionally, each feature map can be expressed as F_i^l, where i represents the serial number of the target image and is an integer greater than 0 and not greater than N; N represents the number of the multiple target images and is an integer greater than 1; l represents any one of the multiple convolutional layers and is an integer greater than 0 and not greater than L; and L represents the number of the multiple convolutional layers and is an integer greater than 1.
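  • As an illustration only, the multi-level convolution of step 202 can be sketched as a small 2D convolutional network whose every layer outputs one feature map set for the N target images; the channel sizes and strides below are assumptions, not values from the patent.

```python
# Hedged sketch of step 202: every convolutional layer outputs one feature map
# set for the N target images, and the scales shrink from layer to layer.
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, stride=1, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, target_images):      # target_images: (N, 3, H, W)
        feature_map_sets = []
        x = target_images
        for layer in self.layers:
            x = layer(x)                   # feature maps of all N images at this level
            feature_map_sets.append(x)     # one feature map set per convolutional layer
        return feature_map_sets

# Example: sets = MultiScaleExtractor()(torch.rand(5, 3, 512, 640))
```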
  • the computer device separately aggregates multiple feature maps in each feature map set to obtain aggregated features corresponding to each feature map set.
  • Each feature map set includes multiple feature maps, and the multiple feature maps correspond to the multiple target images one-to-one. Because different target images correspond to different perspectives, perspective aggregation of the multiple feature maps is used to convert the multiple feature maps into the same perspective, and the feature maps under the same perspective are then aggregated to obtain the aggregated feature, which eliminates the viewing angle differences between the different target images. When acquiring the aggregated feature corresponding to each feature map set, a Self-adaptive View Aggregation method is adopted, that is, the multiple feature maps are converted into feature maps under the same perspective and then fused.
  • this step 203 may include the following steps 2031-2034:
  • For each feature map set, one of the multiple target images is used as a reference image, and the other target images are used as first images. The reference image may be any one of the multiple target images, and the first images may include one or more images.
  • For example, when the number of target images is 2, the number of first images is 1; when the number of target images is 5, the number of first images is 4.
  • Accordingly, the reference feature map corresponding to the reference image and the first feature maps corresponding to the first images can be determined from the multiple feature maps in the feature map set.
  • The first feature maps are converted so that the viewing angle of the image corresponding to each converted second feature map is the same as the viewing angle of the reference image. Since the multiple target images correspond to different viewing angles, converting the first feature maps so that their corresponding viewing angles are the same as that of the reference image eliminates the differences in image shooting angles.
  • Optionally, the feature map set includes multiple first feature maps. For any first feature map, viewing angle conversion is performed on the first feature map according to the difference between the shooting angle of the first image corresponding to that first feature map and the shooting angle of the reference image, to obtain the converted second feature map. The other first feature maps are converted in a similar manner, so that the second feature maps corresponding to the multiple first feature maps can be obtained.
  • this step 2033 may include the following steps 1-4:
  • Step 1 Acquire a first shooting parameter corresponding to the first image and a reference shooting parameter corresponding to the reference image.
  • The shooting parameters may include a focal length, pixels, and so on. Since different target images correspond to different viewing angles, and the viewing angle is determined by the shooting parameters of the camera and the relative position between the camera and the target object, the shooting parameters corresponding to the first image and to the reference image are acquired so that viewing angle conversion can subsequently be performed on the feature maps through the shooting parameters.
  • the shooting parameters may be obtained when shooting the target object.
  • a user shoots a target object through a mobile phone, and the sensor of the mobile phone records the shooting parameters of the shooting target object, and then multiple target images and shooting parameters corresponding to each target image are obtained.
  • Step 2 Determine multiple depth values corresponding to the convolutional layer that outputs the feature map set.
  • the depth value is used to indicate the distance between the camera and the target object when shooting the target object.
  • the multiple depth values may be 0.1 meters, 0.2 meters, 0.3 meters, 0.4 meters, and so on.
  • the multiple depth values corresponding to the convolutional layer can be preset or determined according to the depth range and the preset number of depth values.
  • Optionally, different convolutional layers correspond to different sets of depth values. For example, the first convolutional layer corresponds to depth values of 0.1 meters, 0.2 meters, and 0.3 meters, and the second convolutional layer corresponds to depth values of 0.1 meters, 0.3 meters, and 0.5 meters.
  • the number of depth layers can be preset by the developer, and the number of depth layers can be any number, such as 100, 80, and so on.
  • the preset depth range is used to indicate the range to which the distance between the target object and the camera belongs when multiple target images are captured, and may be preset or predicted based on multiple target images.
  • the preset depth range is (0, 1) meters, or (1, 2) meters, etc.
  • Optionally, the preset depth range is divided according to the number of depth layers, and multiple depth values are extracted from the preset depth range.
  • the difference between any two adjacent depth values is equal, and the number of the multiple depth values is equal to the value corresponding to the number of depth layers.
  • Optionally, the arrangement order L of the convolutional layer that outputs the feature map set is determined, and the number of depth layers D_L is determined according to the arrangement order L, where D_L represents the number of depth layers of the convolutional layer whose arrangement order is L among the multiple convolutional layers.
  • Optionally, the maximum depth value and the minimum depth value in the preset depth range are determined, the difference between the maximum depth value and the minimum depth value is used as the depth span, the value obtained by subtracting 1 from the number of depth layers is used as the first value, and the ratio of the depth span to the first value is used as the depth interval. Then, starting from the minimum depth value, one depth value is determined at every depth interval within the preset depth range, so that a number of depth values equal to the number of depth layers is obtained.
  • For example, the preset depth range is [1, 9] meters and the number of depth layers is 5, so the maximum depth value is 9 meters and the minimum depth value is 1 meter, the depth span is 8, the first value is 4, and the depth interval determined from the depth span and the first value is 2. Starting from the minimum depth value 1, one depth value is determined at every interval of 2, so 1, 3, 5, 7, and 9 in the preset depth range are determined as the depth values.
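  • The following short snippet, for illustration only (the function name depth_hypotheses is a placeholder), implements the evenly spaced sampling of depth values described above.

```python
# Illustrative restatement of step 2: sample a fixed number of evenly spaced
# depth values from the preset depth range [d_min, d_max].
def depth_hypotheses(d_min, d_max, num_depth_layers):
    span = d_max - d_min                      # depth span
    interval = span / (num_depth_layers - 1)  # depth interval
    return [d_min + k * interval for k in range(num_depth_layers)]

# depth_hypotheses(1.0, 9.0, 5) -> [1.0, 3.0, 5.0, 7.0, 9.0]
```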
  • Step 3 According to the differences between the first shooting parameter and the reference shooting parameter and according to the multiple depth values, determine multiple viewing angle conversion matrices corresponding to the multiple depth values.
  • the viewing angle conversion matrix is used to perform viewing angle transformation of the image, and images shot from different angles can be converted into the same viewing angle.
  • Optionally, the viewing angle conversion matrix may be a homography matrix (Homography Matrix) or another matrix. Since a viewing angle conversion matrix is determined by the shooting parameters of the two images and a depth value, multiple viewing angle conversion matrices can be determined according to the first shooting parameter, the reference shooting parameter, and the multiple depth values, and each viewing angle conversion matrix corresponds to one depth value.
  • Step 4 According to the multiple viewing angle conversion matrices, perform viewing angle conversion on the first feature map respectively to obtain multiple converted second feature maps.
  • the viewing angle corresponding to the second feature map is the same as the viewing angle of the reference image.
  • each viewing angle conversion matrix is used to perform viewing angle conversion, and then multiple second feature maps after conversion can be obtained.
  • Optionally, the feature map set includes multiple first feature maps. For each first feature map, the multiple viewing angle conversion matrices corresponding to that first feature map are determined, and viewing angle conversion is performed on the first feature map according to those matrices to obtain the multiple converted second feature maps corresponding to that first feature map.
  • Since the viewing angle is determined by the shooting parameters of the camera and the relative position between the camera and the target object, and the multiple depth values corresponding to the different first feature maps are all the depth values corresponding to the convolutional layer, while different first feature maps correspond to different first shooting parameters, different first feature maps correspond to different viewing angle conversion matrices. Therefore, the multiple second feature maps converted from each first feature map can be acquired separately.
  • For example, the feature map set includes 3 first feature maps, and the convolutional layer that outputs the feature map set corresponds to 20 depth values. Then 20 viewing angle conversion matrices can be determined for each first feature map, and by applying the 20 viewing angle conversion matrices corresponding to each first feature map, 20 converted second feature maps are obtained for each first feature map. Performing viewing angle conversion on the three first feature maps therefore yields 60 second feature maps in total.
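  • As an illustration only, one common way to realize this per-depth view conversion is plane-sweep warping: each reference pixel is back-projected to the given depth value and re-projected into the first image. The sketch below assumes pinhole intrinsics K and world-to-camera extrinsics (R, t) for both views; the patent itself only states that a homography matrix determined by the shooting parameters and the depth value is used.

```python
# Hedged sketch of steps 3-4: warp a first image's feature map into the reference
# view for a single depth value by back-projecting each reference pixel to that
# depth and re-projecting it into the first image (plane-sweep warping).
import numpy as np

def warp_to_reference(src_feat, K_ref, R_ref, t_ref, K_src, R_src, t_src, depth):
    h, w = src_feat.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x HW
    # Back-project reference pixels to 3D at the given depth, then to world coordinates.
    cam_ref = depth * (np.linalg.inv(K_ref) @ pix)
    world = R_ref.T @ (cam_ref - t_ref.reshape(3, 1))
    # Project the 3D points into the first (source) view.
    proj = K_src @ (R_src @ world + t_src.reshape(3, 1))
    z = proj[2].reshape(h, w)
    u = np.round(proj[0] / np.maximum(proj[2], 1e-6)).astype(int).reshape(h, w)
    v = np.round(proj[1] / np.maximum(proj[2], 1e-6)).astype(int).reshape(h, w)
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    warped = np.zeros_like(src_feat)       # the converted "second feature map"
    warped[valid] = src_feat[v[valid], u[valid]]
    return warped
```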
  • Optionally, a Coarse-To-Fine Depth Estimator (a sparse-to-dense depth predictor) can be used to process the first feature maps, and the Coarse-To-Fine Depth Estimator outputs the multiple second feature maps.
  • The aggregated feature is used to represent the multi-dimensional features of the feature map set corresponding to the multiple target images. If the reference feature map and the second feature maps are all one-dimensional feature maps, fusing the reference feature map with the second feature maps yields a two-dimensional feature map. Since the shooting angle of view corresponding to the obtained second feature maps is the same as the shooting angle of view corresponding to the reference feature map, the reference feature map can be directly fused with the second feature maps to obtain the aggregated feature.
  • In a possible implementation, there are multiple first images, and step 2034 may include the following steps 5-7:
  • Step 5 Perform fusion processing on the first number of reference feature maps to obtain a reference feature volume corresponding to the reference image.
  • the first number is equal to the number of multiple depth values
  • the reference feature volume is used to represent the multi-dimensional features corresponding to the reference image.
  • Since the multiple depth values are determined for the convolutional layer that outputs the feature map set, when viewing angle conversion is performed on each first image in the feature map set, the multiple converted second feature maps corresponding to each first image can be obtained. To ensure that the reference feature maps are consistent in number with the multiple second feature maps corresponding to each first feature map, so as to facilitate the subsequent fusion of the reference features and the second features, the first number of reference feature maps need to be fused to obtain the reference feature volume.
  • Optionally, the first number of reference feature maps are stacked to obtain the reference feature volume. Since each reference feature map is a one-dimensional feature map, stacking the first number of reference feature maps yields a multi-dimensional reference feature volume.
  • Step 6 For each first image, perform fusion processing on the multiple second feature maps converted from the first feature map corresponding to the first image to obtain a first feature volume, and determine the difference between the first feature volume and the reference feature volume as a second feature volume.
  • The first feature volume is used to represent the multi-dimensional features corresponding to the first image, and the second feature volume is used to represent the multi-dimensional features corresponding to the difference between the first image and the reference image.
  • For any first image, fusion processing is performed on the multiple second feature maps converted from the first feature map corresponding to that first image to obtain the first feature volume corresponding to that first image. The multiple converted second feature maps of each of the other first images are fused in the same way, so that the first feature volumes corresponding to the multiple first images are obtained.
  • the first feature volumes corresponding to different first images are different.
  • the difference between each first feature volume and the reference feature volume can be directly determined, thereby obtaining multiple second feature volumes.
  • The second feature volumes corresponding to different first images are also different.
  • Since each second feature map is a one-dimensional feature map, stacking the multiple second feature maps yields a multi-dimensional feature volume.
  • Step 7 Perform fusion processing on the determined multiple second feature volumes to obtain aggregated features.
  • the aggregate feature is used to represent the multi-dimensional features corresponding to multiple target images, and the aggregate feature is the aggregate feature corresponding to the convolutional layer that outputs the feature map set.
  • The resulting aggregated feature eliminates the perspective differences between the multiple target images, merges the object as shot from different perspectives, and enriches the features of the object from multiple perspectives, so that the aggregated feature constitutes a comprehensive representation of the object.
  • Optionally, step 7 may include: obtaining a weight matrix corresponding to the convolutional layer that outputs the feature map set, and performing weighted fusion processing on the multiple second feature volumes according to the weight matrix to obtain the aggregated feature.
  • the weight matrix includes the weight corresponding to each pixel position in the feature map output by the convolutional layer.
  • Optionally, the product of each second feature volume and the weight matrix is determined, and the ratio of the sum of the products corresponding to the multiple second feature volumes to the number of the multiple second feature volumes is used as the aggregated feature. In this way, the influence of the weights is incorporated when the multiple second feature volumes are fused, which improves the accuracy of the obtained aggregated feature.
  • the weight matrix can be obtained through WeightNet (weight matrix acquisition model) training, and the WeightNet can be composed of multiple convolutional layers and a ResNet (Residual Network, deep residual network) block.
  • In the following, the average-pooled feature volume of V′_{i,d,h,w} is denoted avg_pooling(V′_{i,d,h,w}) and the max-pooled feature volume is denoted max_pooling(V′_{i,d,h,w}), where i represents any one of the multiple first images and is a positive integer greater than 0 and not greater than N−1; N represents the number of the multiple target images and is a positive integer greater than or equal to 2; d represents any one of the multiple depth values; h represents the height of the feature maps in the feature map set; and w represents the width of the feature maps in the feature map set.
  • Optionally, the Pixel-Wise View Aggregation (pixel-level view aggregation) method can be used to fuse the reference feature map with the second feature maps. That is, in a possible implementation, the aggregated feature, the reference feature volume, the first feature volume, the second feature volume, and the weight matrix satisfy the following relationships:
  • V′_{i,d,h,w} = V_{i,d,h,w} − V_{0,d,h,w}
  • C_{d,h,w} = (1/(N−1)) · Σ_{i=1}^{N−1} U_{h,w} ⊙ V′_{i,d,h,w}
  • where i represents the serial number of the first image and is a positive integer greater than 0 and not greater than N−1; N represents the number of the multiple target images and is an integer greater than 1; d represents any one of the multiple depth values; h represents the height of the feature maps in the feature map set; w represents the width of the feature maps in the feature map set; V′_{i,d,h,w} represents the second feature volume; V_{i,d,h,w} represents the first feature volume; V_{0,d,h,w} represents the reference feature volume; C_{d,h,w} represents the aggregated feature; U_{h,w} represents the weight matrix; and ⊙ represents element-wise multiplication.
  • Convolution processing is performed on the plurality of second feature volumes 401 according to the weight matrix 405 to obtain the aggregated feature 406.
  • Optionally, the Voxel-Wise View Aggregation (voxel-level view aggregation) method can be used to fuse the reference feature map with the second feature maps. That is, in a possible implementation, the aggregated feature, the reference feature volume, the first feature volume, the second feature volume, and the weight matrix satisfy the following relationships:
  • V′_{i,d,h,w} = V_{i,d,h,w} − V_{0,d,h,w}
  • C_{d,h,w} = (1/(N−1)) · Σ_{i=1}^{N−1} U_{d,h,w} ⊙ V′_{i,d,h,w}
  • where i represents the serial number of the first image and is a positive integer greater than 0 and not greater than N−1; N represents the number of the multiple target images and is an integer greater than 1; d represents any one of the multiple depth values; h represents the height of the feature maps in the feature map set; w represents the width of the feature maps in the feature map set; V′_{i,d,h,w} represents the second feature volume; V_{i,d,h,w} represents the first feature volume; V_{0,d,h,w} represents the reference feature volume; C_{d,h,w} represents the aggregated feature; U_{d,h,w} represents the weight matrix corresponding to the depth value d; and ⊙ represents element-wise multiplication.
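  • For illustration only, the sketch below shows how the feature volumes and the weighted aggregation described above could be computed; the sum-and-average fusion follows the verbal description of step 7, and the weight tensor U is assumed to come from a WeightNet-style network that is not reproduced here.

```python
# Hedged sketch of steps 5-7: stack the reference feature map into the reference
# feature volume V0, take the differences V'_i = Vi - V0 as second feature volumes,
# and apply the weight tensor U (H x W for pixel-wise aggregation, D x H x W for
# voxel-wise aggregation) before averaging over the N-1 first images.
import torch

def aggregate(ref_feature_map, warped_maps_per_image, U):
    # ref_feature_map: (H, W); warped_maps_per_image: list of N-1 tensors of shape (D, H, W)
    D = warped_maps_per_image[0].shape[0]
    V0 = ref_feature_map.unsqueeze(0).expand(D, -1, -1)    # reference feature volume
    diffs = [Vi - V0 for Vi in warped_maps_per_image]      # second feature volumes V'_i
    # C = (1/(N-1)) * sum_i U * V'_i; broadcasting covers both weight shapes.
    C = sum(U * Vp for Vp in diffs) / len(diffs)
    return C                                               # aggregated feature (D, H, W)
```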
  • the embodiment of the present application will directly aggregate the multiple feature maps in each feature map set after obtaining the feature map sets respectively output by multiple convolutional layers.
  • the computer device performs fusion processing on the obtained multiple aggregated features to obtain a depth image.
  • The depth image includes the depth values of the target object. Since the feature maps output by the different convolutional layers are different and contain different amounts of information, the multiple aggregated features obtained through the multiple convolutional layers also contain different information. Therefore, fusing the multiple aggregated features enriches the information of the feature maps, thereby improving the accuracy of the obtained depth image.
  • Optionally, each aggregated feature includes multi-dimensional features, and the multi-dimensional features of the aggregated features are fused to obtain the depth image.
  • the computer device performs conversion processing on the depth image to obtain point cloud data.
  • the point cloud data is data composed of multiple points in a three-dimensional coordinate system.
  • Optionally, for any pixel in the depth image, a point is created in a three-dimensional coordinate system according to the depth value corresponding to that pixel; multiple points can then be obtained from the depth values of the multiple pixels in the depth image, and these points form the point cloud data.
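  • A minimal back-projection sketch is given below for illustration; it assumes the reference camera intrinsic matrix K is known, which the patent does not state explicitly.

```python
# Illustrative sketch of step 205: back-project every pixel of the depth image
# into a 3D point using the reference camera intrinsic matrix K.
import numpy as np

def depth_to_point_cloud(depth, K):
    h, w = depth.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x HW
    points = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)              # 3 x HW
    return points.T                                                       # HW x 3 point cloud
```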
  • the computer device performs aggregation processing on the point cloud data to obtain a three-dimensional model of the target object.
  • the point cloud data is aggregated to connect multiple points in the point cloud data to obtain a three-dimensional model of the target object.
  • this step 206 may include: filtering the point cloud data to obtain filtered point cloud data, and performing aggregation processing on the filtered point cloud data to obtain a three-dimensional model of the target object.
  • Since the generated point cloud data is affected by noise, filtering out the noise in the point cloud data improves the accuracy of the filtered point cloud data, thereby improving the accuracy of the obtained three-dimensional model.
  • The related art provides a depth image generation method in which multi-level convolution processing is performed on a captured image of an object, the feature map output by the last convolutional layer is obtained, and convolution processing is performed on that feature map to obtain the depth image of the object. Because this method uses only the feature map output by the last convolutional layer in the process of acquiring the depth image, the feature map carries less information, resulting in poor accuracy of the depth image.
  • With the method provided in the embodiments of the present application, multiple target images are acquired, where the multiple target images are obtained by shooting the target object according to different perspectives; multi-level convolution processing is performed on the multiple target images through the multiple convolutional layers in the convolution model to obtain the feature map sets respectively output by the multiple convolutional layers; the multiple feature maps in each feature map set are aggregated by perspective to obtain the aggregated feature corresponding to each feature map set; and the multiple aggregated features obtained are fused to obtain a depth image.
  • The multiple target images are obtained by shooting the target object according to different perspectives, so that the multiple target images include information about the target object from different angles, which enriches the amount of information in the acquired target images. The multi-level convolution processing performed by the multiple convolutional layers yields multiple different feature map sets, which enriches the information carried by the feature maps, and the feature maps output by the multiple convolutional layers are fused, which enriches the information contained in the obtained depth image, thereby improving the accuracy of the depth image.
  • In addition, the multiple feature maps in each feature map set are aggregated by perspective, so that feature maps converted to the same perspective can subsequently be fused, which improves the accuracy of the resulting aggregated features and therefore the accuracy of the obtained depth image.
  • Moreover, the probability map corresponding to each aggregated feature is incorporated in the fusion, so that when the multiple aggregated features are fused, the influence of the probability at each pixel position is taken into consideration, which improves the accuracy of the obtained fourth aggregated feature and thereby the accuracy of the obtained depth image.
  • the foregoing step 204 may include the following steps 2041-2046:
  • the computer device uses the aggregation feature with the largest scale among the plurality of aggregation features as the first aggregation feature, and uses the other aggregation features among the plurality of aggregation features as the second aggregation feature.
  • Among the multiple convolutional layers in the convolution model, the scales of the output feature maps decrease layer by layer. Since the aggregated features are obtained by fusing the feature maps, the scales of the aggregated features corresponding to the multiple convolutional layers also decrease in turn. Therefore, aggregated features of multiple scales can be obtained through the multiple convolutional layers.
  • Optionally, the scale of a feature map includes the height of the feature map and the width of the feature map. Since the dimension of each feature map is 1, the aggregated feature obtained by fusing multiple feature maps is a multi-dimensional feature, and the scale of the aggregated feature includes the height of the feature map, the width of the feature map, and the number of dimensions, where the number of dimensions is equal to the number of feature maps in the feature map set corresponding to the aggregated feature. Since the scales of the feature maps output by the multiple convolutional layers decrease in turn, the scales of the multiple aggregated features corresponding to the multiple convolutional layers also decrease in turn.
  • the computer device performs multi-level convolution processing on the first aggregated feature to obtain multiple third aggregated features.
  • the scales of the plurality of third aggregated features correspond to the scales of the plurality of second aggregated features in a one-to-one correspondence.
  • Optionally, multi-level convolution processing is performed on the first aggregated feature through multiple convolutional layers: convolution processing is performed on the first aggregated feature through the first convolutional layer to obtain the first of the third aggregated features, and convolution processing is performed on the third aggregated feature output by the previous convolutional layer through the next convolutional layer to obtain the third aggregated feature output by the next convolutional layer, until the last convolutional layer outputs the last third aggregated feature.
  • the computer device performs fusion processing on the second aggregated feature of the first scale and the third aggregated feature of the first scale, and performs deconvolution processing on the fused feature to obtain the fourth aggregated feature of the second scale.
  • The first scale is the smallest scale among the multiple second aggregated features, and the second scale is the scale one level above the first scale.
  • The second aggregated feature of the first scale is fused with the third aggregated feature of the first scale, and the scale of the resulting fused feature is the first scale. The fused feature is then subjected to deconvolution processing so that its scale is increased, thereby obtaining the fourth aggregated feature of the second scale.
  • The computer device continues to fuse the currently obtained fourth aggregated feature with the second aggregated feature and the third aggregated feature whose scale is the same as that of the fourth aggregated feature, and performs deconvolution processing on the fused feature to obtain the fourth aggregated feature of the scale one level above, until the fourth aggregated feature with the same scale as the first aggregated feature is obtained.
  • The fused feature has the same scale as the currently obtained fourth aggregated feature.
  • Step 2044 is executed multiple times, and the scales of the fourth aggregated features obtained by the successive executions increase in turn, so that the fourth aggregated feature of the largest scale, that is, the fourth aggregated feature with the same scale as the first aggregated feature, can finally be obtained.
  • For example, the fourth aggregated feature of the second scale is obtained from the second aggregated feature of the first scale and the third aggregated feature of the first scale; the fourth aggregated feature of the second scale, the second aggregated feature of the second scale, and the third aggregated feature of the second scale are fused, and the fused feature is subjected to deconvolution processing to obtain the fourth aggregated feature of the third scale; the fourth aggregated feature of the third scale, the second aggregated feature of the third scale, and the third aggregated feature of the third scale are fused, and the fused feature is subjected to deconvolution processing to obtain the fourth aggregated feature of the fourth scale; and the fourth aggregated feature of the fourth scale, the second aggregated feature of the fourth scale, and the third aggregated feature of the fourth scale are fused, and the fused feature is subjected to deconvolution processing to obtain the fourth aggregated feature of the fifth scale, where the fifth scale is equal to the scale of the first aggregated feature.
  • In a possible implementation, this step 2044 may include: continuing to fuse the currently obtained fourth aggregated feature, the second aggregated feature and the third aggregated feature whose scale is the same as that of the fourth aggregated feature, and the probability map of that second aggregated feature, and performing deconvolution processing on the fused feature to obtain the fourth aggregated feature of the scale one level above.
  • That is, the second aggregated feature, the third aggregated feature, and the fourth aggregated feature of the same scale, together with the probability map corresponding to the second aggregated feature, are fused, and the fused feature is subjected to deconvolution processing; these steps are repeated until the fourth aggregated feature of the largest scale, whose scale is equal to that of the first aggregated feature, is obtained.
  • The probability map includes the probability corresponding to each pixel position in the second aggregated feature. When acquiring the multiple fourth aggregated features, the probability map of the second aggregated feature is incorporated, so that the influence of the probability at each pixel position is taken into consideration when the multiple aggregated features are fused, which improves the accuracy of the obtained fourth aggregated features and therefore the accuracy of the subsequently obtained depth image.
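  • For illustration only, the following sketch outlines the coarse-to-fine fusion of steps 2042-2044 without the optional probability maps; the sum fusion, the shared deconvolution layer, and the channel count are assumptions rather than details taken from the patent.

```python
# Hedged sketch of steps 2042-2044: starting from the smallest scale, the second
# and third aggregated features of one scale (plus the fourth aggregated feature
# obtained so far) are fused and upsampled by a transposed convolution to the
# scale one level above.
import torch
import torch.nn as nn

def coarse_to_fine_fusion(second_feats, third_feats, channels):
    # second_feats / third_feats: lists ordered from largest to smallest scale,
    # with matching scales at each index.
    deconv = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)
    fourth = None
    for s2, s3 in zip(reversed(second_feats), reversed(third_feats)):  # smallest scale first
        fused = s2 + s3 if fourth is None else s2 + s3 + fourth
        fourth = deconv(fused)   # fourth aggregated feature of the next larger scale
    return fourth                # finally has the same scale as the first aggregated feature
```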
  • the computer device performs fusion processing on the currently obtained fourth aggregated feature and the first aggregated feature to obtain the fifth aggregated feature.
  • The fourth aggregated feature and the first aggregated feature are fused so that the fused fifth aggregated feature has the same scale as the first aggregated feature.
  • Since the feature map set output by each convolutional layer corresponds to an aggregated feature, fusing the aggregated features corresponding to the multiple convolutional layers makes the fifth aggregated feature include the information of the feature maps output by the multiple convolutional layers.
  • the computer device performs convolution processing according to the probability map corresponding to the fifth aggregation feature and the first aggregation feature to obtain a depth image.
  • the probability map is used to represent the probability corresponding to each pixel position in the first aggregation feature, and each probability is used to represent the probability that the depth value corresponding to each pixel position is correct.
  • Optionally, the probability map may be obtained by performing convolution processing on the first aggregated feature through a probability map acquisition model. The probability map acquisition model may include an encoder and a decoder: the first aggregated feature is encoded by the encoder and then decoded by the decoder to obtain the probability map.
  • the probability map acquisition model may be a 3D CNN (3Dimension Convolutional Neural Networks, three-dimensional convolutional neural network) model, or other neural network models.
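  • As an illustration of such a probability map acquisition model (the layer counts and channel sizes are assumptions, not taken from the patent), a small 3D CNN encoder-decoder could look as follows.

```python
# Hedged sketch of a probability map acquisition model: a small 3D CNN
# encoder-decoder that turns the first aggregated feature (D x H x W) into a
# probability for every depth value at every pixel position.
import torch
import torch.nn as nn

class ProbabilityNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1),
        )

    def forward(self, aggregated):              # aggregated: (B, 1, D, H, W)
        logits = self.decoder(self.encoder(aggregated))
        return torch.softmax(logits, dim=2)     # probabilities over the D depth values
```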
  • Since each pixel position in the fifth aggregated feature corresponds one-to-one to a pixel position in the first aggregated feature, each pixel position in the fifth aggregated feature also corresponds one-to-one to a probability in the probability map. Therefore, the fifth aggregated feature is convolved according to the probability map to obtain the depth image, and incorporating the corresponding probabilities into the aggregated feature improves the accuracy of the obtained depth image.
  • In a possible implementation, the convolutional layer corresponding to the first aggregated feature corresponds to multiple depth values, the first aggregated feature is composed of features corresponding to the multiple depth values (each second feature map fused with the reference feature map corresponds to one depth value), and the fifth aggregated feature includes multiple feature maps whose number is equal to the number of depth values. Step 2046 may then include: determining the depth value corresponding to each feature map in the fifth aggregated feature, determining the probability corresponding to each feature map according to the probability map corresponding to the first aggregated feature, weighting the depth values corresponding to the multiple feature maps by the probabilities corresponding to the multiple feature maps to obtain the predicted depth, and forming the depth image from the predicted depths.
  • the depth value d corresponding to each feature map, the probability P corresponding to each feature map, and the predicted depth E satisfy the following relationship:
    E = ∑_{d = d_min}^{d_max} d × P(d)
  • where d_min represents the minimum value among the multiple depth values, d_max represents the maximum value among the multiple depth values, and P(d) represents the probability corresponding to the depth value d.
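  • For illustration, a minimal NumPy sketch of this probability-weighted depth regression (a sketch under assumptions: the probability volume has shape (D, H, W) and is already normalized over the depth dimension, and the depth hypotheses are evenly spaced over the preset depth range; the names are not from the patent):

```python
import numpy as np

def expected_depth(prob_volume, d_min=425.0, d_max=935.0):
    """Predicted depth as the probability-weighted sum of the depth hypotheses,
    E = sum_d d * P(d), computed independently for every pixel.

    prob_volume: (D, H, W) array, normalized along the depth axis.
    d_min, d_max: bounds of the preset depth range (mm), as in the DTU example.
    """
    num_depths = prob_volume.shape[0]
    depth_values = np.linspace(d_min, d_max, num_depths)                # (D,)
    return (depth_values[:, None, None] * prob_volume).sum(axis=0)      # (H, W)

# Toy check: 192 depth layers, uniform probabilities -> every pixel is the mid depth.
prob = np.full((192, 4, 4), 1.0 / 192)
print(expected_depth(prob).round(1))   # each entry ≈ (425 + 935) / 2 = 680.0
```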
  • the process of acquiring a depth image by using multiple target images in the above embodiment can be implemented by a depth image generation model.
  • Multiple target images are input into the depth image generation model, which processes them and outputs the depth image.
  • the depth image generation model may be VA-MVSNet (View Aggregation Multi-view Stereo Network, a network model) or another network model.
  • When training the depth image generation model, multiple sample images and the corresponding depth images are acquired; the sample images are used as the input of the depth image generation model and the depth images as its expected output, and the depth image generation model is trained iteratively.
  • In one example, the depth image generation model is trained on the DTU (Technical University of Denmark) dataset; the number of sample images is 3, the resolution of each sample image is 640x512, the preset depth range is from 425 mm to 935 mm, and the number of depth layers is 192.
  • The depth image generation model is trained using Adam (an optimization algorithm) with an initial learning rate of 0.1 and an attenuation parameter of 0.9, adjusting the weight matrix w and the bias parameter b in the depth image generation model.
  • The output depth image is compared with the real depth image to obtain the prediction error, and the parameters of the depth image generation model are adjusted according to this error so that the loss function of the depth image generation model is reduced.
  • the loss function parameter ⁇ of each scale is ⁇ 0.32, 0.16, 0.04, 0.01 ⁇ respectively
  • the number of scales is 4, and the number of GPUs (Graphics Processing Unit, graphics processor) used for training on the DTU dataset is also 4.
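  • For illustration only, the optimizer setup described above might be written as follows (a sketch assuming a PyTorch implementation, which the patent does not specify; whether the 0.9 attenuation applies to the learning rate or to the optimizer's internal moment estimates is also not stated, so an exponential learning-rate decay is used here as one possible reading, and `model` is a placeholder module):

```python
import torch

# 'model' stands in for the depth image generation model; a single conv layer
# is used here only so that the snippet runs.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)   # initial learning rate 0.1
# One reading of the "attenuation parameter of 0.9": exponential learning-rate decay.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

def train_step(images, gt_depth, loss_fn):
    """One iteration: compare the predicted depth with the real depth image and
    adjust the weight matrix w and bias b through the gradients of the loss."""
    optimizer.zero_grad()
    pred = model(images)
    loss = loss_fn(pred, gt_depth)
    loss.backward()
    optimizer.step()
    return loss.item()
# scheduler.step() would then be called once per epoch to apply the 0.9 decay.
```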
  • the depth image generation model needs to be tested.
  • the number of input images is 5
  • the number of depth layers is 192
  • the number of pyramid layers is 3
  • the downsampling parameter is 0.5.
  • the input image size is 1600x1184
  • the input image size is 1920x1056.
  • the depth image acquisition model can be trained according to the sum of the loss functions of the depth image acquisition model over the multiple scales.
  • When the sum of the loss functions reaches the preset threshold, the training of the depth image acquisition model is completed.
  • the sum of the loss functions can be expressed as E, which satisfies E = ∑_l λ_l × L_l, where L_l is the loss of the depth image at the l-th scale and λ_l is the loss function parameter of that scale listed above.
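  • A small sketch of such a multi-scale loss (a sketch under assumptions: each per-scale loss L_l is taken to be a mean absolute error on pixels with valid ground truth, which the patent does not spell out; the λ values are the ones listed above):

```python
import numpy as np

LAMBDAS = [0.32, 0.16, 0.04, 0.01]   # loss function parameters per scale

def total_loss(pred_depths, gt_depths, lambdas=LAMBDAS):
    """Weighted sum of per-scale losses, E = sum_l lambda_l * L_l, where each
    L_l is taken here to be a mean absolute error over valid pixels."""
    total = 0.0
    for pred, gt, lam in zip(pred_depths, gt_depths, lambdas):
        valid = gt > 0                                  # pixels with ground truth
        total += lam * np.abs(pred[valid] - gt[valid]).mean()
    return total

# Toy usage with random depth maps at 4 pyramid scales of a 640x512 image.
rng = np.random.default_rng(0)
preds = [rng.uniform(425, 935, (512 // 2**i, 640 // 2**i)) for i in range(4)]
gts = [rng.uniform(425, 935, (512 // 2**i, 640 // 2**i)) for i in range(4)]
print(total_loss(preds, gts))
```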
  • the depth image generation model includes a first convolution model 701, a second convolution model 702, a third convolution model 703, and a fourth convolution model 704.
  • the first convolution model 701 is the same as the convolution model in step 202 above, and is used to obtain the feature map of the target image, and input the feature map set output by each first convolution layer 7011 to the second convolution model 702;
  • the second convolution model 702 performs perspective aggregation on each feature map set, and outputs the first aggregation feature 705 and the second aggregation feature 706;
  • the third convolution model 703 performs multi-level convolution processing on the first aggregation feature 705 through multiple second convolution layers 7031 to obtain a plurality of third aggregated features 707;
  • the fourth convolution model 704 passes through a plurality of third convolution layers 7041, executes the above-mentioned steps 2043-2046, and outputs a depth image 708.
  • Fig. 8 is a flowchart of a depth image generation method provided by an embodiment of the present application. As shown in Fig. 8, the method includes:
  • the computer device photographs a target object according to multiple different perspectives to obtain multiple original images, and determines the multiple original images as a target image set.
  • This step is similar to the way of obtaining the original image in the above step 201, and will not be repeated here.
  • the computer device performs multiple rounds of scale adjustment on multiple original images to obtain multiple sets of target images.
  • each target image set includes multiple target images of the same scale, and the scales of the target images in different target image sets are different.
  • Adjusting the scale of multiple original images may be: reducing multiple original images to obtain multiple target images of a smaller scale; or zooming in multiple original images to obtain multiple target images of larger scale. Since the scales of the multiple original images are equal, after the multiple original images are scaled in each round, the scales of the multiple target images obtained are equal, and the scales of the target images obtained by different rounds of scale adjustment are different.
  • In a possible implementation, the first round of scale adjustment is performed on the multiple original images to obtain the first set of target images, and the target images of the set obtained in the previous round then undergo the next round of scale adjustment to obtain the next set of target images, until multiple sets of target images are obtained.
  • the multiple rounds include 3 rounds, the first round of scale adjustment is performed on multiple original images to obtain the first set of target images, and the second round of scale adjustment is performed on multiple target images in the first set of target images to obtain For the second set of target images, the third round of scale adjustment is performed on multiple target images in the second set of target images to obtain the third set of target images.
  • multiple target image sets obtained through steps 801-802 can form an image pyramid.
  • the scale of the bottommost image is the largest, and as the level in the image pyramid increases, the scale of the image in the corresponding level decreases.
  • the target image set corresponding to the multiple original images is the bottom layer of the image pyramid.
  • The first round of scale adjustment is performed on the multiple original images to obtain the target image set one layer above the lowest layer; each subsequent round of scale adjustment on the current target image set yields the target image set of the next higher level, and the rounds are repeated to form an image pyramid containing a preset number of layers of target image sets, as illustrated by the sketch below.
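  • An illustrative sketch of building such a pyramid of target image sets (a simplification: a fixed strided subsampling stands in for whatever resampling the implementation actually uses, and the three-level layout follows the example above; none of the names come from the patent):

```python
import numpy as np

def build_pyramid(original_images, num_levels=3, step=2):
    """Level 0 holds the original images (largest scale); each further level
    subsamples the previous level's images by 'step' in both dimensions."""
    pyramid = [list(original_images)]
    for _ in range(num_levels - 1):
        pyramid.append([img[::step, ::step] for img in pyramid[-1]])
    return pyramid

# Example: 5 views captured from different perspectives, 1600x1184 each.
views = [np.zeros((1184, 1600, 3), dtype=np.uint8) for _ in range(5)]
levels = build_pyramid(views, num_levels=3)
print([lvl[0].shape for lvl in levels])   # [(1184, 1600, 3), (592, 800, 3), (296, 400, 3)]
```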
  • the computer device respectively executes the above steps 201-208 for multiple sets of target image sets to obtain a depth image corresponding to each set of target image sets.
  • each target image set includes multiple target images
  • the multiple target images in each target image set are used as the multiple target images in the above step 201 and are processed to obtain the depth image corresponding to each target image set; that is, multiple depth images are obtained.
  • the scales of images in different target image sets are different, the scales of depth images corresponding to different sets of target image sets are different, that is, for multiple sets of target image sets, depth images of multiple scales are obtained.
  • the computer device performs fusion processing on the depth images corresponding to the multiple sets of target image sets to obtain a fused depth image.
  • the depth images of different scales contain different depth values.
  • the depth images of multiple scales can be fused in sequence from small to large scale. By fusing depth images of multiple scales, the depth value of the fused depth image is enriched, thereby improving the accuracy of the fused depth image.
  • In a possible implementation, this step 804 may include: starting from the depth image of the smallest scale, replacing the depth value of the second pixel corresponding to the first pixel in the depth image of the upper-level (larger) scale with the depth value of each first pixel in the current depth image that meets the preset condition, until the depth values in the depth image of the largest scale have been replaced, and taking the largest-scale depth image after depth value replacement as the fused depth image.
  • the depth image includes multiple pixels, and each pixel corresponds to a depth value.
  • the first pixel corresponds to the second pixel, which means that the corresponding position of the first pixel and the second pixel are the same.
  • Meeting the preset condition means that the depth value of the first pixel is more accurate than the depth value of the second pixel.
  • Therefore, the depth value of the first pixel, which has high accuracy in the small-scale depth image, replaces the depth value of the corresponding second pixel in the depth image of the upper-level scale, so that the depth value of each pixel in that depth image becomes more accurate.
  • In this way, the depth value of the second pixel at the upper-level scale is replaced with the depth value of the first pixel in the small-scale depth image; after multiple rounds of replacement, the depth value of each pixel in the largest-scale depth image is more accurate, thereby improving the accuracy of the acquired depth image.
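  • A hedged sketch of this coarse-to-fine replacement (assuming nearest-neighbour correspondence between pixels of adjacent scales and a precomputed boolean mask marking the first pixels that meet the preset condition; both simplify the mapping relationship and the condition described in the following steps):

```python
import numpy as np

def fuse_depth_pyramid(depth_maps, accept_masks):
    """depth_maps: list of depth maps ordered from smallest to largest scale.
    accept_masks: for each fusion round, a boolean mask with the shape of the
    current (smaller) fused map, marking first pixels that meet the condition."""
    fused = depth_maps[0]
    for larger, mask in zip(depth_maps[1:], accept_masks):
        out = larger.copy()
        sh, sw = fused.shape
        lh, lw = larger.shape
        # Nearest-neighbour correspondence between the two scales.
        rows = np.arange(lh) * sh // lh
        cols = np.arange(lw) * sw // lw
        up_depth = fused[rows[:, None], cols[None, :]]
        up_mask = mask[rows[:, None], cols[None, :]]
        out[up_mask] = up_depth[up_mask]      # replace the second pixels' depths
        fused = out
    return fused                               # largest-scale fused depth image
```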
  • The obtained depth images of multiple scales constitute an image pyramid of depth maps, and this fusion process may be referred to as multi-metric pyramid depth map aggregation.
  • Step 1: For a first depth image and a second depth image of adjacent scales, map any second pixel in the second depth image into the first depth image according to the pixel mapping relationship between the first depth image and the second depth image, obtaining the first pixel. The scale of the second depth image is larger than the scale of the first depth image.
  • the pixel mapping relationship includes a corresponding relationship between a plurality of pixels in the first depth image and a plurality of pixels in the second depth image. Since the first depth image and the second depth image are both obtained through multiple target images, the scales of the target images corresponding to different depth images are different, and the target images of different scales are all obtained by adjusting the original image. Therefore, the corresponding relationship between the multiple pixels in the first depth image and the second depth image can be determined, so that the pixel mapping relationship between the first depth image and the second depth image can be obtained.
  • Since the scale of the first depth image is smaller than that of the second depth image, either the two depth images contain the same number of pixels, in which case each first pixel in the first depth image is smaller in size than each second pixel in the second depth image; or the pixels of the two depth images have the same size, in which case the number of first pixels in the first depth image is smaller than the number of second pixels in the second depth image and each first pixel corresponds to a plurality of second pixels.
  • Step 2 According to the pixel mapping relationship, the first pixel is inversely mapped to the second depth image to obtain the third pixel.
  • The process of determining the corresponding pixel in the small-scale depth image from a pixel in the large-scale depth image is the mapping process; the process of determining the corresponding pixel in the large-scale depth image from a pixel in the small-scale depth image is called the de-mapping (inverse mapping) process.
  • Because the scales of the first depth image and the second depth image are different, there is no guarantee that their pixels correspond one-to-one; therefore, after the second pixel in the second depth image is mapped to the first depth image to obtain the first pixel, inversely mapping the first pixel back to the second depth image may yield a third pixel that differs from the second pixel.
  • Step 3 In response to the distance between the first pixel and the third pixel being less than the first preset threshold, it is determined that the first pixel corresponds to the second pixel.
  • the first preset threshold may be any preset value, such as 1, 2 and so on.
  • A distance between the first pixel and the third pixel smaller than the first preset threshold indicates that image consistency between the first pixel and the second pixel is satisfied, so it can be determined that the first pixel corresponds to the second pixel.
  • In a possible implementation, the distance between the first pixel and the third pixel is determined according to the coordinate value of the first pixel and the coordinate value of the third pixel.
  • the coordinate value P 1 of the first pixel and the coordinate value P 3 of the third pixel satisfy the following relationship: ‖P 1 − P 3‖ < M,
  • where M is the first preset threshold, an arbitrary constant such as 1.
  • In a possible implementation, this step 3 may include: in response to the distance being less than the first preset threshold and the difference between the depth values corresponding to the first pixel and the third pixel being smaller than the second preset threshold, determining that the first pixel corresponds to the second pixel.
  • the second preset threshold may be any preset value.
  • A distance between the first pixel and the third pixel smaller than the first preset threshold indicates that image consistency between the first pixel and the second pixel is satisfied, and a difference between the depth values corresponding to the first pixel and the third pixel smaller than the second preset threshold indicates that geometric consistency between the first pixel and the second pixel is satisfied, so it can be determined that the first pixel corresponds to the second pixel.
  • each pixel has a corresponding depth value.
  • the depth value D(P 1) corresponding to the first pixel and the depth value d 3 corresponding to the third pixel satisfy the geometric consistency condition when the difference between them is smaller than the second preset threshold.
  • In a possible implementation, in response to the probability corresponding to the depth value of the first pixel being greater than the second preset threshold and the probability corresponding to the depth value of the second pixel being less than the third preset threshold, it is determined that the first pixel meets the preset condition.
  • both the second preset threshold and the third preset threshold may be any preset values, for example, the second preset threshold is 0.9 and the third preset threshold is 0.5.
  • the probability corresponding to the depth value of the first pixel is greater than the second preset threshold, and the probability corresponding to the depth value of the second pixel is less than the third preset threshold, indicating that the depth value of the first pixel is more accurate than the depth value of the second pixel Therefore, it is determined that the first pixel meets the preset condition, and then the depth value of the first pixel can be replaced with the depth value of the second pixel.
  • the probability P(P 1) corresponding to the depth value of the first pixel and the probability P(P 2) corresponding to the depth value of the second pixel satisfy the following relationship: P(P 1) > Y and P(P 2) < Z,
  • where Y is the second preset threshold and Z is the third preset threshold; Y and Z are arbitrary constants with Z less than Y, for example Y is 0.9 and Z is 0.5.
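  • Putting the image consistency, geometric consistency and probability checks together (an illustrative sketch only: the projection and back-projection that produce the third pixel are abstracted into precomputed coordinates and depths, the depth tolerance is given its own name because the text reuses the term "second preset threshold", and the example values of M, Y and Z follow the ones above):

```python
import numpy as np

def first_pixel_accepted(p1, p3, d_p1, d3, prob_p1, prob_p2,
                         m=1.0, depth_tol=0.01, y=0.9, z=0.5):
    """Decide whether the first pixel's depth may replace the second pixel's.

    p1, p3   : (row, col) coordinates of the first pixel and of the third pixel
               obtained by de-mapping the first pixel into the second depth image.
    d_p1, d3 : depth values associated with the first and third pixels.
    prob_p1  : probability of the first pixel's depth value.
    prob_p2  : probability of the corresponding second pixel's depth value.
    """
    # Image consistency: reprojection distance below the first preset threshold M.
    image_ok = np.hypot(p1[0] - p3[0], p1[1] - p3[1]) < m
    # Geometric consistency: depth difference below a preset tolerance.
    geometry_ok = abs(d_p1 - d3) < depth_tol
    # Preset condition: first pixel confident (> Y) while second pixel is not (< Z).
    prob_ok = (prob_p1 > y) and (prob_p2 < z)
    return bool(image_ok and geometry_ok and prob_ok)

print(first_pixel_accepted((10, 12), (10.4, 12.3), 0.681, 0.684, 0.95, 0.3))  # True
```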
  • Each pixel position in the fifth aggregated feature corresponds one-to-one to a probability in the probability map, so the probability corresponding to each feature map, that is, the probability corresponding to each depth value, can be determined. For each pixel in the depth image, a preset number of depth values are determined from the multiple depth values, and the sum of the probabilities corresponding to the preset number of depth values is determined as the probability of that pixel in the depth image.
  • the preset number of depth values is the preset number of depth values closest to the predicted depth value among the multiple depth values.
  • the preset number can be any preset number, such as 4 or 5.
  • For example, if the predicted depth of a pixel in the depth image is 1, the preset number is 4, and the multiple depth values are 0.2, 0.4, 0.6, 0.8, 1.2, 1.4, 1.6 and 1.8, then the preset number of depth values adjacent to the predicted depth 1 are determined to be 0.6, 0.8, 1.2 and 1.4, and the sum of the probabilities corresponding to these depth values is used as the probability of that pixel in the depth image.
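  • A small sketch of this confidence measure (assuming a per-pixel probability vector over the depth hypotheses is available; the probability values in the example are made up for illustration):

```python
import numpy as np

def pixel_confidence(depth_values, probs, predicted_depth, k=4):
    """Sum of the probabilities of the k depth hypotheses closest to the
    predicted depth, used as the pixel's probability in the depth image."""
    depth_values = np.asarray(depth_values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    nearest = np.argsort(np.abs(depth_values - predicted_depth))[:k]
    return probs[nearest].sum()

# The example above: predicted depth 1, hypotheses 0.2 ... 1.8, preset number 4.
d = [0.2, 0.4, 0.6, 0.8, 1.2, 1.4, 1.6, 1.8]
p = [0.05, 0.05, 0.10, 0.30, 0.30, 0.10, 0.05, 0.05]   # assumed probabilities
print(pixel_confidence(d, p, predicted_depth=1.0))      # 0.8 = P(0.6)+P(0.8)+P(1.2)+P(1.4)
```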
  • As shown in Fig. 9, the scale of the first depth image 901 is smaller than the scale of the second depth image 902, the first depth image 901 is itself obtained by fusing depth images of other, smaller scales, and the first probability map 903 corresponding to the first depth image 901 and the second probability map 904 corresponding to the second depth image 902 are determined.
  • When the first depth image 901 is fused with the second depth image 902, the depth value of the second pixel in the second depth image 902 corresponding to a first pixel in the first depth image 901 that meets the preset condition is replaced with the depth value of that first pixel, obtaining a third depth image 905 whose scale is equal to the scale of the second depth image 902; likewise, the probability corresponding to the second pixel in the second probability map 904 is replaced with the probability corresponding to the first pixel in the first probability map 903, generating a third probability map 906 corresponding to the third depth image 905.
  • the computer device performs conversion processing on the depth image to obtain point cloud data.
  • This step is similar to the above step 205, and will not be repeated here.
  • the computer device performs aggregation processing on the point cloud data to obtain a three-dimensional model of the target object.
  • This step is similar to the above step 206, and will not be repeated here.
  • In a possible implementation, each of the multiple target images is used in turn as the reference image, so that multiple sets of point cloud data are obtained; in step 806, the multiple sets of point cloud data are aggregated to obtain the three-dimensional model of the target object.
  • The method provided in this embodiment of the application acquires multiple target images, which are obtained by photographing the target object from different perspectives, performs multi-level convolution processing on the multiple target images through the multiple convolutional layers in the convolution model to obtain the feature map sets output by the multiple convolutional layers, and separately aggregates the multiple feature maps in each feature map set by perspective to obtain the aggregated feature corresponding to each feature map set.
  • the aggregated features are fused to obtain a depth image.
  • Because the multiple target images are obtained by photographing the target object from different perspectives, they contain information about the target object from different angles, which enriches the amount of information in the acquired target images; the multi-level convolution processing of the multiple convolutional layers yields multiple different feature map sets, which enriches the information content of the feature maps; and fusing the feature maps output by the multiple convolutional layers enriches the information contained in the obtained depth image, thereby improving the accuracy of the obtained depth image.
  • Moreover, the high-accuracy depth values in the low-scale depth images replace the corresponding depth values in the high-scale depth images, which improves the accuracy of the depth images and thereby the accuracy of the acquired 3D model.
  • In addition, each target image among the multiple target images is used as the reference image so that multiple sets of point cloud data are obtained, and aggregating the multiple sets of point cloud data enriches the information contained in the point cloud data, thereby improving the accuracy of the obtained 3D model.
  • As shown in Fig. 10, multiple original images are acquired and determined as the first target image set 1001, and two rounds of scale adjustment are performed on the first target image set to obtain the second target image set 1002 and the third target image set 1003. Each target image set is input to the depth image generation model 1004 to obtain depth images 1005 of multiple scales.
  • The multiple depth images are merged to obtain the fused depth image 1006.
  • The fused depth image 1006 is converted into point cloud data, and aggregation processing is performed on the obtained point cloud data to obtain the three-dimensional model 1007 of the target object.
  • Steps 801-804 in this embodiment of the application can be implemented through a network model: the multiple original images are input into the network model, which processes them to obtain multiple target image sets, obtains the depth image corresponding to each target image set, merges the multiple depth images, and outputs the merged depth image.
  • the network model may be PVA-MVSNet (Pyramid View Aggregation Multi-view Stereo Network, a pyramid multi-view stereo geometric neural network model), or another network model.
  • FIG. 11 is a flow chart of generating a three-dimensional model provided by an embodiment of the present application. As shown in FIG. 11, the method includes:
  • the user uses the camera of the terminal to shoot the target object according to different perspectives to obtain multiple original images.
  • the terminal determines the shooting parameters corresponding to each original image through the sensor.
  • the terminal inputs multiple original images and corresponding shooting parameters into the depth image generation model, and the depth image generation model outputs the depth image of the target object.
  • the terminal converts the depth image into point cloud data, filters the point cloud data, and fuses the filtered point cloud data to obtain a three-dimensional model of the target object.
  • the terminal displays the three-dimensional model of the target object.
  • FIG. 12 is a schematic structural diagram of a depth image generation device provided by an embodiment of the present application. As shown in FIG. 12, the device includes:
  • the image acquisition module 1201 is used to acquire multiple target images, and the multiple target images are obtained by shooting the target object according to different perspectives;
  • the convolution processing module 1202 is configured to perform multi-level convolution processing on multiple target images through multiple convolution layers in the convolution model to obtain a set of feature maps respectively output by the multiple convolution layers;
  • the perspective aggregation module 1203 is configured to separately aggregate multiple feature maps in each feature map set to obtain the aggregated features corresponding to each feature map set;
  • the feature fusion module 1204 is used to perform fusion processing on the obtained multiple aggregated features to obtain a depth image.
  • the device provided by the embodiment of the present application acquires multiple target images.
  • the multiple target images are obtained by shooting target objects from different perspectives.
  • Multi-level convolution processing is performed on the multiple target images through the multiple convolutional layers in the convolution model to obtain the feature map sets respectively output by the multiple convolutional layers, and the multiple feature maps in each feature map set are separately aggregated to obtain the aggregated feature corresponding to each feature map set.
  • the aggregated features are fused to obtain a depth image.
  • Because the multiple target images are obtained by photographing the target object from different perspectives, they contain information about the target object from different angles, which enriches the amount of information in the acquired target images; the multi-level convolution processing of the multiple convolutional layers yields multiple different feature map sets, which enriches the information content of the feature maps; and fusing the feature maps output by the multiple convolutional layers enriches the information contained in the obtained depth image, thereby improving the accuracy of the obtained depth image.
  • the convolution processing module 1202 includes:
  • the convolution processing unit 1221 is configured to perform convolution processing on multiple target images through the first convolution layer in the convolution model to obtain the feature map set output by the first convolution layer, and the feature map set includes multiple The feature map corresponding to the target image;
  • the convolution processing unit 1221 is also used to perform convolution processing on each feature map in the feature map set output by the previous convolution layer through the next convolution layer in the convolution model to obtain the next convolution layer output Until the set of feature maps output by multiple convolutional layers is obtained.
  • the view angle aggregation module 1203 includes:
  • the image determining unit 1231 is configured to use any one of the multiple target images as a reference image, and use other target images among the multiple target images as the first image;
  • the feature map determining unit 1232 is configured to determine the reference feature map corresponding to the reference image and the first feature map corresponding to the first image in the feature map set;
  • the angle of view conversion unit 1233 is configured to convert the angle of view of the first feature map according to the difference between the shooting angles of the first image and the reference image to obtain a converted second feature map;
  • the first fusion processing unit 1234 is configured to perform fusion processing on the reference feature map and the second feature map to obtain aggregated features.
  • the viewing angle conversion unit 1233 is further configured to obtain the first shooting parameter corresponding to the first image and the reference shooting parameter corresponding to the reference image; determine the multiple depth values corresponding to the convolutional layer outputting the feature map set; determine, according to the difference between the first shooting parameter and the reference shooting parameter and the multiple depth values, multiple perspective conversion matrices corresponding to the multiple depth values; and perform perspective conversion on the first feature map according to the multiple perspective conversion matrices, respectively, to obtain multiple converted second feature maps.
  • the viewing angle conversion unit 1233 is further configured to determine the number of depth layers corresponding to the convolutional layers of the output feature map set; divide the preset depth range according to the number of depth layers to obtain multiple depth values.
  • the viewing angle conversion unit 1233 is further configured to perform fusion processing on a first number of reference feature maps to obtain a reference feature volume corresponding to the reference image, where the first number is equal to the number of multiple depth values; for each first image , Perform fusion processing on multiple second feature maps converted from the first feature map corresponding to the first image to obtain the first feature volume, and determine the difference between the first feature volume and the reference feature volume as the second feature volume ; Perform fusion processing on the determined multiple second feature volumes to obtain aggregated features.
  • the viewing angle conversion unit 1233 is further configured to obtain a weight matrix corresponding to the convolutional layer outputting the feature map set, where the weight matrix includes the weight corresponding to each pixel position in the feature maps output by that convolutional layer, and to perform weighted fusion processing on the multiple second feature volumes according to the weight matrix to obtain the aggregated feature.
  • the aggregate feature, the reference feature volume, the first feature volume, the second feature volume, and the weight matrix satisfy the following relationship:
  • V′ i,d,h,w = V i,d,h,w − V 0,d,h,w
  • C d,h,w = ( ∑ i=1…N−1 U h,w ⊙ V′ i,d,h,w ) / (N−1)
  • where i represents the serial number of the first image, i is a positive integer greater than 0 and not greater than N−1; N represents the number of the multiple target images, N is an integer greater than 1; d represents any one of the multiple depth values; h represents the height of the feature maps in the feature map set; w represents the width of the feature maps in the feature map set; V′ i,d,h,w represents the second feature volume, V i,d,h,w represents the first feature volume, V 0,d,h,w represents the reference feature volume, C d,h,w represents the aggregated feature, U h,w represents the weight matrix; and ⊙ denotes element-wise multiplication.
  • the aggregate feature, the reference feature volume, the first feature volume, the second feature volume, and the weight matrix satisfy the following relationship:
  • V′ i,d,h,w = V i,d,h,w − V 0,d,h,w
  • C d,h,w = ( ∑ i=1…N−1 U d,h,w ⊙ V′ i,d,h,w ) / (N−1)
  • where i represents the serial number of the first image, i is a positive integer greater than 0 and less than or equal to N−1; N represents the number of the multiple target images, N is an integer greater than 1; d represents any one of the multiple depth values; h represents the height of the feature maps in the feature map set; w represents the width of the feature maps in the feature map set; V′ i,d,h,w represents the second feature volume, V i,d,h,w represents the first feature volume, V 0,d,h,w represents the reference feature volume, C d,h,w represents the aggregated feature, U d,h,w represents the weight matrix corresponding to the depth value d; and ⊙ denotes element-wise multiplication.
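  • A NumPy sketch of this weighted view aggregation (a sketch under one reading of the relations above, in which the aggregated feature is the average over the N−1 first images of the element-wise product of the pixel-wise weight matrix U h,w and the second feature volumes; array shapes and names are illustrative):

```python
import numpy as np

def pixel_wise_view_aggregation(ref_volume, first_volumes, weight):
    """Aggregate the feature volumes of N-1 first images against the reference.

    ref_volume    : (D, H, W) reference feature volume V_0.
    first_volumes : list of (D, H, W) first feature volumes V_i.
    weight        : (H, W) weight matrix U with one weight per pixel position.
    """
    diffs = [v - ref_volume for v in first_volumes]        # second feature volumes V'_i
    weighted = [weight[None, :, :] * d for d in diffs]     # U ⊙ V'_i, broadcast over depth
    return sum(weighted) / len(first_volumes)              # aggregated feature C

# Toy usage: 1 reference + 3 first images, 8 depth values, 6x6 feature maps.
rng = np.random.default_rng(1)
v0 = rng.standard_normal((8, 6, 6))
vis = [rng.standard_normal((8, 6, 6)) for _ in range(3)]
u = rng.random((6, 6))
print(pixel_wise_view_aggregation(v0, vis, u).shape)       # (8, 6, 6)
```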
  • the scales of the feature maps output by multiple convolutional layers are sequentially reduced; as shown in FIG. 13, the feature fusion module 1204 includes:
  • the aggregation feature determining unit 1241 is configured to use the aggregation feature with the largest scale among the aggregation features as the first aggregation feature, and use the other aggregation features among the aggregation features as the second aggregation feature;
  • the convolution processing unit 1242 is configured to perform multi-level convolution processing on the first aggregation feature to obtain a plurality of third aggregation features, and the scales of the plurality of third aggregation features correspond to the scales of the plurality of second aggregation features in a one-to-one correspondence;
  • the deconvolution processing unit 1243 is configured to perform fusion processing on the second aggregated feature of the first scale and the third aggregated feature of the first scale, and to perform deconvolution processing on the fused feature to obtain the fourth aggregated feature of the second scale, where the first scale is the smallest scale among the multiple second aggregated features and the second scale is the upper-level scale of the first scale;
  • the deconvolution processing unit 1243 is further configured to continue to perform the fusion processing on the currently obtained fourth aggregate feature, the second aggregate feature and the third aggregate feature having the same scale as the fourth aggregate feature, and perform deconvolution on the fused feature Processing to obtain the fourth aggregation feature of the upper level scale until the fourth aggregation feature equal to the first aggregation feature scale is obtained;
  • the second fusion processing unit 1244 is configured to perform fusion processing on the fourth aggregation feature having the same scale as the first aggregation feature and the first aggregation feature to obtain the fifth aggregation feature;
  • the convolution processing unit 1242 is further configured to perform convolution processing on the fifth aggregate feature according to the probability map corresponding to the first aggregate feature to obtain a depth image.
  • the deconvolution processing unit 1243 is further configured to continue to fuse the currently obtained fourth aggregated feature, the second aggregated feature and the third aggregated feature having the same scale as the fourth aggregated feature, and the probability map of that second aggregated feature, and to perform deconvolution processing on the fused feature to obtain the fourth aggregated feature of the upper-level scale.
  • the image acquisition module 1201 includes:
  • the first image acquisition unit 12011 is configured to photograph the target object according to multiple different perspectives to obtain multiple target images; or,
  • the second image acquisition unit 12012 is configured to photograph the target object according to multiple different perspectives to obtain multiple original images
  • the scale adjustment unit 12013 is configured to perform scale adjustment on multiple original images to obtain multiple target images adjusted by the multiple original images, and the multiple target images have the same scale.
  • the scale adjustment unit 12013 is further configured to perform multiple rounds of scale adjustment on multiple original images to obtain multiple sets of target images, each set of target images includes multiple target images of the same scale, and different target image sets The scale of the target image is different;
  • the device also includes: a fusion processing module 1205, which is used to perform fusion processing on depth images corresponding to multiple sets of target image sets to obtain a fused depth image.
  • the fusion processing module 1205 includes:
  • the third fusion processing unit 1251 is configured to start from the depth image of the smallest scale, replace the depth value of the second pixel corresponding to the first pixel in the depth image of the upper-level scale with the depth value of the first pixel in the current depth image that meets the preset condition, until the depth values in the depth image of the largest scale have been replaced, and obtain the largest-scale depth image after depth value replacement.
  • the device includes:
  • the pixel mapping module 1206 is configured, for a first depth image and a second depth image of adjacent scales, to map any second pixel in the second depth image into the first depth image according to the pixel mapping relationship between the first depth image and the second depth image, to obtain the first pixel, where the scale of the second depth image is larger than the scale of the first depth image;
  • the pixel de-mapping module 1207 is used to de-map the first pixel to the second depth image according to the pixel mapping relationship to obtain the third pixel;
  • the first pixel determining module 1208 is configured to determine that the first pixel corresponds to the second pixel in response to the distance between the first pixel and the third pixel being less than the first preset threshold.
  • the first pixel determination module 1208 includes:
  • the pixel determining unit 1281 is configured to determine that the first pixel corresponds to the second pixel in response to that the distance is less than the first preset threshold and the difference value between the depth values corresponding to the first pixel and the third pixel is less than the second preset threshold .
  • the device includes:
  • the second pixel determination module 1209 is configured to determine, in response to the probability corresponding to the depth value of the first pixel being greater than the second preset threshold and the probability corresponding to the depth value of the second pixel being less than the third preset threshold, that the first pixel satisfies the preset condition.
  • the device further includes:
  • the conversion processing module 1210 is used to perform conversion processing on the depth image to obtain point cloud data
  • the aggregation processing module 1211 is used to aggregate the point cloud data to obtain a three-dimensional model of the target object.
  • FIG. 14 is a schematic structural diagram of a terminal provided by an embodiment of the present application, which can implement operations performed by the first terminal, the second terminal, and the third terminal in the foregoing embodiment.
  • the terminal 1400 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a desktop computer, a head-mounted device, a smart TV, a smart speaker, a smart remote control, a smart microphone, or any other smart terminal.
  • the terminal 1400 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.
  • the terminal 1400 includes a processor 1401 and a memory 1402.
  • the processor 1401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the memory 1402 may include one or more computer-readable storage media, which may be non-transitory and used to store at least one instruction, and the at least one instruction is used by the processor 1401 to implement the The depth image generation method provided by the method embodiment.
  • the terminal 1400 may optionally further include: a peripheral device interface 1403 and at least one peripheral device.
  • the processor 1401, the memory 1402, and the peripheral device interface 1403 may be connected by a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1403 through a bus, a signal line, or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 1404, a display screen 1405, and an audio circuit 1406.
  • the radio frequency circuit 1404 is used to receive and transmit RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 1404 communicates with a communication network and other communication devices through electromagnetic signals.
  • the display screen 1405 is used to display UI (User Interface).
  • the UI can include graphics, text, icons, videos, and any combination thereof.
  • the display screen 1405 may be a touch display screen, and may also be used to provide virtual buttons and/or virtual keyboards.
  • the audio circuit 1406 may include a microphone and a speaker.
  • the microphone is used to collect audio signals of the user and the environment, and convert the audio signals into electrical signals and input to the processor 1401 for processing, or input to the radio frequency circuit 1404 to implement voice communication.
  • the microphone can also be an array microphone or an omnidirectional collection microphone.
  • the speaker is used to convert the electrical signal from the processor 1401 or the radio frequency circuit 1404 into an audio signal.
  • The structure shown in FIG. 14 does not constitute a limitation on the terminal 1400, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • FIG. 15 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 1500 may have relatively large differences due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 1501 and one Or more than one memory 1502, where at least one instruction is stored in the memory 1502, and the at least one instruction is loaded and executed by the processor 1501 to implement the methods provided by the foregoing method embodiments.
  • the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing device functions, which will not be repeated here.
  • the server 1500 may be used to execute the above-mentioned depth image generation method.
  • An embodiment of the present application also provides a computer device that includes a processor and a memory, where at least one piece of program code is stored in the memory, and the at least one piece of program code is loaded and executed by the processor to implement the depth image generation method of the above-mentioned embodiments.
  • The embodiment of the present application also provides a computer-readable storage medium in which at least one piece of program code is stored, and the at least one piece of program code is loaded and executed by a processor to implement the depth image generation method of the above-mentioned embodiments.
  • An embodiment of the present application also provides a computer program, in which at least one piece of program code is stored, and the at least one piece of program code is loaded and executed by a processor to implement the depth image generation method of the foregoing embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Processing (AREA)

Abstract

A depth image generation method and apparatus, and a storage medium, belonging to the field of computer technology. The method includes: acquiring multiple target images (201); performing multi-level convolution processing on the multiple target images through multiple convolutional layers in a convolution model to obtain feature map sets respectively output by the multiple convolutional layers (202); separately performing perspective aggregation on the multiple feature maps in each feature map set to obtain an aggregated feature corresponding to each feature map set (203); and fusing the obtained multiple aggregated features to obtain a depth image (204). The multiple target images are obtained by photographing a target object from different perspectives, so that they contain information about the target object from different angles, which enriches the amount of information in the acquired target images; moreover, the multi-level convolution processing of the multiple convolutional layers yields multiple different feature map sets, which enriches the information content of the feature maps, thereby improving the accuracy of the obtained depth image.

Description

深度图像生成方法、装置及存储介质
本申请要求于2020年2月26日提交的申请号为2020101197135、发明名称为“深度图像生成方法、装置及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及计算机技术领域,特别涉及一种深度图像生成方法、装置及存储介质。
背景技术
随着计算机技术的发展,三维模型的应用越来越广泛。三维模型可以应用于多种场景下,如建筑物的三维模型构建场景、人体的三维模型构建场景等。在生成物体的三维模型时,需要先生成物体的深度图像,因此如何生成深度图像成为亟待解决的问题。
发明内容
本申请实施例提供了一种深度图像生成方法、装置及存储介质,能够提高深度图像的准确性。所述技术方案如下:
一方面,提供了一种深度图像生成方法,所述方法包括:
获取多张目标图像,所述多张目标图像是按照不同视角拍摄目标物体分别得到的;
通过卷积模型中的多个卷积层,对所述多张目标图像进行多级卷积处理,得到所述多个卷积层分别输出的特征图集合,每个特征图集合包括所述多张目标图像对应的特征图;
分别将所述每个特征图集合中的多个特征图进行视角聚合,得到所述每个特征图集合对应的聚合特征;
将得到的多个聚合特征进行融合处理,得到深度图像。
另一方面,提供了一种深度图像生成装置,所述装置包括:
图像获取模块,用于获取多张目标图像,所述多张目标图像是按照不同视角拍摄目标物体分别得到的;
卷积处理模块,用于通过卷积模型中的多个卷积层,对所述多张目标图像进行多级卷积处理,得到所述多个卷积层分别输出的特征图集合,每个特征图集合包括所述多张目标图像对应的特征图;
视角聚合模块,用于分别将所述每个特征图集合中的多个特征图进行视角聚合,得到所述每个特征图集合对应的聚合特征;
特征融合模块,用于将得到的多个聚合特征进行融合处理,得到深度图像。
另一方面,提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条程序代码,所述至少一条程序代码由所述处理器加载并执行,以实现如上述方面所述的深度图像生成方法。
另一方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行,以实现如上述方面所述的深度图像生成方法。
本申请实施例提供的方法、装置及存储介质,获取多张目标图像,该多张目标图像是按照不同视角拍摄目标物体分别得到的,通过卷积模型中的多个卷积层,对多张目标图像进行多级卷积处理,得到多个卷积层分别输出的特征图集合,分别将每个特征图集合中的多个特 征图进行视角聚合,得到每个特征图集合对应的聚合特征,将得到的多个聚合特征进行融合处理,得到深度图像。获取的多张目标图像是按照不同视角拍摄目标物体分别得到的,使得到的多张目标图像中包括目标物体不同角度的信息,丰富了获取到的目标图像的信息量,且通过多个卷积层的多级卷积处理,得到多个不同的特征图集合,丰富了特征图的信息量,将多个卷积层输出的特征图进行融合处理,丰富了得到的深度图像中包含的信息量,从而提高了得到的深度图像的准确性。
并且,通过多张目标图像之间的拍摄视角差异,对每个特征图集合中的多个特征图进行视角聚合,以使后续能够将属于相同视角的特征图进行融合处理,提高了得到的聚合特征的准确性,从而提高了得到的深度图像的准确性。
并且,在将多个卷积层输出的特征图进行融合处理的过程中,将每个卷积层对应的聚合特征进行融合时,将每个聚合特征对应的概率图进行融合处理,使得多个聚合特征进行融合时考虑到了概率对各个像素位置的影响,提高了得到的第四聚合特征的准确性,从而提高了得到的深度图像的准确性。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请实施例的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种实施环境的结构示意图;
图2是本申请实施例提供的一种深度图像生成方法的流程图;
图3是本申请实施例提供的一种深度图像生成方法的流程图;
图4是本申请实施例提供的一种第二特征卷进行融合处理的流程图;
图5是本申请实施例提供的一种第二特征卷进行融合处理的流程图;
图6是本申请实施例提供的一种深度图像生成方法的流程图;
图7是本申请实施例提供的一种深度图像生成模型的结构示意图;
图8是本申请实施例提供的一种深度图像生成方法的流程图;
图9是本申请实施例提供的一种深度图像融合的流程图;
图10是本申请实施例提供的一种生成三维模型的流程图;
图11是本申请实施例提供的一种生成三维模型的流程图;
图12是本申请实施例提供的一种深度图像生成装置的结构示意图;
图13是本申请实施例提供的一种深度图像生成装置的结构示意图;
图14是本申请实施例提供的一种终端的结构示意图;
图15是本申请实施例提供的一种服务器的结构示意图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请所使用的术语“第一”、“第二”等可在本文中用于描述各种概念,但除非特别说 明,这些概念不受这些术语限制。这些术语仅用于将一个概念与另一个概念区分。举例来说,在不脱离本申请的范围的情况下,可以将第一人特征图称为第二特征图,且类似地,可将第二特征图称为第一特征图。
本申请所使用的术语“多个”、“每个”、“任一”,多个包括两个或两个以上,而每个是指对应的多个中的每一个,任一是指多个中的任意一个。举例来说,多个元素包括3个元素,而每个是指这3个元素中的每一个元素,任一是指这3个元素中的任意一个,可以是第一个,可以是第二个、也可以是第三个。
本申请实施例提供的深度图像生成方法,可以用于计算机设备中。在一种可能实现方式中,该计算机设备为终端,该终端为手机、计算机、平板电脑等多种类型的终端。终端通过摄像机拍摄目标物体,获取多张目标图像,通过卷积模型中的多个卷积层,对多张目标图像进行多级卷积处理,得到该多个卷积层分别输出的特征图集合,分别将每个特征图集合中的多个特征图进行视角聚合,得到每个特征图集合对应的聚合特征,将得到的多个聚合特征进行融合处理,得到深度图像。
在另一种可能实现方式中,该计算机设备包括服务器和终端。图1是本申请实施例提供的一种实施环境的结构示意图,如图1所示,该实施环境包括终端101和服务器102。终端101与服务器102建立通信连接,通过建立的通信连接进行交互。其中,该终端101为手机、计算机、平板电脑等多种类型的终端101。服务器102为一台服务器,或者由若干台服务器组成的服务器集群,或者是一个云计算服务器中心。终端101通过摄像机拍摄目标物体,获取到多张目标图像,将获取到的多张目标图像发送至服务器102,该服务器102通过卷积模型中的多个卷积层,对多张目标图像进行多级卷积处理,得到该多个卷积层分别输出的特征图集合,分别将每个特征图集合中的多个特征图进行视角聚合,得到每个特征图集合对应的聚合特征,将得到的多个聚合特征进行融合处理,得到深度图像。后续该服务器102还能够将该深度图像发送给该终端101。
本申请实施例提供的方法,可用于构建三维模型的各种场景下。
例如,建筑物测绘的场景下:
用户通过终端的摄像头拍摄建筑物,则终端采用本申请实施例提供的深度图像生成方法,按照不同视角拍摄建筑物,对获取到的多张目标图像进行处理,得到深度图像,通过对该深度图像进行处理,得到该建筑物的三维模型,以使后续能够对建筑物三维模型进行测绘。
再例如,室内布置场景下:
用户通过终端的摄像头拍摄房屋室内,终端采用本申请实施例提供的深度图像生成方法,按照不同视角拍摄房屋室内,对获取到的多张目标图像进行处理,得到深度图像,通过对该深度图像进行处理,得到该房屋室内的三维模型,用户能够在该房屋的三维模型中模拟家居布置,以实现家居设计的动态展示,使用户能够直观的查看家居设计展现的状态。
图2是本申请实施例提供的一种深度图像生成方法的流程图,应用于计算机设备中,如图2所示,该方法包括:
201、计算机设备获取多张目标图像。
其中,该多张目标图像是按照不同视角拍摄目标物体分别得到的。
在本申请实施例中,通过不同视角拍摄同一个目标物体,则获取到的多张目标图像中目标物体的显示状态不同,因此可以根据多张目标图像之间的差异,对多张目标图像进行处理,可以生成目标物体的深度图像,以便后续通过深度图像获取到目标物体的三维模型。
其中,该视角是由摄像头的拍摄参数及摄像头与目标物体之间的相对位置共同决定的。该拍摄参数可以包括焦距、像素等。例如,保持摄像头与物体之间的相对位置不变,通过不同的拍摄参数对物体进行拍摄,可以得到不同视角的图像;或者,保持拍摄参数不变,通过不同的相对位置对物体进行拍摄,也可以得到不同视角的图像;或者,通过不同的相对位置及不同的拍摄参数对物体进行拍摄,也可以得到不同视角的图像。
按照不同视角拍摄得到的多个图像中,目标物体的显示状态不同,该显示状态可以包括显示目标物体的大小、目标物体在图像中的显示位置及显示目标物体的角度。例如,按照不同视角拍摄目标物体得到三张图像,在第一张图像中的左上角区域显示该目标物体,且显示目标物体的左侧面图像;第二张图像中的中间区域显示该目标物体,且显示目标物体的正面图像;第三张图像中的右下角区域显示该目标物体,且显示该目标物体的右侧面图像;并且,在三张图像中显示的目标物体的尺寸依次减小。
在该多张目标图像中,不同的目标图像对应的视角不同,因此,在不同的目标图像中包括同一个目标物体,该目标物体的显示状态不同。该多张目标图像可以是按照不同视角拍摄目标物体直接得到的,也可以是对拍摄得到的图像进行处理后得到的。
对于该多张目标图像的获取方式,在一种可能实现方式中,按照多个不同的视角拍摄目标物体,得到多张目标图像;或者,按照多个不同的视角拍摄目标物体,得到多张原始图像,对多张原始图像进行尺度调整,得到多张原始图像调整后的多张目标图像,多张目标图像的尺度相等。
其中,对多张原始图像进行尺度调整可以为:将多张原始图像进行缩小,得到更小尺度的多张目标图像;或者多张原始图像进行放大,得到更大尺度的多张目标图像。由于通过拍摄得到的多张原始图像的尺度相等,则在对多张原始图像进行尺度调整后,得到的多张目标图像的尺度也相等。
202、计算机设备通过卷积模型中的多个卷积层,对多张目标图像进行多级卷积处理,得到多个卷积层分别输出的特征图集合。
其中,卷积模型用于获取图像的特征图,该卷积模型为二维卷积网络模型,可以为VGG(Visual Geometry Group Network,视觉几何组网络模型)、Restnet(一种卷积网络模型)等。该卷积模型中包括多个卷积层,每个卷积层均用于对输入的图像进行卷积处理,输出该图像的特征图。
特征图集合包括多张目标图像对应的特征图,特征图用于表示对应的目标图像中包括的特征,如颜色特征、纹理特征、形状特征或空间特征等。计算机设备通过多个卷积层对该多张目标图像进行多级卷积处理时,每个卷积层均可以输出特征图集合,每个特征图集合中包括的特征图与该多张目标图像一一对应,每个特征图集合中包括的特征图的个数,均与该多张目标图像的个数相等。由于多个卷积层均对该多张目标图像进行卷积处理,因此可以得到多个特征图集合,该多个特征图集合的个数与该多个卷积层的个数相等。
在多个卷积层中,对于同一张目标图像,不同的卷积层输出的特征图不同,则得到的特 征图集合也不同。在卷积模型中,该多个卷积层是按照预设顺序排列的,除第一个卷积层外,其他的卷积层均是将上一个卷积层的输出作为当前卷积层的输入,相应地,在一种可能实现方式中,该步骤202可以包括:
通过卷积模型中的第一个卷积层,对多张目标图像进行卷积处理,得到第一个卷积层输出的特征图集合,通过卷积模型中的下一个卷积层,对上一个卷积层输出的特征图集合中的每个特征图进行卷积处理,得到下一个卷积层输出的特征图集合,直至得到多个卷积层分别输出的特征图集合。
例如,该卷积模型包括4个卷积层,将该多张目标图像输入至该卷积模型中的第一个卷积层,通过该第一个卷积层,对多张目标图像进行卷积处理,得到第一个卷积层输出的第一特征图集合,第一特征图集合包括多张目标图像对应的第一特征图;将第一特征图集合输入至第二个卷积层,通过该第二个卷积层,对第一特征图集合中的每个第一特征图进行卷积处理,得到第二个卷积层输出的第二特征图集合,第二特征图集合包括多张目标图像对应的第二特征图;将第二特征图集合输入至第三个卷积层,通过该第三个卷积层,对第二特征图集合中的每个第二特征图进行卷积处理,得到第三个卷积层输出的第三特征图集合,第三特征图集合包括多张目标图像对应的第三特征图;将第三特征图集合输入至第四个卷积层,通过该第四个卷积层,对第三特征图集合中的每个第三特征图进行卷积处理,得到第四个卷积层输出的第四特征图集合,第四特征图集合包括多张目标图像对应的第四特征图,从而得到4个卷积层分别输出的特征图集合。
另外,对于多个卷积层分别输出的特征图集合,每个特征图可以表示为
Figure PCTCN2020127891-appb-000001
其中,i表示目标图像的序号,i为大于0、且不大于N的整数;N表示多张目标图像的个数,N为大于1的整数;l为多个卷积层中任一个卷积层,该l为大于0、且不大于L的整数;L表示多个卷积层的个数,L为大于1的整数。
203、计算机设备分别将每个特征图集合中的多个特征图进行视角聚合,得到每个特征图集合对应的聚合特征。
在本申请实施例中,每个特征图集合中包括多个特征图,该多个特征图与该多张目标图像一一对应。由于不同的目标图像对应的视角不同,因此,通过对多个特征图进行视角聚合,将多个特征图转换成相同的视角,而后将具有相同的视角的多个特征图进行聚合,从而得到聚合特征,可以消除不同目标图像之间的视角差异。其中,在获取每个特征图集合对应的聚合特征时,采用Self-adaptive View Aggregation(自适应视角聚合)的方式,将多个特征图转换为相同视角下的特征图后进行融合。
在一种可能实现方式中,如图3所示,该步骤203可以包括以下步骤2031-2034:
2031、将多张目标图像中的任一张目标图像作为参考图像,将多张目标图像中的其他目标图像作为第一图像。
在本申请实施例中,参考图像可以为多张目标图像中的任一张图像。在获取多个特征图集合对应的聚合特征时,对于该多个特征图集合,将同一张目标图像作为参考图像,以保证获取到的多个特征图集合对应的聚合特征的一致性,从而提高了后续得到的深度图像的准确性。
其中,第一图像可以包括一个或者多个,如,目标图像的个数为2时,则第一图像的个数为一个;目标图像的个数为5时,则第一图像的个数为4个。
2032、确定特征图集合中,参考图像对应的参考特征图及第一图像对应的第一特征图。
由于该特征图集合中的多个特征图与多张目标图像一一对应,则可以从该多个特征图中确定出参考图像对应的参考特征图,及第一图像对应的第一特征图。
2033、按照第一图像与参考图像的拍摄视角的差异,将第一特征图进行视角转换,得到第二特征图。
其中,第二特征图对应的图像的视角与参考图像的视角相同。
由于该多张目标图像对应的视角不同,为了便于后续将多个特征图进行融合,需要按照第一图像与参考图像的拍摄视角的差异,对第一特征图进行转换,使转换后的特征图对应的视角与参考图像的视角相同,以消除图像拍摄视角的差异。
在一种可能实现方式中,该特征图集合中包括多个第一特征图,则对于该多个第一特征图中的任一第一特征图,按照该第一特征图对应的第一图像与参考图像的拍摄视角的差异,将该第一特征图进行视角转换,得到转换后的第二特征图。相应地,对于其他的第一特征图也可以采用类似的方式进行视角转换,因此可以得到多个第一特征图对应的第二特征图。
在一种可能实现方式中,该步骤2033可以包括以下步骤1-4:
步骤1:获取第一图像对应的第一拍摄参数及参考图像对应的参考拍摄参数。
其中,拍摄参数可以包括焦距、像素等。由于不同的目标图像对应的视角不同,视角是由摄像头的拍摄参数及摄像头与目标物体之间的相对位置共同决定的,因此,获取第一图像及参考图像对应的拍摄参数,以便后续通过拍摄参数对特征图像进行视角转换。
另外,拍摄参数可以是在拍摄目标物体时获取到的。例如,用户通过手机拍摄目标物体,手机传感器会记录下拍摄目标物体的拍摄参数,则获取到多张目标图像及每张目标图像对应的拍摄参数。
步骤2:确定输出特征图集合的卷积层对应的多个深度值。
其中,深度值用于表示拍摄目标物体时摄像头与目标物体之间的距离,例如,该多个深度值可以为0.1米、0.2米、0.3米和0.4米等。卷积层对应的多个深度值可以是预先设置的,也可以是根据深度范围及预设的深度值个数确定的,在该卷积模型中的多个卷积层中,不同卷积层对应的多个深度值不同。例如,在多个卷积层中,第一个卷积层对应的多个深度值为0.1米、0.2米、0.3米;第二个卷积层对应的多个深度值为0.1米、0.3米、0.5米。
对于确定多个深度值的方式,在一种可能实现方式中,确定输出该特征图集合的卷积层对应的深度层数,按照深度层数将预设深度范围进行划分,得到多个深度值。其中,深度层数可以由开发人员预先设置的,该深度层数可以是任意数值,如100、80等。预设深度范围用于表示拍摄得到多张目标图像时目标物体与摄像头之间的距离所属的范围,可以是预先设置的,也可以是根据多张目标图像进行预测得到的。如,该预设深度范围为(0,1)米,或者(1,2)米等。
通过深度层数及预设深度范围,对该预设深度范围进行划分,从预设深度范围中提取多个深度值。可选地,在该多个深度值中,任两个相邻的深度值之间差值相等,该多个深度值的个数与深度层数对应的数值相等。
对于确定深度层数的方式,可选地,多个卷积层按照预设顺序排列,确定输出该特征图集合的卷积层的排列顺序L,则该排列顺序L及深度层数D L满足以下关系:
Figure PCTCN2020127891-appb-000002
其中,D L表示多个卷积层中排列顺序L的卷积层的深度层数。
对于将预设深度范围进行划分的方式,可选地,确定预设深度范围中最大深度值和最小深度值,将最大深度值与最小深度值之间的差值作为深度跨度,将深度层数减去1之后的数值作为第一数值,将深度跨度与该第一数值之间的比值作为深度间隔,在该预设深度范围中,从最小深度值开始,每相隔一个深度间隔确定一个深度值,则得到与深度层数相等个数的多个深度值。例如,预设深度范围为[1,9]米,深度层数为5,最大深度值为9米,最小深度值为1米,则深度跨度为8,第一数值为4,通过该深度跨度和第一数值,确定深度间隔为2,从最小深度值1开始,每相隔一个深度间隔2,确定一个深度值,则将该预设深度范围中的1、3、5、7、9均确定为深度值。
步骤3:根据第一拍摄参数与第二拍摄参数之间的差异,及多个深度值,确定与多个深度值对应的多个视角转换矩阵。
其中,视角转换矩阵用于对图像进行视角变换,可以将不同角度拍摄的图像都转换成同样的视角。该视角转换矩阵可以为HomographyMatrix(单应矩阵),或者其他矩阵等。由于视角转换矩阵是通过两个图像的拍摄参数及深度值确定的,则根据第一拍摄参数、第二拍摄参数及多个深度值,可以确定多个视角转换矩阵,在该多个视角转换矩阵中,每个视角转换矩阵与一个深度值对应。
步骤4:根据多个视角转换矩阵,分别对第一特征图进行视角转换,得到转换后的多个第二特征图。
其中,第二特征图对应的视角与参考图像的视角相同。对于该第一特征图,利用每一个视角转换矩阵进行视角转换,则可以得到转换后的多个第二特征图。
在一种可能实现方式中,该特征图集合中包括多个第一特征图,对于每个第一特征图,确定每个第一特征图对应的多个视角转换矩阵,根据每个第一特征图对应的多个视角转换矩阵,分别对每个第一特征图进行视角转换,得到每个第一特征图更换后的多个第二特征图。
由于不同的目标图像的视角不同,视角是由摄像头的拍摄参数及摄像头与目标物体之间的相对位置共同决定的,且不同的第一特征图对应的多个深度值均为卷积层对应的多个深度值,则不同的第一特征图对应的第一拍摄参数不同,因此,不同的第一特征图对应的视角转换矩阵不同。通过获取每个第一特征图对应的多个视角转换矩阵,从而可以获取到每个第一特征图转换后的多个第二特征图。
例如,该特征图集合中包括3个第一特征图,输出该特征图集合的卷积层具有20个深度值,则可以为每个第一特征图确定20个视角转换矩阵,通过获取每个第一特征图对应的20视角转换矩阵,获取到每个第一特征图对应的转换后的20个第二特征图,因此,通过对3个第一特征图分别进行视角转换,可以得到60个第二特征图。
另外,本申请实施例中在对第一特征图进行视角转换时,可以通过Coarse-To-fine Depth Estimator(由稀疏到稠密的深度预测器)对第一特征图进行处理,该Coarse-To-fine Depth Estimator输出多个第二特征图。
2034、将参考特征图与第二特征图进行融合处理,得到聚合特征。
其中,该聚合特征用于表示多张目标图像对应的特征图集合的多维特征,如参考特征图 与第二特征图均为一维的特征图,则将参考特征图与第二特征图进行融合得到二维特征图。由于得到的第二特征图对应的拍摄视角,与参考特征图对应的拍摄视角相同,则可以直接将参考特征图与第二特征图进行融合处理,从而得到聚合特征。
在一种可能实现方式中,第一图像包括多个,步骤2034可以包括以下步骤5-7:
步骤5:将第一数量的参考特征图进行融合处理,得到参考图像对应的参考特征卷。
其中,第一数量等于多个深度值的数量,参考特征卷用于表示参考图像对应的多维特征。
在本申请实施例中,为输出特征图集合的卷积层确定了多个深度值,则在对该特征图集合中的每个第一图像进行视角转换时,可以获取到每个第一图像对应的转换后的多个第二特征图,为了保证参考特征与每个第一特征图对应的多个第二特征图在数量上的一致性,便于后续对参考特征和第二特征进行融合处理,因此需要将第一数量的参考特征进行融合,得到参考特征卷。
对于融合处理的方式,在一种可能实现方式中,将第一数量的参考特征图进行堆叠,得到该参考特征卷。由于每个参考特征图属于一维的特征图,将第一数量的参考特征图进行堆叠,得到多维的参考特征卷。
步骤6:对于每个第一图像,将第一图像对应的第一特征图转换后的多个第二特征图进行融合处理,得到第一特征卷,将该第一特征卷与该参考特征卷之间的差值,确定为第二特征卷。
其中,第一特征卷用于表示第一图像对应的多维特征,第二特征卷用于表示第一图像与参考图像之间的差异对应的多维特征
对于任一第一图像,将该第一图像对应的第一特征图转换后的多个第二特征图进行融合处理,得到该第一图像对应的第一特征卷,相应地,对其他的第一图像对应的第一特征图转换后的多个第二特征图进行融合处理,从而得到多个第一图像中对应的第一特征卷。在该多个第一特征卷中,不同的第一图像对应的第一特征卷不同。
由于第一特征卷与参考特征卷均属于相同维度的多维特征,因此可以直接确定每个第一特征卷与参考特征卷之间的差值,从而得到多个第二特征卷。在该多个第二特征卷中,不同的第一图像对应的第二特征卷不同。
对于融合处理的方式,在一种可能实现方式中,对于任一第一图像,将该第一图像对应的多个第二特征图进行堆叠,得到该第一图像的第一特征卷。由于每个第二特征图属于一维的特征图,将多个第二特征图进行堆叠,得到多维的第二特征卷。
步骤7:将确定的多个第二特征卷进行融合处理,得到聚合特征。
其中,聚合特征用于表示多张目标图像对应的多维特征,该聚合特征为输出该特征图集合的卷积层对应的聚合特征。通过将多个第二特征卷进行融合处理,使得到的聚合特征消除了多张目标图像之间的视角的差异,融合了不同视角所拍摄到的物体,丰富了多个视角的物体的特征,从而构成了能全面表现物体的聚合特征。
对于多个第二特征卷进行融合处理的方式,在一种可能实现方式中,该步骤7可以包括:获取输出特征图集合的卷积层对应的权重矩阵,按照该权重矩阵,将多个第二特征卷进行加权融合处理,得到聚合特征。
其中,权重矩阵中包括卷积层输出的特征图中每个像素位置对应的权重。通过该权重矩阵,确定每个第二特征卷与该权重矩阵之间的乘积,将多个第二特征卷对应的乘积之和,与 该多个第二特征卷的个数之间的比值,作为该聚合特征,使得在将多个第二特征卷进行融合处理时,融入了权重的影响,从而提高了得到的聚合特征的准确性。
该权重矩阵可以通过WeightNet(权重矩阵获取模型)训练得到,该WeightNet可以由多个卷积层和一个ResNet(Residual Network,深度残差网络)块组成。获取多个第二特征卷V′ i,d,h,w中的最大尺度的第二特征卷max_pooling(||V′ i,d,h,w|| 1),及多个第二特征卷V′ i,d,h,w的平均特征卷avg_pooling(||V′ i,d,h,w|| 1),将该最大尺度的第二特征卷max_pooling(||V′ i,d,h,w|| 1)与该平均特征卷avg_pooling(||V′ i,d,h,w|| 1)进行连接,得到连接数组f h,w,通过该WeightNet对连接数据进行卷积处理,得到该权重矩阵U h,w,则最大尺度的第二特征卷max_pooling(||V′ i,d,h,w|| 1)、该平均特征卷avg_pooling(||V′ i,d,h,w|| 1)、连接数组f h,w及权重矩阵U h,w满足以下关系:
U h,w=WeightNet(f h,w)
f h,w=CONCAT[max_pooling(||V′ i,d,h,w|| 1),avg_pooling(||V′ i,d,h,w|| 1)]
其中,i表示多个第一图像中的任一第一图像,i为大于0、且小于等于N-1的正整数;N表示多张目标图像的个数,N为大于等于2的正整数;d表示多个深度值中的任一深度值,h表示特征图集合中的特征图的高度;w表示特征图集合中的特征图的宽度。
根据上述步骤5-7中的内容,可以采用Pixel-Wise View Aggregation(像素级视角聚合)的方式,将参考特征图与第二特征图进行融合处理,即在一种可能实现方式中,聚合特征、参考特征卷、第一特征卷、第二特征卷及权重矩阵,满足以下关系:
V′ i,d,h,w=V i,d,h,w-V 0,d,h,w
Figure PCTCN2020127891-appb-000003
其中,i表示第一图像的序号,i为大于0、且不大于N-1的正整数;N表示多张目标图像的个数,N为大于1的整数;d表示多个深度值中的任一深度值,h表示特征图集合中的特征图的高度,w表示特征图集合中的特征图的宽度;V′ i,d,h,w表示第二特征卷,V i,d,h,w表示第一特征卷,V 0,d,h,w表示参考特征卷,C d,h,w表示聚合特征,U h,w表示权重矩阵;⊙用于表示元素级乘法。
如图4所示,在获取到多个第二特征卷401后,确定最大尺度的第二特征卷402,及多个第二特征卷401的平均特征卷403,通过权重矩阵获取模型404,获取权重矩阵405,根据该权重矩阵405,对多个第二特征卷401进行卷积处理,得到聚合特征406。
根据上述步骤5-7中的内容,可以采用Voxel-Wise View Aggregation(体素级视角聚合)的方式,将参考特征图与第二特征图进行融合处理,即在一种可能实现方式中,聚合特征、参考特征卷、第一特征卷、第二特征卷及权重矩阵,满足以下关系:
V′ i,d,h,w=V i,d,h,w-V 0,d,h,w
Figure PCTCN2020127891-appb-000004
其中,i表示第一图像的序号,i为大于0、且小于等于N-1的正整数;N表示多张目标图像的个数,N为大于1的整数;d表示多个深度值中的任一深度值,h表示特征图集合中的特征图的高度;w表示特征图集合中的特征图的宽度;V′ i,d,h,w表示第二特征卷,V i,d,h,w表示第一特征卷,V 0,d,h,w表示参考特征卷,C d,h,w表示聚合特征,U d,h,w表示与深度值d对应的权重 矩阵;⊙用于表示元素级乘法。
如图5所示,在获取到多个第二特征卷501后,将该多个第二特征卷501,输入至与深度值d对应的权重矩阵获取模型502,得到权重矩阵503,根据该权重矩阵503,对多个第二特征卷501进行卷积处理,得到聚合特征504。
需要说明的是,本申请实施例是以在获取到多个卷积层分别输出的特征图集合后,直接将每个特征图集合中的多个特征图进行视角聚合进行说明的,而在另一实施例中,在执行步骤203之前,需要对获取到的多个卷积层分别输出的特征图集合中的特征图进行采样,使每个特征图的维度为一维,以便后续将每个特征图集合中的特征图进行融合。
204、计算机设备将得到的多个聚合特征进行融合处理,得到深度图像。
其中,深度图像中包括目标物体的深度值。由于每个卷积层输出的特征图不同,不同的特征图包含的信息量不同,则通过多个卷积层,得到的多个聚合特征中不同的聚合特征包含不同的信息,因此,将多个聚合特征进行融合处理,丰富了特征图的信息量,从而提高了得到的深度图像的准确性。
由于每个聚合特征中包括多维特征,在对多个聚合特征进行融合处理时,将每个聚合特征的多维特征进行融合,可以得到深度图像。
205、计算机设备对深度图像进行转化处理,得到点云数据。
其中,点云数据为由三维坐标系下的多个点构成的数据。对深度图像进行转化处理时,根据深度图像中任一像素对应的深度值,在三维坐标系中创建一个点,则通过深度图像中多个像素的深度值,可以得到多个点,从而构成点云数据。
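将深度图像转化为点云数据的过程可以用如下示意代码表示（假设相机内参已知，且暂不考虑外参对应的坐标系变换，函数名与参数均为示意性假设）：

```python
import numpy as np


def depth_to_point_cloud(depth, intrinsics):
    """depth: [H, W] 的深度图像；intrinsics: 3x3 相机内参矩阵。
    对每个像素 (u, v)，按 P = depth(u, v) * K^{-1} @ [u, v, 1]^T 反投影为三维坐标系下的点。"""
    height, width = depth.shape
    u, v = np.meshgrid(np.arange(width), np.arange(height))                 # 像素网格
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T    # [3, H*W]
    rays = np.linalg.inv(intrinsics) @ pixels                               # 每个像素对应的归一化射线
    points = rays * depth.reshape(1, -1)                                    # 射线乘以深度值得到三维点
    return points.T                                                         # [H*W, 3] 的点云数据
```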
206、计算机设备对点云数据进行聚合处理,得到目标物体的三维模型。
由于点云数据中的多个点是处于离散状态的,通过对点云数据进行聚合处理,将点云数据中的多个点进行连接,从而得到该目标物体的三维模型。
在一种可能实现方式中,该步骤206可以包括:对点云数据进行过滤处理,得到过滤后的点云数据,对过滤后的点云数据进行聚合处理,得到目标物体的三维模型。
由于生成的点云数据中会存在噪声的影响,通过对点云数据中的噪声进行过滤处理,提高了过滤后的点云数据的准确性,从而提高了得到的三维模型的准确性。
需要说明的是,本申请实施例以使生成三维模型进行说明的,而在另一实施例中,无需执行步骤205-206,得到深度图像即可。
相关技术中提供了一种深度图像生成方法,通过卷积模型中的多个卷积层,对拍摄的物体图像进行多级卷积处理,得到最后一个卷积层输出的特征图,对该特征图进行卷积处理,得到物体的深度图像。由于上述方法在获取深度图像的过程中,仅是使用了最后一个卷积层输出的特征图,该特征图的信息量较少,导致深度图像的准确性差。
本申请实施例提供的方法,获取多张目标图像,该多张目标图像是按照不同视角拍摄目标物体分别得到的,通过卷积模型中的多个卷积层,对多张目标图像进行多级卷积处理,得到多个卷积层分别输出的特征图集合,分别将每个特征图集合中的多个特征图进行视角聚合,得到每个特征图集合对应的聚合特征,将得到的多个聚合特征进行融合处理,得到深度图像。获取的多张目标图像是按照不同视角拍摄目标物体分别得到的,使得到的多张目标图像中包括目标物体不同角度的信息,丰富了获取到的目标图像的信息量,且通过多个卷积层的多级卷积处理,得到多个不同的特征图集合,丰富了特征图的信息量,将多个卷积层输出的特征 图进行融合处理,丰富了得到的深度图像中包含的信息量,从而提高了得到的深度图像的准确性。
并且,通过多张目标图像之间的拍摄视角差异,对每个特征图集合中的多个特征图进行视角聚合,以使后续能够将属于相同视角的特征图进行融合处理,提高了得到的聚合特征的准确性,从而提高了得到的深度图像的准确性。
并且,在将多个卷积层输出的特征图进行融合处理的过程中,将每个卷积层对应的聚合特征进行融合时,将每个聚合特征对应的概率图进行融合处理,使得多个聚合特征进行融合时考虑到了概率对各个像素位置的影响,提高了得到的第四聚合特征的准确性,从而提高了得到的深度图像的准确性。
在上述实施例的基础上,在一种可能实现方式中,参见图6,上述步骤204可以包括以下步骤2041-2046:
2041、计算机设备将多个聚合特征中最大尺度的聚合特征作为第一聚合特征,将多个聚合特征中其他的多个聚合特征作为第二聚合特征。
在本申请实施例中,该卷积模型中的多个卷积层输出的特征图的尺度依次减小,由于聚合特征是由特征图融合处理得到的,则多个卷积层对应的聚合特征的尺度依次减小,因此,通过该多个卷积层可以获取到多个尺度的聚合特征。
其中,特征图的尺度包括特征图的高度和特征图的宽度,尺度越大,高度和宽度越大;尺度越小,高度和宽度越小。由于每个特征图的维度为1,将多个特征图融合处理后得到的聚合特征为多维特征,该聚合特征的尺度包括特征图的高度、特征图的宽度及维度数,该维度数与该聚合特征对应的特征图集合中的特征图的个数相等。由于在多个卷积层中,多个卷积层输出的特征图的尺度依次减小,则多个卷积层对应的多个聚合特征的尺度依次减小。
2042、计算机设备将第一聚合特征进行多级卷积处理,得到多个第三聚合特征。
其中,多个第三聚合特征的尺度与多个第二聚合特征的尺度一一对应。通过将第一聚合特征进行多次卷积处理,使得第一聚合特征的尺度缩小,得到多个第三聚合特征。
在一种可能实现方式中,通过多个卷积层对第一聚合特征进行多级卷积处理,通过第一个卷积层对第一聚合特征进行卷积处理,得到第一个第三聚合特征,通过下一个卷积层对上一个卷积层输出的第三聚合特征进行卷积处理,得到下一个卷积层输出的第三聚合特征,直至最后一个卷积层输出最后一个第三聚合特征。
2043、计算机设备将第一尺度的第二聚合特征与第一尺度的第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到第二尺度的第四聚合特征。
其中,第一尺度为多个第二聚合特征的最小尺度,第二尺度为第一尺度的上一级尺度。
由于第二聚合特征的尺度与第三聚合特征的尺度相等,则将第一尺度的第二聚合特征与第一尺度的第三聚合特征进行融合处理,得到的融合后的特征的尺度为第一尺度,将融合后的特征进行反卷积处理,使得融合后的特征的尺度增大,从而得到第二尺度的第四聚合特征。
2044、计算机设备继续将当前得到的第四聚合特征、与第四聚合特征尺度相等的第二聚合特征和第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到上一级尺度的第四聚合特征,直至得到与第一聚合特征尺度相等的第四聚合特征。其中,融合后的特征与当前得到的第四聚合特征的尺度相等。
在多个第三聚合特征中,除第一尺度的第三聚合特征外,还包括多个第三聚合特征时,按照尺度由小到大的顺序,则多次执行步骤2044,多次执行步骤2044后得到的第四聚合特征的尺度依次增大,从而能够得到最大尺度的第四聚合特征,也即是得到与第一聚合特征尺度相等的第四聚合特征。
例如,多个第三聚合特征的个数为4,通过第一尺度的第二聚合特征和第一尺度的第三聚合特征,得到第二尺度的第四聚合特征后;将第二尺度的第四聚合特征、第二尺度的第二聚合特征及第二尺度的第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到第三尺度的第四聚合特征;将第三尺度的第四聚合特征、第三尺度的第二聚合特征及第三尺度的第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到第四尺度的第四聚合特征;将第四尺度的第四聚合特征、第四尺度的第二聚合特征及第四尺度的第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到第五尺度的第四聚合特征,该第五尺度与第一聚合特征的尺度相等。
在一种可能实现方式中,该步骤2044可以包括:继续将当前得到的第四聚合特征、与第四聚合特征尺度相等的第二聚合特征和第三聚合特征、及第二聚合特征的概率图进行融合处理,将融合后的特征进行反卷积处理,得到上一级尺度的第四聚合特征。
按照尺度由小到大的顺序，将相同尺度的第二聚合特征、第三聚合特征、第四聚合特征及该第二聚合特征对应的概率图进行融合处理，对融合后的特征进行反卷积处理，重复执行上述步骤，从而能够得到最大尺度的第四聚合特征，该最大尺度与第一聚合特征的尺度相等。
由于概率图中包括第二聚合特征中的每个像素位置对应的概率,则在获取多个第四聚合特征时,通过融入第二聚合特征的概率图,使得多个聚合特征进行融合时考虑到了概率对各个像素位置的影响,从而提高了得到的第四聚合特征的准确性,以使后续能够提高得到的深度图像的准确性。
2045、计算机设备将当前得到的第四聚合特征与第一聚合特征进行融合处理,得到第五聚合特征。
由于第四聚合特征的尺度与第一聚合特征的尺度相等,则将第四聚合特征与第一聚合特征进行融合处理,使融合后的第五聚合特征与第一聚合特征的尺度相等。且通过每个卷积层输出的特征图集合对应有聚合特征,则将多个卷积层对应的聚合特征进行融合,使得到的第五聚合特征中包括多个卷积层输出的特征图的特征,增加了第五聚合特征包括的信息量,从而提高了获取到的第五聚合特征的准确性。
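上述步骤2043-2045的逐级融合与反卷积过程可以用如下示意代码表示（其中融合卷积层、反卷积层的类型与通道配置均为示意性假设，且未包含步骤2044中融合概率图的可选方式）：

```python
import torch


def coarse_to_fine_fusion(first_agg, second_aggs, third_aggs, fuse_convs, deconvs):
    """first_agg: 最大尺度的第一聚合特征；second_aggs / third_aggs: 按尺度由小到大排列、
    尺度一一对应的第二/第三聚合特征列表；fuse_convs / deconvs: 各级的融合卷积层与反卷积层
    （如 nn.Conv2d 与 nn.ConvTranspose2d，若聚合特征为三维卷则换用对应的 3D 层），
    其通道与步长配置均为示意性假设。"""
    # 步骤2043：第一尺度的第二、第三聚合特征融合后反卷积，得到第二尺度的第四聚合特征
    fused = fuse_convs[0](torch.cat([second_aggs[0], third_aggs[0]], dim=1))
    fourth = deconvs[0](fused)
    # 步骤2044：逐级与同尺度的第二、第三聚合特征融合并反卷积，直至与第一聚合特征尺度相等
    for level in range(1, len(second_aggs)):
        fused = fuse_convs[level](torch.cat([fourth, second_aggs[level], third_aggs[level]], dim=1))
        fourth = deconvs[level](fused)
    # 步骤2045：第四聚合特征与第一聚合特征融合，得到第五聚合特征
    return torch.cat([fourth, first_agg], dim=1)
```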
2046、计算机设备根据第一聚合特征对应的概率图，将第五聚合特征进行卷积处理，得到深度图像。
其中,概率图用于表示第一聚合特征中每个像素位置对应的概率,每个概率用于表示每个像素位置对应的深度值正确的概率。该概率图可以由概率图获取模型对该第一聚合特征进行卷积处理得到的,该概率图获取模型可以包括一个编码器和一个解码器,通过编码器对第一聚合特征进行编码,而后通过解码器进行解码得到该概率图,该概率图获取模型可以为3D CNN(3Dimension Convolutional Neural Networks,三维卷积神经网络)模型,或者其他神经网络模型。
由于第五聚合特征的尺度与该第一聚合特征的尺度相等，则第五聚合特征中每个像素位置与第一聚合特征中的每个像素位置一一对应，则第五聚合特征中的每个像素位置，与该概率图中的概率一一对应，因此将该第五聚合特征与该概率图进行卷积处理，从而得到深度图像，通过在聚合特征中融入对应的概率，以提高得到的深度图像的准确性。
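上文提到的概率图获取模型的一种示意性结构如下（编码器/解码器的层数、通道数与3D卷积的具体配置均为示例取值，并非本申请限定的3D CNN结构，假设输入的深度、高、宽均可被4整除）：

```python
import torch.nn as nn
import torch.nn.functional as F


class ProbabilityNet(nn.Module):
    """示意性的概率图获取模型：3D 卷积编码器 + 3D 反卷积解码器。"""

    def __init__(self, in_channels):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 8, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(16, 8, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(8, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, volume):                                   # volume: [B, C, D, H, W] 的聚合特征
        logits = self.decoder(self.encoder(volume)).squeeze(1)   # [B, D, H, W]
        return F.softmax(logits, dim=1)                          # 沿深度维归一化，得到概率图
```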
对于将第五聚合特征与该概率图进行卷积处理的方式,在一种可能实现方式中,与该第一聚合特征对应的卷积层对应多个深度值,该第一聚合特征是由多个第二特征图与参考特征图融合处理得到的,且每个第二特征图对应一个深度值,则该第五聚合特征中包括多个特征图,该多个特征图的个数与多个深度值的个数相等;则该步骤2046可以包括:确定该第五聚合特征中每个特征图对应的深度值,根据第一聚合特征对应的概率图,确定该每个特征图对应的概率,将多个特征图对应的深度值及多个特征图对应的概率进行加权处理,得到预测深度,通过该预测深度构成深度图像。
对于上述加权处理的方式,每个特征图对应的深度值d、每个特征图对应的概率P及预测深度E,满足以下关系:
E = ∑_{d=d_min}^{d_max} d × P(d)
其中，d_min表示多个深度值中的最小值；d_max表示多个深度值中的最大值；P(d)表示深度值d对应的概率。
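上述加权处理得到预测深度的过程可以用如下示意代码表示（假设概率已沿深度维归一化）：

```python
import torch


def regress_depth(prob_volume, depth_values):
    """prob_volume: [B, D, H, W]，每个深度值对应的概率 P(d)；
    depth_values: [D] 的一维张量，即多个深度值。按 E = sum_d d * P(d) 计算预测深度。"""
    depth = depth_values.view(1, -1, 1, 1)            # [1, D, 1, 1]
    return torch.sum(depth * prob_volume, dim=1)      # [B, H, W] 的预测深度图
```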
需要说明的是，上述实施例中通过多张目标图像获取深度图像的过程，可以通过深度图像生成模型来实现，通过将多张目标图像输入至该深度图像生成模型中，该深度图像生成模型对多张目标图像进行处理，输出深度图像。其中，该深度图像生成模型可以为VA-MVSNet(View Aggregation Multi-view Stereo Network，一种网络模型)或者其他网络模型。
在对该深度图像生成模型进行训练时，获取多个样本图像及对应的深度图像，将样本图像作为深度图像生成模型的输入，将该深度图像作为该深度图像生成模型的输出，对该深度图像生成模型进行迭代训练。
例如，通过DTU(Technical University of Denmark,丹麦技术大学)数据集对深度图像生成模型进行训练，样本图像的数目为3，每个样本图像的分辨率为640x512，预设深度范围从425毫米到935毫米，深度层数为192层。采用初始学习率为0.1、衰减参数为0.9的Adam(一种优化算法)训练该深度图像生成模型，对该深度图像生成模型中的权重矩阵w和偏置参数b进行调整。在每次迭代过程中，将输出的深度图像与真实的深度图像进行对比，得到预测结果误差，根据该预测结果误差对深度图像生成模型的参数进行调整，以使该深度图像生成模型的损失函数之和减小。在通过多个尺度的样本图像对该深度图像生成模型进行训练时，每个尺度的损失函数参数λ分别为{0.32,0.16,0.04,0.01}，多个尺度的数目为4，训练时使用的GPU(Graphics Processing Unit,图形处理器)数目也为4。
另外，在对深度图像生成模型进行训练的过程中，需要对该深度图像生成模型进行测试。例如，在测试时，输入图片数目为5，深度层数为192，金字塔层数为3，降采样参数为0.5。在DTU数据集上对深度图像生成模型进行测试时，输入图片尺度为1600x1184；在Tanks and Temples(一种数据集)上对深度图像生成模型进行测试时，输入图片尺度为1920x1056。
在对该深度图像生成模型进行训练的过程中，可以根据该深度图像生成模型的损失函数之和，对该深度图像生成模型进行训练，当该损失函数之和达到预设阈值时，完成对该深度图像生成模型的训练。该损失函数之和可以表示为E，满足以下关系：
E = ∑_{l=l_1}^{L} λ_l ∑_{x∈X_valid} ||d_l(x) - d̂_l(x)||_1
其中，l为多个卷积层中任一个卷积层，该l为大于0、且不大于L的整数；l_1为多个卷积层中第一个卷积层；L用于表示获取特征图的多个卷积层的个数；λ_l为卷积层l对应的损失函数参数；x为深度图像中每个像素，X_valid为每个深度图像中包含的所有像素；d_l(x)表示像素x的真实深度，d̂_l(x)表示像素x的预测深度。
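该损失函数之和的一种示意性实现如下（以逐像素L1误差计算真实深度与预测深度之差，掩码与λ的取值均沿用上文示例，属假设性草图）：

```python
import torch.nn.functional as F


def multi_scale_loss(pred_depths, gt_depths, masks, lambdas=(0.32, 0.16, 0.04, 0.01)):
    """pred_depths / gt_depths: 各尺度（卷积层 l）对应的预测深度与真实深度张量列表；
    masks: 各尺度的有效像素（X_valid）布尔掩码列表；lambdas: 各尺度的损失函数参数 λ_l。"""
    total = 0.0
    for pred, gt, mask, lam in zip(pred_depths, gt_depths, masks, lambdas):
        total = total + lam * F.l1_loss(pred[mask], gt[mask])   # 仅在有效像素上计算 L1 误差
    return total
```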
如图7所示,该深度图像生成模型包括第一卷积模型701、第二卷积模型702、第三卷积模型703和第四卷积模型704。第一卷积模型701与上述步骤202中的卷积模型相同,用于获取目标图像的特征图,将每个第一卷积层7011输出的特征图集合输入至第二卷积模型702;第二卷积模型702对每个特征图集合进行视角聚合,输出第一聚合特征705和第二聚合特征706;第三卷积模型703通过多个第二卷积层7031对第一聚合特征705进行多级卷积处理,得到多个第三聚合特征707;第四卷积模型704通过多个第三卷积层7041,执行上述步骤2043-2046,输出深度图像708。
图8是本申请实施例提供的一种深度图像生成方法的流程图,如图8所示,该方法包括:
801、计算机设备按照多个不同的视角拍摄目标物体,得到多张原始图像,将多张原始图像确定为目标图像集合。
该步骤与上述步骤201中获取原始图像的方式类似,在此不再赘述。
802、计算机设备对多张原始图像进行多轮尺度调整,得到多组目标图像集合。
其中,每组目标图像集合包括同一尺度的多张目标图像,不同目标图像集合中的目标图像的尺度不同。
对多张原始图像进行尺度调整可以为：将多张原始图像进行缩小，得到更小尺度的多张目标图像；或者将多张原始图像进行放大，得到更大尺度的多张目标图像。由于多张原始图像的尺度相等，则在每轮对多张原始图像进行尺度调整后，得到的多张目标图像的尺度相等，不同轮尺度调整得到的目标图像的尺度不同。
对于多轮尺度调整,在一种可能实现方式中,对多张原始图像进行第一轮尺度调整,得到第一组目标图像集合,对上一轮得到的目标图像集合的多张目标图像进行下一轮尺度调整,得到下一组目标图像集合,直至得到多组目标图像集合。
例如,该多轮包括3轮,对多张原始图像进行第一轮尺度调整,得到第一组目标图像集合,对第一组目标图像集合中的多张目标图像进行第二轮尺度调整,得到第二组目标图像集合,对第二组目标图像集合中的多张目标图像进行第三轮尺度调整,得到第三组目标图像集合。
另外,通过步骤801-802得到的多组目标图像集合,可以构成图像金字塔。在图像金字塔中,最底层的图像的尺度最大,随着图像金字塔中层级的增加,相应的层级中的图像的尺度减小。多张原始图像对应的目标图像集合即为该图像金字塔的最底层,对该多张原始图像进行第一轮尺度调整,得到该最底层的上一层的目标图像集合,对上一层的目标图像集合进行一轮尺度调整,得到更上一层的目标图像集合,重复多轮尺度调整,即可构成包含预设数量层目标图像集合的图像金字塔。
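上述多轮尺度调整构成图像金字塔的过程可以用如下示意代码表示（缩放比例与插值方式均为示例假设）：

```python
import cv2


def build_image_pyramid(original_images, num_rounds=2, scale=0.5):
    """original_images: 多张原始图像（numpy 数组列表），作为图像金字塔的最底层；
    num_rounds: 尺度调整的轮数；scale: 每轮的缩放比例。"""
    pyramid = [original_images]            # 第0层：原始图像对应的目标图像集合
    current = original_images
    for _ in range(num_rounds):
        current = [cv2.resize(img, None, fx=scale, fy=scale,
                              interpolation=cv2.INTER_LINEAR) for img in current]
        pyramid.append(current)            # 对上一层图像再做一轮尺度调整，得到更上一层
    return pyramid                         # 包含 num_rounds + 1 组目标图像集合
```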
803、计算机设备对于多组目标图像集合，分别执行上述步骤201-204，得到每组目标图像集合对应的深度图像。
由于多组目标图像集合中，每组目标图像集合包括多张目标图像，则分别将每组目标图像集合中的多张目标图像，作为上述步骤201中的多张目标图像，对该多张目标图像进行处理，得到每组目标图像集合对应的深度图像，即得到多个深度图像。
由于不同目标图像集合中的图像的尺度不同,则不同组的目标图像集合对应的深度图像的尺度不同,即对于多组目标图像集合,得到多个尺度的深度图像。
804、计算机设备将多组目标图像集合对应的深度图像进行融合处理,得到融合后的深度图像。
由于多组目标图像集合对应的深度图像的尺度不同,不同尺度的深度图像中包含的深度值不同,尺度越大的深度图像包含的深度值越多,因此,在将多个尺度的深度图像进行融合处理时,可以按照尺度由小到大的顺序,依次将多个尺度的深度图像进行融合。通过将多个尺度的深度图像进行融合,丰富了融合后的深度图像的深度值,从而提高了融合后的深度图像的准确性。
对于将多组目标图像集合对应的深度图像进行融合处理的方式,在一种可能实现方式中,该步骤804可以包括:由最小尺度的深度图像开始,将当前深度图像中满足预设条件的第一像素的深度值,替换上一尺度的深度图像中与第一像素对应的第二像素的深度值,直至替换最大尺度的深度图像中的深度值后,得到最大尺度的深度图像替换深度值后的深度图像。其中,深度图像中包括多个像素,每个像素对应有深度值。
在相邻的两个尺度的深度图像中,第一像素与第二像素对应,表示第一像素与第二像素对应的位置相同,满足预设条件是指第一像素的深度值比第二像素的深度值更准确。因此,将小尺度的深度图像中准确率高的第一像素的深度值,替换上一尺度的深度图像中第二像素的深度值,从而使得替换后的上一尺度的深度图像中的各个像素的深度值更准确。按照深度图像的尺度由小到大的顺序,依次以小尺度的深度图像中的第一像素替换上一尺度的第二像素的深度值,多次替换处理后,使得到的最大尺度的深度图像中的各个像素的深度值更准确,从而提高了获取到的深度图像的准确性。
通过获取到多组图像集合对应的深度图像,即得到的多个尺度深度图像构成深度图的图像金字塔,通过Multi-metric Pyramid Depth Map Aggregation(多尺度度量金字塔深度图聚合),将多个尺度的深度图像进行融合,得到融合后的深度图像。
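上述由最小尺度开始逐级替换深度值的融合过程可以用如下示意代码表示（此处以最近邻上采样近似相邻尺度间的像素映射关系，阈值沿用下文的Y、Z示例取值，属假设性草图）：

```python
import numpy as np


def upsample_nearest(image, out_shape):
    """将小尺度的深度图/概率图按最近邻方式放大到 out_shape，用于近似相邻尺度间的像素映射关系。"""
    rows = np.minimum(np.arange(out_shape[0]) * image.shape[0] // out_shape[0], image.shape[0] - 1)
    cols = np.minimum(np.arange(out_shape[1]) * image.shape[1] // out_shape[1], image.shape[1] - 1)
    return image[np.ix_(rows, cols)]


def fuse_depth_pyramid(depths, probs, hi_thr=0.9, lo_thr=0.5):
    """depths / probs: 按尺度由小到大排列的深度图与概率图列表。"""
    fused_depth, fused_prob = depths[0], probs[0]
    for depth, prob in zip(depths[1:], probs[1:]):
        up_depth = upsample_nearest(fused_depth, depth.shape)
        up_prob = upsample_nearest(fused_prob, depth.shape)
        replace = (up_prob > hi_thr) & (prob < lo_thr)   # 小尺度像素更可靠且大尺度像素不可靠
        depth, prob = depth.copy(), prob.copy()
        depth[replace] = up_depth[replace]               # 用第一像素的深度值替换第二像素的深度值
        prob[replace] = up_prob[replace]                 # 同步替换概率，供下一级融合使用
        fused_depth, fused_prob = depth, prob
    return fused_depth, fused_prob                       # 最大尺度的融合后深度图及其概率图
```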
对于确定相邻尺度的深度图像中对应的像素的方式,在一种可能实现方式中,可以包括以下步骤:
步骤1:对于相邻尺度的第一深度图像和第二深度图像,根据第一深度图像与第二深度图像之间的像素映射关系,将第二深度图像中任一第二像素映射到第一深度图像中,得到第一像素。其中,第二深度图像的尺度大于第一深度图像的尺度。
其中,像素映射关系中包括第一深度图像中多个像素与第二深度图像中多个像素之间的对应关系。由于第一深度图像和第二深度图像均是通过多张目标图像得到的,不同的深度图像对应的目标图像的尺度不同,而不同尺度的目标图像中均是通过对原始图像进行尺度调整得到的,因此可以确定第一深度图像与第二深度图像中多个像素之间的对应关系,从而可以获取到第一深度图像与第二深度图像之间的像素映射关系。
由于第一深度图像的尺度小于第二深度图像的尺度，在确定第一深度图像与第二深度图像中多个像素之间的像素映射关系时，存在两种情况：若第一深度图像中包含的像素个数与第二深度图像中包含的像素个数相同，则第一深度图像中每个第一像素的尺寸小于第二深度图像中每个第二像素的尺寸；若第一深度图像中包含的像素的尺寸与第二深度图像中包含的像素的尺寸相等，则第一深度图像中第一像素的个数小于第二深度图像中第二像素的个数，每个第一像素对应多个第二像素。
步骤2:根据像素映射关系,将第一像素反映射到第二深度图像中,得到第三像素。
在本申请实施例中,通过大尺度的深度图像中的像素,确定小尺度的深度图像中对应的像素的过程为映射过程;通过小尺度的深度图像中的像素,确定大尺度的深度图像中对应的像素的过程称为反映射过程。由于第一深度图像与第二深度图像的尺度不同,无法保证第一深度图像中与第二深度图像中的像素一一对应,因此通过第二深度图像中的第二像素映射到第一深度图像时,得到第一像素,则再将第一像素反映射到第二深度图像时,得到的第三像素与第二像素之间会产生差异,使得到的第三像素与第二像素不同。
步骤3:响应于第一像素与第三像素之间的距离小于第一预设阈值,确定第一像素与第二像素对应。
其中，第一预设阈值可以为预设的任意数值，如1、2等。第一像素与第三像素之间的距离小于第一预设阈值，表示第一像素与第二像素之间满足图像一致性，因此可以确定第一像素与第二像素对应。
在确定第一像素与第三像素之间的距离时，可以在第一深度图像中，根据第一像素的坐标值与第三像素的坐标值，确定第一像素与第三像素之间的距离。在确定第一像素与第二像素对应时，该第一像素的坐标值P_1、第三像素的坐标值P_3满足以下关系：
||P_1 - P_3||_2 < M
其中,M为任意的常数,如M为1。
对于确定第一像素与第二像素对应的方式,在一种可能实现方式中,该步骤3可以包括:响应于距离小于第一预设阈值,且第一像素与第三像素对应的深度值之间的差异数值小于第二预设阈值,确定第一像素与第二像素对应。
其中，第二预设阈值可以为预设的任意数值。第一像素与第三像素之间的距离小于第一预设阈值，表示第一像素与第二像素之间满足图像一致性，第一像素与第三像素对应的深度值之间的差异数值小于第二预设阈值，表示第一像素与第二像素之间满足几何一致性，因此可以确定第一像素与第二像素对应。
在第一深度图像及第二深度图像中，每个像素均具有对应的深度值。第一像素与第三像素对应的深度值之间的差异数值小于第二预设阈值时，则第一像素对应的深度值D(P_1)、第三像素对应的深度值d_3满足以下关系：
||D(P_1) - d_3||_2 < 0.01·D(P_1)
对于确定第一像素满足预设条件的方式,在一种可能实现方式中,响应于第一像素的深度值对应的概率大于第二预设阈值,且第二像素的深度值对应的概率小于第三预设阈值,确定第一像素满足预设条件。
其中,第二预设阈值和第三预设阈值均可以是预设的任意数值,如第二预设阈值为0.9,第三预设阈值为0.5。第一像素的深度值对应的概率大于第二预设阈值,且第二像素的深度值对应的概率小于第三预设阈值,表示第一像素的深度值比第二像素的深度值的准确性高,因此,确定第一像素满足预设条件,后续可以将第一像素的深度值替换第二像素的深度值。
在第一像素满足预设条件时，第一像素的深度值对应的概率P(P_1)、第二像素的深度值对应的概率P(P_2)，满足以下关系：
P(P_1) > Y，P(P_2) < Z
其中,Y为第二预设阈值、Z为第三预设阈值,Y、Z均为任意的常数,且Z小于Y,如Y为0.9,Z为0.5。
另外，在确定深度图像中每个像素对应的概率时，通过上述步骤2046可知，由于第五聚合特征中的每个像素位置与第一聚合特征对应的概率图中的概率一一对应，则根据第五聚合特征中每个特征图对应的深度值及该概率图，可以确定该每个特征图对应的概率，即可确定每个深度值对应的概率；对于深度图像中的任一像素，根据该像素在该深度图像中的预测深度，及第五聚合特征中的特征图对应的多个深度值，从该多个深度值中确定预设数目的深度值，将该预设数目的深度值对应的概率之和，确定为该深度图像中该像素的概率。其中，预设数目的深度值，为该多个深度值中与该预测深度最相近的预设数目的深度值。该预设数目可以为预设的任意数值，如4或5等。
例如，对于深度图像中的任一像素，其在深度图像中的预测深度为1，预设数目为4，多个深度值为0.2、0.4、0.6、0.8、1.2、1.4、1.6、1.8，则根据该预测深度1，确定最相近的预设数目的深度值为0.6、0.8、1.2、1.4，将该预设数目的深度值分别对应的概率之和，作为该深度图像中该像素的概率。
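将与预测深度最相近的预设数目个深度值的概率求和、作为像素概率的过程，可以用如下示意代码表示：

```python
import numpy as np


def pixel_probability(pred_depth, depth_values, depth_probs, preset_num=4):
    """pred_depth: 某像素的预测深度；depth_values: [D] 多个深度值；
    depth_probs: [D] 每个深度值对应的概率；preset_num: 预设数目。"""
    nearest = np.argsort(np.abs(depth_values - pred_depth))[:preset_num]   # 与预测深度最相近的若干深度值
    return float(depth_probs[nearest].sum())                               # 其概率之和即为该像素的概率
```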
如图9所示，对于相邻的两个尺度的深度图像，第一深度图像901的尺度小于第二深度图像902的尺度，该第一深度图像901是通过其他多个尺度的深度图像融合后得到的，确定第一深度图像901对应的第一概率图903，及第二深度图像902对应的第二概率图904，根据第一概率图903及第二概率图904，将第一深度图像901和第二深度图像902进行融合，将第一深度图像901中满足预设条件的第一像素的深度值，替换第二深度图像902中与第一像素对应的第二像素的深度值，得到第三深度图像905，该第三深度图像905的尺度与第二深度图像902的尺度相等，并且，将第一概率图903中与第一像素对应的概率，替换第二概率图904中与第二像素对应的概率，生成第三深度图像905对应的第三概率图906。
805、计算机设备对深度图像进行转化处理,得到点云数据。
该步骤与上述步骤205类似,在此不再赘述。
806、计算机设备对点云数据进行聚合处理,得到目标物体的三维模型。
该步骤与上述步骤206类似,在此不再赘述。
需要说明的是,本申请实施例仅是以多张目标图像中任一目标图像作为参考图像进行说明的,而在另一实施例中,分别将多张目标图像中每一张目标图像分别作为参考图像,重复执行步骤801-805,从而得到多个点云数据,则在执行步骤806时,将多个点云数据进行聚合处理,得到目标物体的三维模型。
本申请实施例提供的方法,获取多张目标图像,该多张目标图像是按照不同视角拍摄目标物体分别得到的,通过卷积模型中的多个卷积层,对多张目标图像进行多级卷积处理,得到多个卷积层分别输出的特征图集合,分别将每个特征图集合中的多个特征图进行视角聚合,得到每个特征图集合对应的聚合特征,将得到的多个聚合特征进行融合处理,得到深度图像。获取的多张目标图像是按照不同视角拍摄目标物体分别得到的,使得到的多张目标图像中包括目标物体不同角度的信息,丰富了获取到的目标图像的信息量,且通过多个卷积层的多级卷积处理,得到多个不同的特征图集合,丰富了特征图的信息量,将多个卷积层输出的特征 图进行融合处理,丰富了得到的深度图像中包含的信息量,从而提高了得到的深度图像的准确性。
并且,通过将多个尺度的深度图像进行融合处理,将低尺度的深度图像中的准确性高的深度值替换到高尺度的深度图像中,提高了深度图像的准确性,从而提高了获取到的三维模型的准确性。
并且,将多张目标图像中每张目标图像均作为参考图像,获取到多个点云数据,将多个点云数据进行聚合处理,丰富了点云数据包含的信息,从而提高了获取到的三维模型的准确性。
如图10所示,获取多张原始图像,将多张原始图像确定为第一目标图像集合1001,对第一目标图像集合进行两轮尺度调整,分别得到第二目标图像集合1002和第三目标图像集合1003,将每个目标图像集合分别输入至深度图像生成模型1004,得到多个尺度的深度图像1005,将多个深度图像进行融合,得到融合后的深度图像1006,对融合后的深度图像1006进行转化处理,对得到的点云数据进行聚合处理,得到目标物体的三维模型1007。
需要说明的是，本申请实施例中步骤801-804可以通过网络模型来实现，通过将多张原始图像输入至该网络模型中，该网络模型对多张原始图像进行处理，得到多组目标图像集合，获取每组目标图像集合对应的深度图像，将多个深度图像进行融合，输出融合后的深度图像。其中，该网络模型可以为PVA-MVSNet(Pyramid View Aggregation Multi-view Stereo Network，金字塔多视角立体几何神经网络模型)，或者其他网络模型。
图11是本申请实施例提供的一种生成三维模型的流程图,如图11所示,该方法包括:
1、用户通过终端的摄像头,按照不同视角对目标物体进行拍摄,得到多张原始图像。
2、终端通过传感器,确定每张原始图像对应的拍摄参数。
3、终端将多张原始图像及对应的拍摄参数输入至深度图像生成模型中,该深度图像生成模型输出目标物体的深度图像。
4、终端将深度图像转换成点云数据，对点云数据进行过滤处理，将过滤后的点云数据进行融合，得到目标物体的三维模型。
5、终端显示该目标物体的三维模型。
图12是本申请实施例提供的一种深度图像生成装置的结构示意图,如图12所示,该装置包括:
图像获取模块1201，用于获取多张目标图像，多张目标图像是按照不同视角拍摄目标物体分别得到的；
卷积处理模块1202,用于通过卷积模型中的多个卷积层,对多张目标图像进行多级卷积处理,得到多个卷积层分别输出的特征图集合;
视角聚合模块1203,用于分别将每个特征图集合中的多个特征图进行视角聚合,得到每个特征图集合对应的聚合特征;
特征融合模块1204,用于将得到的多个聚合特征进行融合处理,得到深度图像。
本申请实施例提供的装置,获取多张目标图像,该多张目标图像是按照不同视角拍摄目标物体分别得到的,通过卷积模型中的多个卷积层,对多张目标图像进行多级卷积处理,得到多个卷积层分别输出的特征图集合,分别将每个特征图集合中的多个特征图进行视角聚合, 得到每个特征图集合对应的聚合特征,将得到的多个聚合特征进行融合处理,得到深度图像。获取的多张目标图像是按照不同视角拍摄目标物体分别得到的,使得到的多张目标图像中包括目标物体不同角度的信息,丰富了获取到的目标图像的信息量,且通过多个卷积层的多级卷积处理,得到多个不同的特征图集合,丰富了特征图的信息量,将多个卷积层输出的特征图进行融合处理,丰富了得到的深度图像中包含的信息量,从而提高了得到的深度图像的准确性。
可选地,如图13所示,卷积处理模块1202,包括:
卷积处理单元1221,用于通过卷积模型中的第一个卷积层,对多张目标图像进行卷积处理,得到第一个卷积层输出的特征图集合,特征图集合包括多张目标图像对应的特征图;
卷积处理单元1221,还用于通过卷积模型中的下一个卷积层,对上一个卷积层输出的特征图集合中的每个特征图进行卷积处理,得到下一个卷积层输出的特征图集合,直至得到多个卷积层分别输出的特征图集合。
可选地,如图13所示,视角聚合模块1203,包括:
图像确定单元1231,用于将多张目标图像中的任一张目标图像作为参考图像,将多张目标图像中的其他目标图像作为第一图像;
对于任一特征图集合进行如下处理:
特征图确定单元1232,用于确定特征图集合中,参考图像对应的参考特征图及第一图像对应的第一特征图;
视角转换单元1233,用于按照第一图像与参考图像的拍摄视角的差异,将第一特征图进行视角转换,得到转换后的第二特征图;
第一融合处理单元1234,用于将参考特征图与第二特征图进行融合处理,得到聚合特征。
可选地，视角转换单元1233，还用于获取第一图像对应的第一拍摄参数及参考图像对应的参考拍摄参数；确定输出特征图集合的卷积层对应的多个深度值；根据第一拍摄参数与参考拍摄参数之间的差异，及多个深度值，确定与多个深度值对应的多个视角转换矩阵；根据多个视角转换矩阵，分别对第一特征图进行视角转换，得到转换后的多个第二特征图。
可选地,视角转换单元1233,还用于确定输出特征图集合的卷积层对应的深度层数;按照深度层数将预设深度范围进行划分,得到多个深度值。
可选地,视角转换单元1233,还用于将第一数量的参考特征图进行融合处理,得到参考图像对应的参考特征卷,第一数量等于多个深度值的数量;对于每个第一图像,将第一图像对应的第一特征图转换后的多个第二特征图进行融合处理,得到第一特征卷,将第一特征卷与参考特征卷之间的差值确定为第二特征卷;将确定的多个第二特征卷进行融合处理,得到聚合特征。
可选地,视角转换单元1233,还用于获取输出特征图集合的卷积层对应的权重矩阵,权重矩阵中包括卷积层输出的特征图中每个像素位置对应的权重;按照权重矩阵,将多个第二特征卷进行加权融合处理,得到聚合特征。
可选地,聚合特征、参考特征卷、第一特征卷、第二特征卷及权重矩阵,满足以下关系:
V′_{i,d,h,w} = V_{i,d,h,w} - V_{0,d,h,w}
C_{d,h,w} = (1/(N-1)) ∑_{i=1}^{N-1} U_{h,w} ⊙ V′_{i,d,h,w}
其中，i表示第一图像的序号，i为大于0、且不大于N-1的正整数；N表示多张目标图像的个数，N为大于1的整数；d表示多个深度值中的任一深度值，h表示特征图集合中的特征图的高度，w表示特征图集合中的特征图的宽度；V′_{i,d,h,w}表示第二特征卷，V_{i,d,h,w}表示第一特征卷，V_{0,d,h,w}表示参考特征卷，C_{d,h,w}表示聚合特征，U_{h,w}表示权重矩阵；⊙用于表示元素级乘法。
可选地,聚合特征、参考特征卷、第一特征卷、第二特征卷及权重矩阵,满足以下关系:
V′_{i,d,h,w} = V_{i,d,h,w} - V_{0,d,h,w}
C_{d,h,w} = (1/(N-1)) ∑_{i=1}^{N-1} U_{d,h,w} ⊙ V′_{i,d,h,w}
其中，i表示第一图像的序号，i为大于0、且小于等于N-1的正整数；N表示多张目标图像的个数，N为大于1的整数；d表示多个深度值中的任一深度值，h表示特征图集合中的特征图的高度；w表示特征图集合中的特征图的宽度；V′_{i,d,h,w}表示第二特征卷，V_{i,d,h,w}表示第一特征卷，V_{0,d,h,w}表示参考特征卷，C_{d,h,w}表示聚合特征，U_{d,h,w}表示与深度值d对应的权重矩阵；⊙用于表示元素级乘法。
可选地,多个卷积层输出的特征图的尺度依次减小;如图13所示,特征融合模块1204,包括:
聚合特征确定单元1241,用于将多个聚合特征中最大尺度的聚合特征作为第一聚合特征,将多个聚合特征中其他的多个聚合特征作为第二聚合特征;
卷积处理单元1242,用于将第一聚合特征进行多级卷积处理,得到多个第三聚合特征,多个第三聚合特征的尺度与多个第二聚合特征的尺度一一对应;
反卷积处理单元1243,用于将第一尺度的第二聚合特征与第一尺度的第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到第二尺度的第四聚合特征,第一尺度为多个第二聚合特征的最小尺度,第二尺度为第一尺度的上一级尺度;
反卷积处理单元1243,还用于继续将当前得到的第四聚合特征、与第四聚合特征尺度相等的第二聚合特征和第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到上一级尺度的第四聚合特征,直至得到与第一聚合特征尺度相等的第四聚合特征;
第二融合处理单元1244，用于将与第一聚合特征尺度相等的第四聚合特征与第一聚合特征进行融合处理，得到第五聚合特征；
卷积处理单元1242,还用于根据第一聚合特征对应的概率图,将第五聚合特征进行卷积处理,得到深度图像。
可选地,反卷积处理单元1243,还用于继续将当前得到的第四聚合特征、与第四聚合特征尺度相等的第二聚合特征、第三聚合特征、及第二聚合特征的概率图进行融合处理,将融合后的特征进行反卷积处理,得到上一级尺度的第四聚合特征。
可选地,如图13所示,图像获取模块1201,包括:
第一图像获取单元12011,用于按照多个不同的视角拍摄目标物体,得到多张目标图像;或者,
第二图像获取单元12012,用于按照多个不同的视角拍摄目标物体,得到多张原始图像;
尺度调整单元12013,用于对多张原始图像进行尺度调整,得到多张原始图像调整后的多张目标图像,多张目标图像的尺度相等。
可选地,尺度调整单元12013,还用于对多张原始图像进行多轮尺度调整,得到多组目标图像集合,每组目标图像集合包括同一尺度的多张目标图像,不同目标图像集合中的目标图像的尺度不同;
装置还包括:融合处理模块1205,用于将多组目标图像集合对应的深度图像进行融合处理,得到融合后的深度图像。
可选地,如图13所示,融合处理模块1205,包括:
第三融合处理单元1251,用于由最小尺度的深度图像开始,将当前深度图像中满足预设条件的第一像素的深度值,替换上一尺度的深度图像中与第一像素对应的第二像素的深度值,直至替换最大尺度的深度图像中的深度值后,得到最大尺度的深度图像替换深度值后的深度图像。
可选地,如图13所示,装置包括:
像素映射模块1206,用于对于相邻尺度的第一深度图像和第二深度图像,根据第一深度图像与第二深度图像之间的像素映射关系,将第二深度图像中任一第二像素映射到第一深度图像中,得到第一像素,第二深度图像的尺度大于第一深度图像的尺度;
像素反映射模块1207,用于根据像素映射关系,将第一像素反映射到第二深度图像中,得到第三像素;
第一像素确定模块1208,用于响应于第一像素与第三像素之间的距离小于第一预设阈值,确定第一像素与第二像素对应。
可选地,如图13所示,第一像素确定模块1208,包括:
像素确定单元1281,用于响应于距离小于第一预设阈值,且第一像素与第三像素对应的深度值之间的差异数值小于第二预设阈值,确定第一像素与第二像素对应。
可选地,如图13所示,装置包括:
第二像素确定模块1209,用于响应于第一像素的深度值对应的概率大于第二预设阈值,且第二像素的深度值对应的概率小于第三预设阈值,确定第一像素满足预设条件。
可选地,如图13所示,装置还包括:
转化处理模块1210,用于对深度图像进行转化处理,得到点云数据;
聚合处理模块1211,用于对点云数据进行聚合处理,得到目标物体的三维模型。
图14是本申请实施例提供的一种终端的结构示意图，可以实现上述实施例中计算机设备所执行的操作。该终端1400可以是便携式移动终端，比如：智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑、台式电脑、头戴式设备、智能电视、智能音箱、智能遥控器、智能话筒，或其他任意智能终端。终端1400还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端1400包括有:处理器1401和存储器1402。
处理器1401可以包括一个或多个处理核心，比如4核心处理器、8核心处理器等。存储器1402可以包括一个或多个计算机可读存储介质，该计算机可读存储介质可以是非暂态的，用于存储至少一个指令，该至少一个指令用于被处理器1401所执行，以实现本申请中方法实施例提供的深度图像生成方法。
在一些实施例中,终端1400还可选包括有:外围设备接口1403和至少一个外围设备。处理器1401、存储器1402和外围设备接口1403之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口1403相连。具体地,外围设备包括:射频电路1404、显示屏1405和音频电路1406中的至少一种。
射频电路1404用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路1404通过电磁信号与通信网络及其他通信设备进行通信。
显示屏1405用于显示UI(User Interface,用户界面)。该UI可以包括图形、文本、图标、视频及它们的任意组合。该显示屏1405可以是触摸显示屏，还可以用于提供虚拟按钮和/或虚拟键盘。
音频电路1406可以包括麦克风和扬声器。麦克风用于采集用户及环境的音频信号,并将音频信号转换为电信号输入至处理器1401进行处理,或者输入至射频电路1404以实现语音通信。出于立体声采集或降噪的目的,麦克风可以为多个,分别设置在终端1400的不同部位。麦克风还可以是阵列麦克风或全向采集型麦克风。扬声器则用于将来自处理器1401或射频电路1404的电信号转换为音频信号。
本领域技术人员可以理解,图14中示出的结构并不构成对终端1400的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
图15是本申请实施例提供的一种服务器的结构示意图,该服务器1500可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(Central Processing Units,CPU)1501和一个或一个以上的存储器1502,其中,存储器1502中存储有至少一条指令,至少一条指令由处理器1501加载并执行以实现上述各个方法实施例提供的方法。当然,该服务器还可以具有有线或无线网络接口、键盘及输入输出接口等部件,以便进行输入输出,该服务器还可以包括其他用于实现设备功能的部件,在此不做赘述。
服务器1500可以用于执行上述深度图像生成方法。
本申请实施例还提供了一种计算机设备,该计算机设备包括处理器和存储器,存储器中存储有至少一条程序代码,该至少一条程序代码由处理器加载并执行,以实现上述实施例的深度图像生成方法。
本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有至少一条程序代码,该至少一条程序代码由处理器加载并执行,以实现上述实施例的深度图像生成方法。
本申请实施例还提供了一种计算机程序,该计算机程序中存储有至少一条程序代码,该至少一条程序代码由处理器加载并执行,以实现上述实施例的深度图像生成方法。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请实施例的可选实施例,并不用以限制本申请实施例,凡在本申请实施例的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (16)

  1. 一种深度图像生成方法,所述方法包括:
    获取多张目标图像,所述多张目标图像是按照不同视角拍摄目标物体分别得到的;
    通过卷积模型中的多个卷积层,对所述多张目标图像进行多级卷积处理,得到所述多个卷积层分别输出的特征图集合,每个特征图集合包括所述多张目标图像对应的特征图;
    分别将所述每个特征图集合中的多个特征图进行视角聚合,得到所述每个特征图集合对应的聚合特征;
    将得到的多个聚合特征进行融合处理,得到深度图像。
  2. 根据权利要求1所述的方法,所述通过卷积模型中的多个卷积层,对所述多张目标图像进行多级卷积处理,得到所述多个卷积层分别输出的特征图集合,包括:
    通过所述卷积模型中的第一个卷积层,对所述多张目标图像进行卷积处理,得到所述第一个卷积层输出的特征图集合;
    通过所述卷积模型中的下一个卷积层,对上一个卷积层输出的特征图集合中的每个特征图进行卷积处理,得到所述下一个卷积层输出的特征图集合,直至得到所述多个卷积层分别输出的特征图集合。
  3. 根据权利要求1所述的方法,所述分别将所述每个特征图集合中的多个特征图进行视角聚合,得到所述每个特征图集合对应的聚合特征,包括:
    将所述多张目标图像中的任一张目标图像作为参考图像,将所述多张目标图像中的其他目标图像作为第一图像;
    对于任一特征图集合进行如下处理:
    确定所述特征图集合中,所述参考图像对应的参考特征图及所述第一图像对应的第一特征图;
    按照所述第一图像与所述参考图像的拍摄视角的差异,将所述第一特征图进行视角转换,得到转换后的第二特征图;
    将所述参考特征图与所述第二特征图进行融合处理,得到所述聚合特征。
  4. 根据权利要求3所述的方法,所述按照所述第一图像与所述参考图像的拍摄视角的差异,将所述第一特征图进行视角转换,得到转换后的第二特征图,包括:
    获取所述第一图像对应的第一拍摄参数及所述参考图像对应的参考拍摄参数;
    确定输出所述特征图集合的卷积层对应的多个深度值;
    根据所述第一拍摄参数与所述参考拍摄参数之间的差异，及所述多个深度值，确定与所述多个深度值对应的多个视角转换矩阵；
    根据所述多个视角转换矩阵,分别对所述第一特征图进行视角转换,得到转换后的多个第二特征图。
  5. 根据权利要求4所述的方法,所述确定输出所述特征图集合的卷积层对应的多个深度值,包括:
    确定输出所述特征图集合的卷积层对应的深度层数;
    按照所述深度层数将预设深度范围进行划分,得到所述多个深度值。
  6. 根据权利要求4所述的方法，所述第一图像包括多个，所述将所述参考特征图与所述第二特征图进行融合处理，得到所述聚合特征，包括：
    将第一数量的所述参考特征图进行融合处理,得到所述参考图像对应的参考特征卷,所述第一数量等于所述多个深度值的数量;
    对于每个第一图像,将所述第一图像对应的第一特征图转换后的多个第二特征图进行融合处理,得到第一特征卷,将所述第一特征卷与所述参考特征卷之间的差值确定为第二特征卷;
    将确定的多个第二特征卷进行融合处理,得到所述聚合特征。
  7. 根据权利要求6所述的方法,所述将确定的多个第二特征卷进行融合处理,得到所述聚合特征,包括:
    获取所述输出所述特征图集合的卷积层对应的权重矩阵,所述权重矩阵中包括所述卷积层输出的特征图中每个像素位置对应的权重;
    按照所述权重矩阵,将所述多个第二特征卷进行加权融合处理,得到所述聚合特征。
  8. 根据权利要求1所述的方法,所述多个卷积层输出的特征图的尺度依次减小;所述将得到的多个聚合特征进行融合处理,得到深度图像,包括:
    将所述多个聚合特征中最大尺度的聚合特征作为第一聚合特征,将所述多个聚合特征中其他的多个聚合特征作为第二聚合特征;
    将所述第一聚合特征进行多级卷积处理,得到多个第三聚合特征,所述多个第三聚合特征的尺度与所述多个第二聚合特征的尺度一一对应;
    将第一尺度的第二聚合特征与所述第一尺度的第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到第二尺度的第四聚合特征,所述第一尺度为多个第二聚合特征的最小尺度,所述第二尺度为所述第一尺度的上一级尺度;
    继续将当前得到的第四聚合特征、与所述第四聚合特征尺度相等的第二聚合特征和第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到上一级尺度的第四聚合特征,直至得到与所述第一聚合特征尺度相等的第四聚合特征;
    将与所述第一聚合特征尺度相等的第四聚合特征与所述第一聚合特征进行融合处理,得到第五聚合特征;
    根据所述第一聚合特征对应的概率图,将所述第五聚合特征进行卷积处理,得到所述深度图像。
  9. 根据权利要求8所述的方法,所述继续将当前得到的第四聚合特征、与所述第四聚合特征尺度相等的第二聚合特征和第三聚合特征进行融合处理,将融合后的特征进行反卷积处理,得到上一级尺度的第四聚合特征,包括:
    继续将当前得到的第四聚合特征、与所述第四聚合特征尺度相等的第二聚合特征、第三聚合特征、及所述第二聚合特征的概率图进行融合处理,将融合后的特征进行反卷积处理,得到上一级尺度的第四聚合特征。
  10. 根据权利要求1所述的方法,所述获取多张目标图像,包括:
    按照多个不同的视角拍摄所述目标物体,得到所述多张目标图像;或者,
    按照多个不同的视角拍摄所述目标物体,得到多张原始图像;
    对所述多张原始图像进行尺度调整,得到所述多张原始图像调整后的所述多张目标图像,所述多张目标图像的尺度相等。
  11. 根据权利要求10所述的方法,所述对所述多张原始图像进行尺度调整,得到所述多张原始图像调整后的所述多张目标图像,包括:
    对所述多张原始图像进行多轮尺度调整,得到多组目标图像集合,每组目标图像集合包括同一尺度的多张目标图像,不同目标图像集合中的目标图像的尺度不同;
    所述方法还包括:将所述多组目标图像集合对应的深度图像进行融合处理,得到融合后的深度图像。
  12. 根据权利要求11所述的方法,所述将所述多组目标图像集合对应的深度图像进行融合处理,得到融合后的深度图像,包括:
    由最小尺度的深度图像开始,将当前深度图像中满足预设条件的第一像素的深度值,替换上一尺度的深度图像中与所述第一像素对应的第二像素的深度值,直至替换最大尺度的深度图像中的深度值后,得到所述最大尺度的深度图像替换深度值后的深度图像。
  13. 根据权利要求12所述的方法,所述方法还包括:
    对于相邻尺度的第一深度图像和第二深度图像,根据所述第一深度图像与所述第二深度图像之间的像素映射关系,将所述第二深度图像中任一第二像素映射到所述第一深度图像中,得到所述第一像素,所述第二深度图像的尺度大于所述第一深度图像的尺度;
    根据所述像素映射关系,将所述第一像素反映射到所述第二深度图像中,得到第三像素;
    响应于所述第一像素与所述第三像素之间的距离小于第一预设阈值,确定所述第一像素与所述第二像素对应。
  14. 一种深度图像生成装置,所述装置包括:
    图像获取模块,用于获取多张目标图像,所述多张目标图像是按照不同视角拍摄目标物体分别得到的;
    卷积处理模块,用于通过卷积模型中的多个卷积层,对所述多张目标图像进行多级卷积处理,得到所述多个卷积层分别输出的特征图集合,每个特征图集合包括所述多张目标图像对应的特征图;
    视角聚合模块,用于分别将所述每个特征图集合中的多个特征图进行视角聚合,得到所述每个特征图集合对应的聚合特征;
    特征融合模块,用于将得到的多个聚合特征进行融合处理,得到深度图像。
  15. 一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行,以实现如权利要求1至13任一权利要求所述的深度图像生成方法。
  16. 一种计算机设备,该计算机设备包括处理器和存储器,存储器中存储有至少一条程序代码,该至少一条程序代码由处理器加载并执行,以实现如权利要求1至13任一权利要求所述的深度图像生成方法。
PCT/CN2020/127891 2020-02-26 2020-11-10 深度图像生成方法、装置及存储介质 WO2021169404A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/714,654 US20220230338A1 (en) 2020-02-26 2022-04-06 Depth image generation method, apparatus, and storage medium and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010119713.5 2020-02-26
CN202010119713.5A CN111340866B (zh) 2020-02-26 2020-02-26 深度图像生成方法、装置及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/714,654 Continuation US20220230338A1 (en) 2020-02-26 2022-04-06 Depth image generation method, apparatus, and storage medium and electronic device

Publications (1)

Publication Number Publication Date
WO2021169404A1 true WO2021169404A1 (zh) 2021-09-02

Family

ID=71183737

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/127891 WO2021169404A1 (zh) 2020-02-26 2020-11-10 深度图像生成方法、装置及存储介质

Country Status (3)

Country Link
US (1) US20220230338A1 (zh)
CN (1) CN111340866B (zh)
WO (1) WO2021169404A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018059A (zh) * 2022-08-09 2022-09-06 北京灵汐科技有限公司 数据处理方法及装置、神经网络模型、设备、介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340866B (zh) * 2020-02-26 2024-03-01 腾讯科技(深圳)有限公司 深度图像生成方法、装置及存储介质
CN112037142B (zh) * 2020-08-24 2024-02-13 腾讯科技(深圳)有限公司 一种图像去噪方法、装置、计算机及可读存储介质
CN112700483B (zh) * 2021-01-13 2023-02-17 上海微亿智造科技有限公司 用于提高表面缺陷检测精度的三锥视角融合方法、系统及介质
CN113313742A (zh) * 2021-05-06 2021-08-27 Oppo广东移动通信有限公司 图像深度估计方法、装置、电子设备及计算机存储介质
CN113436199B (zh) * 2021-07-23 2022-02-22 人民网股份有限公司 半监督视频目标分割方法及装置
CN116883479B (zh) * 2023-05-29 2023-11-28 杭州飞步科技有限公司 单目图像深度图生成方法、装置、设备及介质
CN117314754B (zh) * 2023-11-28 2024-03-19 深圳因赛德思医疗科技有限公司 一种双摄超光谱图像成像方法、系统及双摄超光谱内窥镜

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109461180A (zh) * 2018-09-25 2019-03-12 北京理工大学 一种基于深度学习的三维场景重建方法
CN110021069A (zh) * 2019-04-15 2019-07-16 武汉大学 一种基于网格形变的三维模型重建方法
CN110378943A (zh) * 2019-06-21 2019-10-25 北京达佳互联信息技术有限公司 图像处理方法、装置、电子设备及存储介质
CN110457515A (zh) * 2019-07-19 2019-11-15 天津理工大学 基于全局特征捕捉聚合的多视角神经网络的三维模型检索方法
US10482334B1 (en) * 2018-09-17 2019-11-19 Honda Motor Co., Ltd. Driver behavior recognition
CN110543581A (zh) * 2019-09-09 2019-12-06 山东省计算中心(国家超级计算济南中心) 基于非局部图卷积网络的多视图三维模型检索方法
CN111340866A (zh) * 2020-02-26 2020-06-26 腾讯科技(深圳)有限公司 深度图像生成方法、装置及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108833785B (zh) * 2018-07-03 2020-07-03 清华-伯克利深圳学院筹备办公室 多视角图像的融合方法、装置、计算机设备和存储介质
US11034357B2 (en) * 2018-09-14 2021-06-15 Honda Motor Co., Ltd. Scene classification prediction
CN109635824A (zh) * 2018-12-14 2019-04-16 深源恒际科技有限公司 一种图像匹配深度学习方法及系统
US11017586B2 (en) * 2019-04-18 2021-05-25 Adobe Inc. 3D motion effect from a 2D image
CN110728707B (zh) * 2019-10-18 2022-02-25 陕西师范大学 基于非对称深度卷积神经网络的多视角深度预测方法

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10482334B1 (en) * 2018-09-17 2019-11-19 Honda Motor Co., Ltd. Driver behavior recognition
CN109461180A (zh) * 2018-09-25 2019-03-12 北京理工大学 一种基于深度学习的三维场景重建方法
CN110021069A (zh) * 2019-04-15 2019-07-16 武汉大学 一种基于网格形变的三维模型重建方法
CN110378943A (zh) * 2019-06-21 2019-10-25 北京达佳互联信息技术有限公司 图像处理方法、装置、电子设备及存储介质
CN110457515A (zh) * 2019-07-19 2019-11-15 天津理工大学 基于全局特征捕捉聚合的多视角神经网络的三维模型检索方法
CN110543581A (zh) * 2019-09-09 2019-12-06 山东省计算中心(国家超级计算济南中心) 基于非局部图卷积网络的多视图三维模型检索方法
CN111340866A (zh) * 2020-02-26 2020-06-26 腾讯科技(深圳)有限公司 深度图像生成方法、装置及存储介质

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018059A (zh) * 2022-08-09 2022-09-06 北京灵汐科技有限公司 数据处理方法及装置、神经网络模型、设备、介质

Also Published As

Publication number Publication date
US20220230338A1 (en) 2022-07-21
CN111340866A (zh) 2020-06-26
CN111340866B (zh) 2024-03-01

Similar Documents

Publication Publication Date Title
WO2021169404A1 (zh) 深度图像生成方法、装置及存储介质
WO2020001168A1 (zh) 三维重建方法、装置、设备和存储介质
JP6657214B2 (ja) 画像ベースの深さ検知システムの精度測定
Zhou et al. Omnidirectional image quality assessment by distortion discrimination assisted multi-stream network
CN111369681A (zh) 三维模型的重构方法、装置、设备及存储介质
US9495755B2 (en) Apparatus, a method and a computer program for image processing
CN115205489A (zh) 一种大场景下的三维重建方法、系统及装置
Yang et al. Non-parametric depth distribution modelling based depth inference for multi-view stereo
CN115690382B (zh) 深度学习模型的训练方法、生成全景图的方法和装置
CN112258512A (zh) 点云分割方法、装置、设备和存储介质
CN114627244A (zh) 三维重建方法及装置、电子设备、计算机可读介质
CN115578515B (zh) 三维重建模型的训练方法、三维场景渲染方法及装置
CN112614110A (zh) 评估图像质量的方法、装置及终端设备
CN114792355B (zh) 虚拟形象生成方法、装置、电子设备和存储介质
WO2021105871A1 (en) An automatic 3d image reconstruction process from real-world 2d images
CN114531553A (zh) 生成特效视频的方法、装置、电子设备及存储介质
CN114998433A (zh) 位姿计算方法、装置、存储介质以及电子设备
US20240005541A1 (en) Image depth prediction method and electronic device
WO2024002064A1 (zh) 三维模型构建方法、装置、电子设备及存储介质
CN116363641A (zh) 一种图像处理方法、装置及电子设备
CN111382753B (zh) 光场语义分割方法、系统、电子终端及存储介质
CN112257653B (zh) 空间装饰效果图确定方法、装置、存储介质与电子设备
CN115170767A (zh) 线框结构的生成方法、装置、电子设备及可读存储介质
CN111652831B (zh) 对象融合方法、装置、计算机可读存储介质及电子设备
CN116152586A (zh) 模型训练方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20920873

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 06/02/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20920873

Country of ref document: EP

Kind code of ref document: A1