WO2020224244A1 - Method and apparatus for obtaining depth-of-field image


Info

Publication number
WO2020224244A1
Authority
WO
WIPO (PCT)
Prior art keywords: layer, feature, fusion, output, extraction
Application number: PCT/CN2019/121603
Other languages: French (fr), Chinese (zh)
Inventors: 赵培骁, 黄轩, 王孝宇
Original Assignee: 深圳云天励飞技术有限公司
Application filed by 深圳云天励飞技术有限公司
Publication of WO2020224244A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/10024 Color image
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Definitions

  • the present invention relates to the field of computers, in particular to a method and device for acquiring a depth map.
  • depth information is typically obtained with special equipment such as a dedicated depth camera, a binocular camera, or a laser rangefinder, and the cost of such equipment is relatively high.
  • current algorithms for constructing depth images rely on a single feature-extraction structure, so the extracted feature information is relatively limited, which is not conducive to constructing complex three-dimensional images.
  • the embodiments of the present invention provide a method and device for acquiring a depth map.
  • by setting multi-scale, multi-level feature extractors in a deep learning network architecture, depth image information can be obtained from a single image captured by an ordinary camera.
  • the first aspect of the present invention discloses a method for acquiring a depth map, and the method includes:
  • a neural network is constructed, and the neural network is used to perform multiple feature extraction and fusion on the target image to obtain a depth map of the target image.
  • the neural network includes N layers, and each layer includes a cascaded main feature extractor, an extraction and fusion module, and a fusion output device, where N is a positive integer greater than 1;
  • the main feature extractor of the first layer is used to perform feature extraction on the target image, and output the obtained feature map to the main feature extractor of the second layer and the extraction and fusion module and the fusion output device of the first layer;
  • the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer, and outputs the obtained feature map to the fusion output device of the first layer and to the extraction and fusion module and the fusion output device of the second layer;
  • the main feature extractor of the i-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (i-1)-th layer, and output the obtained feature map to the main feature extractor of the (i+1)-th layer and to the extraction and fusion module and the fusion output device of the i-th layer, where i is an integer and 1 < i < N;
  • the extraction and fusion module of the i-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)-th layer and the feature map output by the main feature extractor of the i-th layer, and output the obtained feature map to the fusion output device of the i-th layer and to the extraction and fusion module and the fusion output device of the (i+1)-th layer;
  • the main feature extractor of the N-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer, and output the obtained feature map to the extraction and fusion module and the fusion output device of the N-th layer;
  • the extraction and fusion module of the N-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)-th layer and the feature map output by the main feature extractor of the N-th layer, and output the obtained feature map to the fusion output device of the N-th layer;
  • the fusion output device of the N-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the N-th layer, the feature map output by the extraction and fusion module of the N-th layer, and the feature map output by the extraction and fusion module of the (N-1)-th layer, and output the obtained feature map to the fusion output device of the (N-1)-th layer;
  • the fusion output device of the i-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)-th layer, and the feature map output by the fusion output device of the (i+1)-th layer, and output the obtained feature map to the fusion output device of the (i-1)-th layer;
  • the fusion output device of the first layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion output device of the second layer, to obtain a depth map of the target image.
  • cascading means that multiple components or functional modules are connected in series, with the output of each component or functional module serving as the input of the next; the depth map is an image indicating the distance of each object in the shooting scene from the camera.
  • the extraction and fusion module of the j-th layer includes N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N;
  • the first auxiliary feature extractor of the first layer is used to perform feature extraction on the feature map output by the main feature extractor of the first layer, and output the obtained feature map to the second to N-th auxiliary feature extractors and the fusion output device of the first layer, and to the first auxiliary feature extractor of the second layer;
  • the k-th auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (k-1)-th auxiliary feature extractors of the first layer, and output the obtained feature map to the (k+1)-th to N-th auxiliary feature extractors and the fusion output device of the first layer, and to the k-th auxiliary feature extractor of the second layer, where k is an integer and 1 < k < N;
  • the N-th auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (N-1)-th auxiliary feature extractors of the first layer, and output the obtained feature map to the fusion output device of the first layer and the fusion output device of the second layer.
  • the first auxiliary feature extractor of the m-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain the first feature map;
  • the x-th auxiliary feature extractor of the m-th layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (x-1)-th auxiliary feature extractors of the m-th layer, and output the obtained feature map to the (x+1)-th to n-th auxiliary feature extractors and the fusion output device of the m-th layer, and to the x-th auxiliary feature extractor of the (m+1)-th layer, where x is an integer and 1 < x < n.
  • the first auxiliary feature extractor of the (N-1)-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer to obtain a fourth feature map; it receives the fifth feature map output by the first auxiliary feature extractor of the (N-2)-th layer, merges the fourth feature map with the fifth feature map to obtain a sixth feature map, and outputs the sixth feature map to the fusion output device of the (N-1)-th layer.
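As a reading aid, the layer-to-layer wiring described in the first aspect can be traced in code. The following is a hypothetical sketch, not the patented implementation: `extract` and `fuse` are stand-ins (an increment and a sum) so that only the data flow through the main feature extractors, extraction-and-fusion modules, and fusion output devices is illustrated, here for N = 4.

```python
# Hypothetical trace of the claimed N-layer wiring; "feature maps" are
# plain numbers so the routing between modules can be followed.
N = 4

def extract(x):          # stand-in for a (main) feature extractor
    return x + 1

def fuse(*maps):         # stand-in for feature extraction + fusion
    return sum(maps)

# Top-down pass: main feature extractors of layers 1..N.
main = {}
x = 0                                    # the target image
for i in range(1, N + 1):
    x = extract(x)                       # layer i main extractor
    main[i] = x

# Extraction-and-fusion modules: layer 1 uses only its own main
# extractor's output; layer i (>1) also fuses the module output of
# layer i-1.
ef = {1: extract(main[1])}
for i in range(2, N + 1):
    ef[i] = fuse(ef[i - 1], main[i])

# Bottom-up pass: fusion output devices of layers N..1.  Layer N fuses
# the N-th main/module outputs with the (N-1)-th module output; layer i
# also takes the fused map coming up from layer i+1; layer 1 emits the
# depth map.
fo = {N: fuse(main[N], ef[N], ef[N - 1])}
for i in range(N - 1, 1, -1):
    fo[i] = fuse(main[i], ef[i], ef[i - 1], fo[i + 1])
depth_map = fuse(main[1], ef[1], fo[2])
```

The numeric result is of course meaningless; the point is that every module receives exactly the inputs listed in the claims.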
  • the second aspect of the present invention discloses a depth map acquisition device, which includes an acquisition unit and a construction unit;
  • the acquiring unit is used to acquire a single target image;
  • the construction unit is used to construct a neural network, and the neural network is used to perform multiple feature extraction and fusion on the target image to obtain a depth map of the target image.
  • the neural network includes N layers, and each layer includes a cascaded main feature extractor, an extraction and fusion module, and a fusion output device, where N is a positive integer greater than 1;
  • the main feature extractor of the first layer is used to perform feature extraction on the target image, and output the obtained feature map to the main feature extractor of the second layer and the extraction and fusion module and the fusion output device of the first layer;
  • the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer, and outputs the obtained feature map to the fusion output device of the first layer and to the extraction and fusion module and the fusion output device of the second layer;
  • the main feature extractor of the i-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (i-1)-th layer, and output the obtained feature map to the main feature extractor of the (i+1)-th layer and to the extraction and fusion module and the fusion output device of the i-th layer, where i is an integer and 1 < i < N;
  • the extraction and fusion module of the i-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)-th layer and the feature map output by the main feature extractor of the i-th layer, and output the obtained feature map to the fusion output device of the i-th layer and to the extraction and fusion module and the fusion output device of the (i+1)-th layer;
  • the main feature extractor of the N-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer, and output the obtained feature map to the extraction and fusion module and the fusion output device of the N-th layer;
  • the extraction and fusion module of the N-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)-th layer and the feature map output by the main feature extractor of the N-th layer, and output the obtained feature map to the fusion output device of the N-th layer;
  • the fusion output device of the N-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the N-th layer, the feature map output by the extraction and fusion module of the N-th layer, and the feature map output by the extraction and fusion module of the (N-1)-th layer, and output the obtained feature map to the fusion output device of the (N-1)-th layer;
  • the fusion output device of the i-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)-th layer, and the feature map output by the fusion output device of the (i+1)-th layer, and output the obtained feature map to the fusion output device of the (i-1)-th layer;
  • the fusion output device of the first layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion output device of the second layer, to obtain a depth map of the target image.
  • the extraction and fusion module of the j-th layer includes N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N;
  • the first auxiliary feature extractor of the first layer is used to perform feature extraction on the feature map output by the main feature extractor of the first layer, and output the obtained feature map to the second to N-th auxiliary feature extractors and the fusion output device of the first layer, and to the first auxiliary feature extractor of the second layer;
  • the k-th auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (k-1)-th auxiliary feature extractors of the first layer, and output the obtained feature map to the (k+1)-th to N-th auxiliary feature extractors and the fusion output device of the first layer, and to the k-th auxiliary feature extractor of the second layer, where k is an integer and 1 < k < N;
  • the N-th auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (N-1)-th auxiliary feature extractors of the first layer, and output the obtained feature map to the fusion output device of the first layer and the fusion output device of the second layer.
  • the first auxiliary feature extractor of the m-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain the first feature map;
  • the x-th auxiliary feature extractor of the m-th layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (x-1)-th auxiliary feature extractors of the m-th layer, and output the obtained feature map to the (x+1)-th to n-th auxiliary feature extractors and the fusion output device of the m-th layer, and to the x-th auxiliary feature extractor of the (m+1)-th layer, where x is an integer and 1 < x < n.
  • the first auxiliary feature extractor of the (N-1)-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer to obtain a fourth feature map; it receives the fifth feature map output by the first auxiliary feature extractor of the (N-2)-th layer, merges the fourth feature map with the fifth feature map to obtain a sixth feature map, and outputs the sixth feature map to the fusion output device of the (N-1)-th layer.
  • a third aspect of the present invention discloses a storage medium in which a program code is stored, and when the program code is executed, the method of the first aspect is executed;
  • a fourth aspect of the present invention discloses an image fusion device, the device including a processor and a transceiver, where the transceiver function described in the second aspect can be implemented by the transceiver, and the logic function described in the second aspect (i.e., the specific function of each logic unit) can be implemented by the processor;
  • the fifth aspect of the present invention discloses a computer program product, the computer program product contains program code; when the program code is executed, the method of the first aspect is executed.
  • a method for acquiring a depth map is disclosed in the embodiment provided by the present invention.
  • a single target image is acquired; a neural network is constructed, and the neural network is used to perform multiple feature extraction and fusion on the target image to obtain a depth map of the target image.
  • a feature image can be extracted at each layer by setting multi-scale, multi-level feature extractors in the neural network, and the multiple feature images can then be fused to obtain a multi-scale, multi-level depth image, so that users can use the depth image for three-dimensional modeling or simulation; this makes it convenient to perform complex three-dimensional image processing based on a single image.
  • the invention greatly reduces the equipment cost by processing a single image to obtain a depth image.
  • FIG. 1 is a schematic diagram of an image fusion network architecture provided by an embodiment of the present invention
  • Figure 1a is a schematic diagram of a deep residual network provided by an embodiment of the present invention.
  • FIG. 2 is a depth prediction diagram provided by an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a method for acquiring a depth map according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of an apparatus for acquiring a depth map according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of the physical structure of a depth map acquiring device provided by an embodiment of the present invention.
  • the embodiment of the present invention provides a method and device for acquiring a depth map.
  • the method includes: acquiring a single target image; and constructing a neural network that performs multiple rounds of feature extraction and fusion on the target image to obtain a depth map of the target image.
  • a feature image can be extracted at each layer by setting multi-scale, multi-level feature extractors in the neural network, and the multiple feature images can then be fused to obtain a multi-scale, multi-level depth image, so that users can use the depth image for three-dimensional modeling or simulation; this makes it convenient to perform complex three-dimensional image processing based on a single image.
  • the invention greatly reduces the equipment cost by processing a single image to obtain a depth image.
  • the single target image acquired by the present invention can be an RGB image, a grayscale image, or a binary image.
  • this embodiment obtains the depth map from a single image captured by an ordinary camera.
  • the technology can be applied to personal computers (PCs) and small devices, including but not limited to mobile devices running Android or iOS.
  • the RGB picture refers to the picture obtained by the changes of the three color channels of red (R), green (G), and blue (B) and their mutual superposition.
  • RGB is the color representing the three channels of red, green, and blue. This standard includes almost all the colors that human vision can perceive, and it is one of the most widely used color systems at present.
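The RGB representation described above can be shown in a minimal, illustrative snippet (not part of the patent): a color image is an H x W x 3 array formed by superposing the red, green, and blue channel planes.

```python
import numpy as np

# A 2x2 pure-red image built by stacking the three channel planes.
h, w = 2, 2
r = np.full((h, w), 255, dtype=np.uint8)   # red channel at full intensity
g = np.zeros((h, w), dtype=np.uint8)       # green channel empty
b = np.zeros((h, w), dtype=np.uint8)       # blue channel empty
img = np.stack([r, g, b], axis=-1)          # H x W x 3 RGB image
```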
  • in the network structure diagram shown in Figure 1 (the diagram is only schematic; the present invention does not limit the number of layers or the number of auxiliary feature extractors), there are three main stages: first, the target image is preprocessed so that it meets the input requirements of the main feature extractor; second, feature maps are extracted with the main feature extractor and the auxiliary feature extractors; third, the extracted feature maps are fused.
  • the main feature extractor is a ResNet50 structure, a deep residual network, which performs feature extraction on images. Specifically, feature extraction extracts information such as object texture, object contours, and object edges from the image; how features are extracted is learned by each layer of feature extractors from sample inputs.
  • the main feature extractor can also be an AlexNet structure (a machine learning model named after Alex Krizhevsky) or a VGG structure (a model proposed by the Visual Geometry Group of the University of Oxford); the present invention is not limited in this respect.
  • as shown in the schematic diagram of the deep residual network in Figure 1a, the depth of a network affects the model's classification and recognition performance; for example, when a conventional network is stacked beyond a certain depth, the deeper the network, the more pronounced the vanishing-gradient problem becomes, and the worse the network's classification performance.
  • the deep residual network structure can deepen the network layer while preventing the gradient from disappearing, so as to achieve a better classification effect.
  • the deep residual network has a skip structure. For example, suppose the input of a certain sub-network is x and the expected output is H(x), where H(x) is the desired complex latent mapping; with a skip connection, the stacked layers only need to learn the residual F(x) = H(x) - x, and the block outputs F(x) + x.
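The skip structure can be sketched in a few lines. This is an illustrative toy block, not the patented network: the weights are arbitrary random matrices, and the block simply computes a residual branch F(x) and adds the input back through the shortcut.

```python
import numpy as np

# Toy residual block: output = activation(F(x) + x), where F is a small
# two-layer branch with arbitrary (untrained) weights.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 8))
W2 = rng.standard_normal((8, 8))

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x):
    f = relu(x @ W1) @ W2      # the residual branch F(x)
    return relu(f + x)         # skip connection adds x back

x = rng.standard_normal(8)
y = residual_block(x)
```

Because gradients can flow through the identity shortcut, stacking many such blocks avoids the vanishing-gradient effect described above.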
  • image preprocessing refers to preprocessing the input image so that it meets the input requirements of the deep residual network ResNet50. Specifically, image preprocessing essentially scales and crops the image.
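A minimal sketch of such a scale-and-crop step follows. The patent does not specify the target size or resampling method; the 224x224 input is the conventional ResNet50 default and nearest-neighbor resizing is used here purely for simplicity, so both are assumptions.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize via integer index maps."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def center_crop(img, size):
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def preprocess(img, size=224):
    """Scale so the short side equals `size`, then center-crop."""
    h, w = img.shape[:2]
    scale = size / min(h, w)
    img = resize_nearest(img, int(round(h * scale)), int(round(w * scale)))
    return center_crop(img, size)

img = np.zeros((480, 640, 3), dtype=np.uint8)   # dummy RGB frame
out = preprocess(img)                            # 224 x 224 x 3
```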
  • the auxiliary feature extractor fuses the feature map extracted by the previous feature extractor in the same layer, the feature map obtained by up-sampling from the lower layer, and the feature map obtained by down-sampling from the upper layer, to obtain a new feature map.
  • the auxiliary feature extractor contains several convolutional structures; its function is to fuse the feature map extracted by the previous feature extractor in the same layer with the feature map obtained by down-sampling from the upper layer to form a new feature map.
  • the number of auxiliary feature extractors in Figure 1 is 4; extractors 1-4 share the same structure, including the same size and number of convolution kernels.
  • the auxiliary feature extractor structure includes two convolutional layers and one activation layer.
  • the size of the convolution kernel is 3 ⁇ 3.
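The stated structure (two 3x3 convolutional layers followed by one activation layer) can be sketched as follows. This is a hypothetical single-channel, stride-1 illustration with arbitrary kernels; the activation is assumed to be ReLU, which the patent does not specify.

```python
import numpy as np

def conv3x3(x, k):
    """Naive single-channel 3x3 convolution with zero padding, stride 1."""
    h, w = x.shape
    p = np.pad(x, 1)
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (p[i:i + 3, j:j + 3] * k).sum()
    return out

def aux_extractor(x, k1, k2):
    y = conv3x3(x, k1)          # first 3x3 convolutional layer
    y = conv3x3(y, k2)          # second 3x3 convolutional layer
    return np.maximum(y, 0.0)   # activation layer (ReLU assumed)

# Sanity check: with identity kernels the block passes the input through.
x = np.ones((8, 8))
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
y = aux_extractor(x, identity, identity)
```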
  • the network structure for image feature collection in this embodiment is divided into 4 layers, each using one main feature extractor and several auxiliary feature extractors. As shown in Figure 1, the number of auxiliary feature extractors decreases layer by layer from the first layer to the fourth: the first layer has 4 auxiliary feature extractors, the second layer has 3, the third layer has 2, and the fourth layer has 1.
  • it should be pointed out that the auxiliary feature extractors of the second layer reuse any three of the four auxiliary feature extractors of the first layer; likewise, the third layer reuses any 2 of the first layer's 4 auxiliary feature extractors, and so on, which will not be listed one by one below.
  • Feature map 4_0, feature map 3_2, feature map 2_3, feature map 1_4, and feature map 0_5 respectively correspond to the result of fusion of the feature map extracted by the main feature extractor with the feature map extracted by the auxiliary feature extractor layer by layer.
  • the feature map 0_5 is the final depth prediction image (the depth prediction image shown in Figure 2).
  • the fusion method is element-wise addition. After the addition, the result undergoes two 3×3 convolutions and one activation-layer operation, and the processed result is then sent to the layer above and merged with that layer's image features.
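The fusion-and-pass-up step can be sketched as follows. This is an illustrative assumption-laden version: the two 3x3 convolutions are abbreviated to the activation alone so the addition and up-sampling are the focus, and nearest-neighbor 2x up-sampling is assumed as the way a lower-layer map is brought to the upper layer's resolution.

```python
import numpy as np

def fuse(a, b):
    """Element-wise addition followed by an activation.
    (The two 3x3 convolutions of the real module are omitted here.)"""
    return np.maximum(a + b, 0.0)

def upsample2x(x):
    """Nearest-neighbor 2x up-sampling (assumed)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

low = np.full((4, 4), 2.0)            # feature map coming from below
same = np.full((4, 4), 1.0)           # feature map extracted in this layer
up = upsample2x(fuse(low, same))      # fused result sent to the layer above
upper = np.full((8, 8), 0.5)          # upper layer's own feature map
merged = fuse(up, upper)              # merged with the upper layer's features
```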
  • it should be pointed out that Figure 2 can be read simply as: the closer an object is to the camera, the darker its color; the farther away, the lighter.
  • FIG. 3 is a schematic flowchart of a method for acquiring a depth map according to an embodiment of the present invention.
  • a method for acquiring a depth map provided by an embodiment of the present invention includes the following contents:
  • the execution subject of this embodiment may be a smart phone, a wearable device, an electronic device with a camera function, or a personal computer and other devices.
  • the following description takes a smartphone as the execution subject as an example.
  • the target image can be downloaded from the Internet, received from another electronic device, or captured with the device's camera.
  • the target image may be a single RGB image.
  • S102 Construct a neural network, which is used to perform multiple feature extraction and fusion on the target image to obtain a depth map of the target image.
  • the depth map refers to an image that can indicate the distance of each object in the shooting scene from the camera.
  • the neural network includes N layers, and each layer includes a cascaded main feature extractor, an extraction and fusion module, and a fusion output device, where N is a positive integer greater than 1; where the cascade is Refers to multiple components or functional modules in a linear series, and the output of the previous component or functional module is used as the input of the next component or functional module;
  • the number of layers can be chosen according to the capability of the execution device; it can be a system default or selected manually, and is not restricted here. Specific layer counts are used for illustration below.
  • the method further includes: determining whether the size of the target image exceeds a threshold, and if so, preprocessing the target image to obtain a processed target image.
  • the main feature extractor is used to extract at least one of the following information in the target image: object texture, object contour, object and object edge.
  • the functions of the main feature extractor, the extraction and fusion module, and the fusion output device of each layer are as follows:
  • the main feature extractor of the first layer is used to perform feature extraction on the target image, and output the obtained feature map to the main feature extractor of the second layer and the extraction and fusion module and the fusion output device of the first layer;
  • the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer, and outputs the obtained feature map to the fusion output device of the first layer and to the extraction and fusion module and the fusion output device of the second layer;
  • the main feature extractor of the i-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (i-1)-th layer, and output the obtained feature map to the main feature extractor of the (i+1)-th layer and to the extraction and fusion module and the fusion output device of the i-th layer, where i is an integer and 1 < i < N;
  • the extraction and fusion module of the i-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)-th layer and the feature map output by the main feature extractor of the i-th layer, and output the obtained feature map to the fusion output device of the i-th layer and to the extraction and fusion module and the fusion output device of the (i+1)-th layer;
  • the main feature extractor of the N-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer, and output the obtained feature map to the extraction and fusion module and the fusion output device of the N-th layer;
  • the extraction and fusion module of the N-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)-th layer and the feature map output by the main feature extractor of the N-th layer, and output the obtained feature map to the fusion output device of the N-th layer;
  • the fusion output device of the N-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the N-th layer, the feature map output by the extraction and fusion module of the N-th layer, and the feature map output by the extraction and fusion module of the (N-1)-th layer, and output the obtained feature map to the fusion output device of the (N-1)-th layer;
  • the fusion output device of the i-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)-th layer, and the feature map output by the fusion output device of the (i+1)-th layer, and output the obtained feature map to the fusion output device of the (i-1)-th layer;
  • the fusion output device of the first layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion output device of the second layer, to obtain a depth map of the target image.
  • the extraction and fusion module of the j-th layer includes N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N;
  • auxiliary feature extractor of each layer is as follows:
  • the first auxiliary feature extractor of the first layer is used to perform feature extraction on the feature map output by the main feature extractor of the first layer, and output the obtained feature map to the second to N-th auxiliary feature extractors and the fusion output device of the first layer, and to the first auxiliary feature extractor of the second layer;
  • the k-th auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (k-1)-th auxiliary feature extractors of the first layer, and output the obtained feature map to the (k+1)-th to N-th auxiliary feature extractors and the fusion output device of the first layer, and to the k-th auxiliary feature extractor of the second layer, where k is an integer and 1 < k < N;
  • the N-th auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (N-1)-th auxiliary feature extractors of the first layer, and output the obtained feature map to the fusion output device of the first layer and the fusion output device of the second layer.
  • the first auxiliary feature extractor of the m-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain the first feature map;
  • the x-th auxiliary feature extractor of the m-th layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (x-1)-th auxiliary feature extractors of the m-th layer, and output the obtained feature map to the (x+1)-th to n-th auxiliary feature extractors and the fusion output device of the m-th layer, and to the x-th auxiliary feature extractor of the (m+1)-th layer, where x is an integer and 1 < x < n.
  • the first auxiliary feature extractor of the (N-1)th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)th layer to obtain a fourth feature map; receive the fifth feature map output by the first auxiliary feature extractor of the (N-2)th layer; fuse the fourth feature map with the fifth feature map to obtain a sixth feature map; and output the sixth feature map to the fusion output device of the (N-1)th layer.
  • the length and width of the target image are X and Y respectively; each pass through a main feature extractor scales the length and width to 1/2 of the input.
  • the length and width of the original map of the first layer are therefore X/2 and Y/2 respectively (the original map of a layer is the image already processed by that layer's main feature extractor); likewise, the length and width of the original map of the second layer are X/4 and Y/4 respectively, and the length and width of the original map of the third layer are X/8 and Y/8 respectively. Since the third layer is the last layer, the map whose length and width are X/8 and Y/8 is the feature map of the third layer.
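As a worked illustration of the halving described above, a minimal sketch (the function name and integer division are illustrative assumptions, not part of the patent):

```python
# Each main feature extractor halves the length and width of its input,
# so the layer-l map of an X x Y target image measures X/2**l x Y/2**l.
def layer_sizes(x, y, n_layers):
    """Return the (length, width) of each layer's original map."""
    return [(x // 2 ** l, y // 2 ** l) for l in range(1, n_layers + 1)]

# A 3-layer network on a 640 x 480 image:
print(layer_sizes(640, 480, 3))  # [(320, 240), (160, 120), (80, 60)]
```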
  • the auxiliary feature extractors of each layer process the feature maps extracted within that layer.
  • suppose the auxiliary feature extractors of the first layer are numbered 1 and 2.
  • auxiliary feature extractor 1 processes the original map of the first layer to obtain feature map 1; the original map of the first layer and feature map 1 are then input into auxiliary feature extractor 2, which produces feature map 2.
  • in this way the feature map of the first layer is obtained. Since the second layer has only one auxiliary feature extractor, that extractor takes the original map of the second layer and feature map 1 of the first layer to obtain feature map 3; the feature map of the second layer is then obtained from feature map 3, the original map of the third layer, and feature map 2.
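The within-layer ordering described above can be simulated in a few lines; the extractor itself is stubbed out and the names are hypothetical, so only the wiring is shown:

```python
def run_layer(original, n_aux, extract):
    """Auxiliary extractor k fuses the layer's original map with the
    outputs of extractors 1..k-1 of the same layer."""
    outputs = []
    for k in range(n_aux):
        inputs = [original] + outputs[:k]  # maps available to extractor k+1
        outputs.append(extract(inputs))
    return outputs

# Stub extractor that just reports how many maps each step fused:
print(run_layer("orig", 3, len))  # [1, 2, 3]
```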
  • each feature extractor is numbered in Figure 1, where 0_0, 1_0, 2_0, 3_0 and 4_0 identify the main feature extractors, and the remaining labels are all auxiliary feature extractors.
  • for the feature extractor 1_2, the input is the feature map of 1_0, the feature map of 1_1 (both have the same length and width as the feature map of 1_2), and the feature map of 0_2 (the feature map of 0_2 is down-sampled once so that its length and width are halved).
  • in general, the input of each feature extractor is the output feature maps of all the feature extractors to its left in the same layer, together with the down-sampled feature map of the corresponding extractor in the layer above.
  • the method of fusion is channel addition.
  • for the feature extractor 1_4, the input is (1_0, 1_1, 1_2, 1_3, the down-sampled 0_4, and the up-sampled 2_3).
  • for the uppermost layer, for example 0_5, the input is (0_0, 0_1, 0_2, 0_3, 0_4, and the up-sampled 1_4).
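The channel-addition fusion with the down-/up-sampling just described can be sketched with nested lists standing in for single-channel feature maps. The helper names and the nearest-neighbour resampling are illustrative assumptions, not the patent's stated method:

```python
def downsample(fm):
    """Halve length and width (nearest neighbour: keep every other value)."""
    return [row[::2] for row in fm[::2]]

def upsample(fm):
    """Double length and width by repeating each value."""
    return [[v for v in row for _ in (0, 1)] for row in fm for _ in (0, 1)]

def fuse(maps):
    """Channel addition: element-wise sum of equally sized feature maps."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) for c in range(w)] for r in range(h)]

# e.g. feature extractor 1_2 fuses same-size maps from 1_0 and 1_1 with
# the once-down-sampled map from 0_2:
map_0_2 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
map_1_0 = [[1, 1], [1, 1]]
map_1_1 = [[2, 2], [2, 2]]
print(fuse([map_1_0, map_1_1, downsample(map_0_2)]))  # [[4, 6], [12, 14]]
```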
  • the main feature extractors use ResNet to perform the initial feature extraction.
  • the subsequent auxiliary feature extractors enrich the feature maps extracted by the main feature extractors; the fused result is then output to subsequent computation.
  • the shallow feature extractors extract features such as position, shape, and size; it is understandable that, for example, the feature extractors of the first N/2 layers can be regarded as shallow feature extractors.
  • the deep feature extractors process the features of the upper layer and of the current layer according to a preset feature matrix.
  • correspondingly, the feature extractors of the last N/2 layers can be regarded as deep feature extractors.
  • the feature map extracted in the shallow layer corresponds to the feature map extracted in the first layer
  • the feature map extracted in the deep layer is the feature map extracted from the remaining layers.
  • the feature map extracted in the shallow layer corresponds to the feature map extracted in the first two layers
  • the feature map extracted in the deep layer is the feature map extracted from the remaining layers.
  • a single target image is obtained; a neural network is constructed, and the neural network is used to perform multiple feature extraction and fusion on the target image to obtain a depth map of the target image.
  • the feature map of each layer can be extracted by setting multi-scale, multi-level feature extractors in the neural network, and multiple feature maps can then be fused to obtain a multi-scale, multi-level depth map, so that users can use the depth map for three-dimensional modeling or simulation, which makes it convenient to perform complex three-dimensional image processing based on a single image.
  • the invention greatly reduces the equipment cost by processing a single image to obtain a depth image.
  • FIG. 4 is a schematic structural diagram of an image fusion provided by an embodiment of the present invention.
  • an apparatus 200 for acquiring a depth map provided by an embodiment of the present invention, wherein the apparatus 200 includes an acquiring unit 201 and a construction unit 202;
  • the obtaining unit 201 is used to obtain a single target image
  • the construction unit 202 is configured to construct a neural network, and the neural network is used to perform multiple feature extraction and fusion on the target image to obtain a depth map of the target image.
  • the neural network includes N layers, and each layer includes a cascaded main feature extractor, an extraction and fusion module, and a fusion output device, where N is a positive integer greater than 1;
  • the main feature extractor of the first layer is used to perform feature extraction on the target image, and output the obtained feature map to the main feature extractor of the second layer and the extraction and fusion module and the fusion output device of the first layer;
  • the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer, and outputs the obtained feature map to the fusion output device of the first layer and the extraction and fusion module and fusion output device of the second layer;
  • the main feature extractor of the i-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (i-1)th layer, and output the obtained feature map to the main feature extractor of the (i+1)th layer and the extraction and fusion module and fusion output device of the i-th layer, where i is an integer and 1 < i < N;
  • the extraction and fusion module of the i-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)th layer and the feature map output by the main feature extractor of the i-th layer, and output the obtained feature map to the fusion output device of the i-th layer and the extraction and fusion module and fusion output device of the (i+1)th layer;
  • the main feature extractor of the Nth layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)th layer, and output the obtained feature map to the extraction and fusion module and fusion output device of the Nth layer;
  • the extraction and fusion module of the Nth layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)th layer and the feature map output by the main feature extractor of the Nth layer, and output the obtained feature map to the fusion output device of the Nth layer;
  • the fusion output device of the Nth layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the Nth layer, the feature map output by the extraction and fusion module of the Nth layer, and the feature map output by the extraction and fusion module of the (N-1)th layer, and output the obtained feature map to the fusion output device of the (N-1)th layer;
  • the fusion output device of the i-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)th layer, and the feature map output by the fusion output device of the (i+1)th layer, and output the obtained feature map to the fusion output device of the (i-1)th layer;
  • the fusion output device of the first layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion output device of the second layer, to obtain a depth map of the target image.
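As a compact check of the fusion output device wiring in the bullets above (the labels are illustrative placeholders, not tensors or the patent's terms):

```python
def fusion_inputs(i, n):
    """Feature maps fused by the i-th layer's fusion output device
    in an n-layer network (illustrative labels)."""
    inputs = [f"main_{i}", f"extract_fuse_{i}"]
    if i > 1:
        inputs.append(f"extract_fuse_{i-1}")  # from the layer above
    if i < n:
        inputs.append(f"fused_{i+1}")         # from the layer below
    return inputs

print(fusion_inputs(2, 3))
# ['main_2', 'extract_fuse_2', 'extract_fuse_1', 'fused_3']
```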
  • the extraction and fusion module of the jth layer includes N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N;
  • the first auxiliary feature extractor of the first layer is used to perform feature extraction on the feature map output by the main feature extractor of the first layer, and output the obtained feature map to the second to Nth auxiliary feature extractors and the fusion output device of the first layer, and to the first auxiliary feature extractor of the second layer;
  • the k-th auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (k-1)th auxiliary feature extractors of the first layer, and output the obtained feature map to the (k+1)th to Nth auxiliary feature extractors and the fusion output device of the first layer, and to the k-th auxiliary feature extractor of the second layer, where k is an integer and 1 < k < N;
  • the Nth auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (N-1)th auxiliary feature extractors of the first layer, and output the obtained feature map to the fusion output device of the first layer and the fusion output device of the second layer.
  • the first auxiliary feature extractor of the m-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain a first feature map; receive the second feature map output by the first auxiliary feature extractor of the (m-1)th layer; fuse the first feature map with the second feature map to obtain a third feature map; and output the third feature map to the second to nth auxiliary feature extractors and the fusion output device of the m-th layer, and to the first auxiliary feature extractor of the (m+1)th layer, where m is a positive integer and 1 < m < N-1, and n is an integer and n = N+1-m;
  • the x-th auxiliary feature extractor of the m-th layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (x-1)th auxiliary feature extractors of the m-th layer, and output the obtained feature map to the (x+1)th to nth auxiliary feature extractors and the fusion output device of the m-th layer, and to the x-th auxiliary feature extractor of the (m+1)th layer, where x is an integer and 1 < x < n.
  • the first auxiliary feature extractor of the (N-1)th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)th layer to obtain a fourth feature map; receive the fifth feature map output by the first auxiliary feature extractor of the (N-2)th layer; fuse the fourth feature map with the fifth feature map to obtain a sixth feature map; and output the sixth feature map to the fusion output device of the (N-1)th layer.
  • the above-mentioned units may be used to execute the method described in any of the above-mentioned embodiments.
  • FIG. 5 is a schematic structural diagram of an electronic device 300 provided by an embodiment of the present application.
  • the electronic device 300 includes an application processor 310, a memory 320, a communication interface 330, and one or more programs 321, where the one or more programs 321 are stored in the memory 320 and configured to be executed by the application processor 310.
  • the processor 310 performs the following operations:
  • a neural network is constructed, and the neural network is used to perform multiple feature extraction and fusion on the target image to obtain a depth map of the target image.
  • the main feature extractor of the first layer is used to perform feature extraction on the target image, and output the obtained feature map to the main feature extractor of the second layer and the extraction and fusion module and the fusion output device of the first layer;
  • the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer, and outputs the obtained feature map to the fusion output device of the first layer and the extraction and fusion module and fusion output device of the second layer;
  • the main feature extractor of the i-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (i-1)th layer, and output the obtained feature map to the main feature extractor of the (i+1)th layer and the extraction and fusion module and fusion output device of the i-th layer, where i is an integer and 1 < i < N;
  • the extraction and fusion module of the i-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)th layer and the feature map output by the main feature extractor of the i-th layer, and output the obtained feature map to the fusion output device of the i-th layer and the extraction and fusion module and fusion output device of the (i+1)th layer;
  • the main feature extractor of the Nth layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)th layer, and output the obtained feature map to the extraction and fusion module and fusion output device of the Nth layer;
  • the extraction and fusion module of the Nth layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)th layer and the feature map output by the main feature extractor of the Nth layer, and output the obtained feature map to the fusion output device of the Nth layer;
  • the fusion output device of the Nth layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the Nth layer, the feature map output by the extraction and fusion module of the Nth layer, and the feature map output by the extraction and fusion module of the (N-1)th layer, and output the obtained feature map to the fusion output device of the (N-1)th layer;
  • the fusion output device of the i-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)th layer, and the feature map output by the fusion output device of the (i+1)th layer, and output the obtained feature map to the fusion output device of the (i-1)th layer;
  • the fusion output device of the first layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion output device of the second layer, to obtain a depth map of the target image.
  • the first auxiliary feature extractor of the first layer is used to perform feature extraction on the feature map output by the main feature extractor of the first layer, and output the obtained feature map to the second to Nth auxiliary feature extractors and the fusion output device of the first layer, and to the first auxiliary feature extractor of the second layer;
  • the k-th auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (k-1)th auxiliary feature extractors of the first layer, and output the obtained feature map to the (k+1)th to Nth auxiliary feature extractors and the fusion output device of the first layer, and to the k-th auxiliary feature extractor of the second layer, where k is an integer and 1 < k < N;
  • the Nth auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (N-1)th auxiliary feature extractors of the first layer, and output the obtained feature map to the fusion output device of the first layer and the fusion output device of the second layer.
  • a storage medium stores program code.
  • when the program code is executed, the method in the foregoing method embodiment is executed.
  • in another embodiment, a computer program product contains program code; when the program code is executed, the method in the foregoing method embodiment is executed.

Abstract

A method and apparatus for obtaining a depth-of-field image, the method comprising: obtaining a single target image (S101); and constructing a neural network which is used for performing multiple feature extraction and fusion on the target image to obtain a depth-of-field image of the target image (S102). In the method, on the basis of a single image obtained by an ordinary camera, a feature image of each layer may be extracted by means of setting a multi-scale and multi-level feature extractor in the neural network, and then a plurality of feature images may be fused so as to obtain a multi-scale and multi-level depth-of-field image, thereby facilitating users in utilizing the depth-of-field image for three-dimensional modeling or simulation, and further facilitating the users in performing complex three-dimensional image processing on the basis of the single image. At the same time, the described method greatly reduces device costs by means of processing a single image to obtain a depth-of-field image.

Description

A Method and Device for Acquiring a Depth Map

This application claims priority to a Chinese patent application filed with the Chinese Patent Office on May 7, 2019, with application number 201910377551.2 and invention title "A method and device for acquiring a depth map", the entire content of which is incorporated into this application by reference.

Technical Field

The present invention relates to the field of computers, and in particular to a method and device for acquiring a depth map.
Background

With the development of science and technology, users have ever higher requirements for photos; for example, physical techniques are used to improve resolution and color contrast to enhance user perception.

In the prior art, depth-of-field information is often obtained with special cameras, binocular cameras, laser ranging, or similar equipment, whose cost is relatively high. In addition, current algorithms for constructing depth maps suffer from a single feature extraction structure, so the extracted feature information is rather limited, which is not conducive to constructing complex three-dimensional images.
Summary of the Invention

The embodiments of the present invention provide a method and device for acquiring a depth map. Using the method provided by the present invention, depth-map information can be obtained from a single image captured by an ordinary camera, by setting multi-scale, multi-level feature extractors in a deep-learning network architecture.

A first aspect of the present invention discloses a method for acquiring a depth map, the method including:

obtaining a single target image; and

constructing a neural network, the neural network being used to perform multiple feature extraction and fusion on the target image to obtain a depth map of the target image.
Optionally, the neural network includes N layers, each layer including a cascaded main feature extractor, an extraction and fusion module, and a fusion output device, where N is a positive integer greater than 1;

the main feature extractor of the first layer is used to perform feature extraction on the target image, and output the obtained feature map to the main feature extractor of the second layer and the extraction and fusion module and fusion output device of the first layer;

the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer, and outputs the obtained feature map to the fusion output device of the first layer and the extraction and fusion module and fusion output device of the second layer;

the main feature extractor of the i-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (i-1)th layer, and output the obtained feature map to the main feature extractor of the (i+1)th layer and the extraction and fusion module and fusion output device of the i-th layer, where i is an integer and 1 < i < N;

the extraction and fusion module of the i-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)th layer and the feature map output by the main feature extractor of the i-th layer, and output the obtained feature map to the fusion output device of the i-th layer and the extraction and fusion module and fusion output device of the (i+1)th layer;

the main feature extractor of the Nth layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)th layer, and output the obtained feature map to the extraction and fusion module and fusion output device of the Nth layer;

the extraction and fusion module of the Nth layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)th layer and the feature map output by the main feature extractor of the Nth layer, and output the obtained feature map to the fusion output device of the Nth layer;

the fusion output device of the Nth layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the Nth layer, the feature map output by the extraction and fusion module of the Nth layer, and the feature map output by the extraction and fusion module of the (N-1)th layer, and output the obtained feature map to the fusion output device of the (N-1)th layer;

the fusion output device of the i-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)th layer, and the feature map output by the fusion output device of the (i+1)th layer, and output the obtained feature map to the fusion output device of the (i-1)th layer;

the fusion output device of the first layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion output device of the second layer, to obtain a depth map of the target image.
In addition, it should be pointed out that cascading means that multiple components or functional modules are connected in series, with the output of each component or functional module serving as the input of the next; a depth map is an image that can represent the distance of every object in the photographed scene from the camera.
It should be pointed out that the extraction and fusion module of the j-th layer includes N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N;

the first auxiliary feature extractor of the first layer is used to perform feature extraction on the feature map output by the main feature extractor of the first layer, and output the obtained feature map to the second to Nth auxiliary feature extractors and the fusion output device of the first layer, and to the first auxiliary feature extractor of the second layer;

the k-th auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (k-1)th auxiliary feature extractors of the first layer, and output the obtained feature map to the (k+1)th to Nth auxiliary feature extractors and the fusion output device of the first layer, and to the k-th auxiliary feature extractor of the second layer, where k is an integer and 1 < k < N;

the Nth auxiliary feature extractor of the first layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (N-1)th auxiliary feature extractors of the first layer, and output the obtained feature map to the fusion output device of the first layer and the fusion output device of the second layer.
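The count stated above (the j-th layer's module holds N+1-j auxiliary extractors, so layer 1 has N and layer N has one) can be checked with a one-line sketch; the function name is an illustrative assumption:

```python
def aux_count(n_layers, j):
    """Number of auxiliary feature extractors in the j-th layer's
    extraction and fusion module: N + 1 - j."""
    return n_layers + 1 - j

# For N = 4: layer 1 has 4 extractors, layer 4 has exactly 1.
print([aux_count(4, j) for j in range(1, 5)])  # [4, 3, 2, 1]
```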
Optionally, the first auxiliary feature extractor of the m-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain a first feature map; receive the second feature map output by the first auxiliary feature extractor of the (m-1)th layer; fuse the first feature map with the second feature map to obtain a third feature map; and output the third feature map to the second to nth auxiliary feature extractors and the fusion output device of the m-th layer, and to the first auxiliary feature extractor of the (m+1)th layer, where m is a positive integer and 1 < m < N-1, and n is an integer and n = N+1-m;

the x-th auxiliary feature extractor of the m-th layer is used to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (x-1)th auxiliary feature extractors of the m-th layer, and output the obtained feature map to the (x+1)th to nth auxiliary feature extractors and the fusion output device of the m-th layer, and to the x-th auxiliary feature extractor of the (m+1)th layer, where x is an integer and 1 < x < n.

Optionally, the first auxiliary feature extractor of the (N-1)th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)th layer to obtain a fourth feature map; receive the fifth feature map output by the first auxiliary feature extractor of the (N-2)th layer; fuse the fourth feature map with the fifth feature map to obtain a sixth feature map; and output the sixth feature map to the fusion output device of the (N-1)th layer.
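A schematic forward pass over the N-layer structure described in this aspect, with strings standing in for tensors so only the data flow is visible. The main/fuse names are hypothetical, and each layer's extraction and fusion module is folded into its fuse step for brevity:

```python
def forward(n_layers):
    # Main extractors run top-down: each layer refines the previous map.
    main = ["main1(target)"]
    for i in range(2, n_layers + 1):
        main.append("main%d(%s)" % (i, main[-1]))

    # Fusion output devices run bottom-up: layer N fuses its own map,
    # and every layer above also fuses the result rising from below.
    fused = "fuse%d(%s)" % (n_layers, main[-1])
    for i in range(n_layers - 1, 0, -1):
        fused = "fuse%d(%s, %s)" % (i, main[i - 1], fused)
    return fused  # the layer-1 fusion output is the depth map

print(forward(2))  # fuse1(main1(target), fuse2(main2(main1(target))))
```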
A second aspect of the present invention discloses a device for acquiring a depth map, the device including an acquiring unit and a construction unit;

the acquiring unit is used to obtain a single target image;

the construction unit is used to construct a neural network, the neural network being used to perform multiple feature extraction and fusion on the target image to obtain a depth map of the target image.
The neural network includes N layers, each layer including a cascaded main feature extractor, an extraction and fusion module, and a fusion output device, where N is a positive integer greater than 1;

the main feature extractor of the first layer is used to perform feature extraction on the target image, and output the obtained feature map to the main feature extractor of the second layer and the extraction and fusion module and fusion output device of the first layer;

the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer, and outputs the obtained feature map to the fusion output device of the first layer and the extraction and fusion module and fusion output device of the second layer;

the main feature extractor of the i-th layer is used to perform feature extraction on the feature map output by the main feature extractor of the (i-1)th layer, and output the obtained feature map to the main feature extractor of the (i+1)th layer and the extraction and fusion module and fusion output device of the i-th layer, where i is an integer and 1 < i < N;

the extraction and fusion module of the i-th layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)th layer and the feature map output by the main feature extractor of the i-th layer, and output the obtained feature map to the fusion output device of the i-th layer and the extraction and fusion module and fusion output device of the (i+1)th layer;

the main feature extractor of the Nth layer is used to perform feature extraction on the feature map output by the main feature extractor of the (N-1)th layer, and output the obtained feature map to the extraction and fusion module and fusion output device of the Nth layer;

the extraction and fusion module of the Nth layer is used to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)th layer and the feature map output by the main feature extractor of the Nth layer, and output the obtained feature map to the fusion output device of the Nth layer;

the fusion output device of the Nth layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the Nth layer, the feature map output by the extraction and fusion module of the Nth layer, and the feature map output by the extraction and fusion module of the (N-1)th layer, and output the obtained feature map to the fusion output device of the (N-1)th layer;

the fusion output device of the i-th layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)th layer, and the feature map output by the fusion output device of the (i+1)th layer, and output the obtained feature map to the fusion output device of the (i-1)th layer;

the fusion output device of the first layer is used to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion output device of the second layer, to obtain a depth map of the target image.
The extraction-and-fusion module of the j-th layer comprises N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N;

The first auxiliary feature extractor of the first layer is configured to perform feature extraction on the feature map output by the main feature extractor of the first layer and to output the resulting feature map to the second to N-th auxiliary feature extractors and the fusion output unit of the first layer and to the first auxiliary feature extractor of the second layer;

The k-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (k-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the (k+1)-th to N-th auxiliary feature extractors and the fusion output unit of the first layer and to the k-th auxiliary feature extractor of the second layer, where k is an integer and 1 < k < N;

The N-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (N-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the fusion output unit of the first layer and to the fusion output unit of the second layer.
Optionally, the first auxiliary feature extractor of the m-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain a first feature map; to receive a second feature map output by the first auxiliary feature extractor of the (m-1)-th layer; to fuse the first feature map with the second feature map to obtain a third feature map; and to output the third feature map to the second to n-th auxiliary feature extractors and the fusion output unit of the m-th layer and to the first auxiliary feature extractor of the (m+1)-th layer, where m is a positive integer with 1 < m < N-1, and n is an integer with n = N+1-m;

The x-th auxiliary feature extractor of the m-th layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (x-1)-th auxiliary feature extractors of the m-th layer, and to output the resulting feature map to the (x+1)-th to n-th auxiliary feature extractors and the fusion output unit of the m-th layer and to the x-th auxiliary feature extractor of the (m+1)-th layer, where x is an integer and 1 < x < n.

Optionally, the first auxiliary feature extractor of the (N-1)-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer to obtain a fourth feature map; to receive a fifth feature map output by the first auxiliary feature extractor of the (N-2)-th layer; to fuse the fourth feature map with the fifth feature map to obtain a sixth feature map; and to output the sixth feature map to the fusion output unit of the (N-1)-th layer.
A third aspect of the present invention discloses a storage medium storing program code which, when run, causes the method of the first aspect to be performed;

A fourth aspect of the present invention discloses an image fusion apparatus comprising a processor and a transceiver, where the transceiving functions described in the second aspect may be implemented by the transceiver, and the logic functions described in the second aspect (that is, the specific functions of the logic units) may be implemented by the processor;

A fifth aspect of the present invention discloses a computer program product containing program code which, when run, causes the method of the first aspect to be performed.
It can be seen that the embodiments of the present invention disclose a method for obtaining a depth map: a single target image is acquired, and a neural network is constructed that performs multiple rounds of feature extraction and fusion on the target image to obtain its depth map. With this method, multi-scale, multi-level feature extractors in the neural network extract a feature map at each layer from a single image captured by an ordinary camera, and the feature maps are then fused to obtain a multi-scale, multi-level depth image. This makes it convenient for users to perform three-dimensional modeling or simulation with the depth image, and hence to carry out complex three-dimensional image processing based on a single image. At the same time, by obtaining the depth image from a single processed image, the present invention greatly reduces equipment cost.
Description of the Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an image fusion network architecture according to an embodiment of the present invention;

FIG. 1a is a schematic diagram of a deep residual network according to an embodiment of the present invention;

FIG. 2 is a depth prediction image according to an embodiment of the present invention;

FIG. 3 is a schematic flowchart of a method for obtaining a depth map according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus for obtaining a depth map according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the physical structure of an apparatus for obtaining a depth map according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention provide a method and apparatus for obtaining a depth map. The method includes: acquiring a single target image; and constructing a neural network that performs multiple rounds of feature extraction and fusion on the target image to obtain its depth map. With this method, multi-scale, multi-level feature extractors in the neural network extract a feature map at each layer from a single image captured by an ordinary camera, and the feature maps are then fused to obtain a multi-scale, multi-level depth image. This makes it convenient for users to perform three-dimensional modeling or simulation with the depth image, and hence to carry out complex three-dimensional image processing based on a single image. At the same time, by obtaining the depth image from a single processed image, the present invention greatly reduces equipment cost.
First, it should be pointed out that the single target image acquired in the present invention may be an RGB image, a grayscale image, a binarized image, or another image type. By setting up multi-scale, multi-level feature extractors in a deep learning network architecture, this embodiment solves the problem mentioned in the background: it is difficult to extract rich image features from a single image captured by an ordinary camera. In principle, the technique can be applied to personal computers (PCs) as well as to small devices, including but not limited to Android/iOS mobile devices. An RGB image is an image obtained from the variations of the red (R), green (G), and blue (B) color channels and their mutual superposition. The RGB standard covers almost all colors perceivable by human vision and is one of the most widely used color systems at present.
The network structure shown in FIG. 1 (the figure is only schematic; the present invention does not limit the number of layers or the number of auxiliary feature extractors) mainly comprises three stages: first, preprocessing the target image so that it meets the requirements of the main feature extractor; second, extracting feature maps with the main and auxiliary feature extractors; and third, fusing the extracted feature maps.
For example, the main feature extractor may be a ResNet50 structure, known as a deep residual network, which performs feature extraction on the image. Specifically, feature extraction obtains information such as object textures, object contours, and edges between objects; the extraction is the result of learning from sample inputs through the feature extractors of each layer. Alternatively, the main feature extractor may be an AlexNet structure (a machine learning model named after its author, Alex) or a VGG structure (a model proposed by the Visual Geometry Group at the University of Oxford); the present invention is not limited in this respect.
FIG. 1a is a schematic diagram of a deep residual network. Network depth affects a model's classification and recognition performance: when a conventional network is stacked to a certain depth, the deeper the layers, the more pronounced the vanishing-gradient problem becomes, and the network classifies poorly. The deep residual network structure can deepen the network while preventing gradients from vanishing, achieving better classification results. As shown in FIG. 1a, the deep residual network contains skip structures. For example, suppose the input of a network segment is x and the desired output is H(x), i.e., H(x) is the desired complex underlying mapping; learning such a mapping directly makes training relatively difficult. Moreover, if the accuracy has already saturated (or the error of the deeper layers is found to grow), the learning target is instead switched to an identity mapping, that is, making the output approximate the input x, so that accuracy does not degrade in the later layers. In the structure of FIG. 1a, a "shortcut connection" passes the input x directly to the output as an initial result, so the output becomes H(x) = F(x) + x; when F(x) = 0, H(x) = x, which is the identity mapping mentioned above. ResNet thus changes the learning target: instead of learning a complete output, it learns the difference between the target value H(x) and x, i.e., the residual F(x) := H(x) - x. The subsequent training goal is to drive the residual toward 0, so that accuracy does not drop as the network deepens. This skip-style residual structure breaks the convention of traditional neural networks that the output of layer n-1 can only be fed to layer n as input, allowing the output of a layer to skip several layers and serve directly as the input of a later layer. Its significance is to offer a new direction for the problem that stacking more layers causes the error rate of the whole learning model to rise rather than fall.
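The relationship H(x) = F(x) + x described above can be sketched in a few lines of illustrative Python; this is only a numeric stand-in, where the function f represents the learned convolutional branch of a real residual block:

```python
def residual_block(x, f):
    """Skip connection: the block's output is H(x) = F(x) + x."""
    return f(x) + x

# When the residual branch learns F(x) = 0, the block degenerates to the
# identity mapping H(x) = x, which is what keeps accuracy from degrading
# as layers are stacked.
identity_out = residual_block(5.0, lambda v: 0.0)   # 5.0

# A non-zero residual branch only has to model the difference H(x) - x.
shifted_out = residual_block(4.0, lambda v: 0.5 * v)  # 4.0 + 2.0 = 6.0
```

The design choice is visible even in this toy form: the branch f only needs to learn the correction to its input, so an "unnecessary" layer can safely learn zero instead of having to reproduce a full identity transform.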
For example, image preprocessing means preprocessing the input image to fit the input of the deep residual network ResNet50. Specifically, preprocessing essentially scales and crops the image.
For example, the auxiliary feature extractor is used to fuse the feature map extracted by the previous feature extractor in the same layer, the upsampled feature map from the lower layer, and the downsampled feature map from the upper layer, to obtain a new feature map.
For example, the auxiliary feature extractor contains several convolution structures; its role is to fuse the feature map extracted by the previous feature extractor in the same layer with the downsampled feature map from the upper layer, forming a new feature map. If, as in FIG. 1, there are four auxiliary feature extractors, extractors 1-4 share the same structure, with convolution kernels of the same size and number. The auxiliary feature extractor consists of two convolutional layers and one activation layer, with 3×3 convolution kernels.
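A minimal pure-Python sketch of that structure follows: two 3×3 "same" convolutions followed by a ReLU activation, on a single channel. The single channel, zero padding, and identity kernel are simplifications for illustration and are not specified by the patent:

```python
def conv3x3(img, kernel):
    """Zero-padded 'same' 3x3 convolution on a 2D list of floats."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        s += img[ii][jj] * kernel[di + 1][dj + 1]
            out[i][j] = s
    return out

def relu(img):
    """Activation layer: clamp negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in img]

def aux_extractor(fused):
    """Two 3x3 convolutional layers followed by one activation layer,
    mirroring the auxiliary-extractor structure described above."""
    # Identity kernel, chosen only so the sketch's output is predictable;
    # a trained extractor would use learned kernel weights.
    k = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    return relu(conv3x3(conv3x3(fused, k), k))
```

With the identity kernel and non-negative inputs, the extractor simply passes the map through, which makes the wiring easy to verify before substituting real weights.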
As shown in FIG. 1, the network structure for image feature collection in this embodiment is divided into four layers, each using one main feature extractor and several auxiliary feature extractors. The number of auxiliary feature extractors decreases layer by layer from the first-layer network to the fourth-layer network: four in the first layer, three in the second, two in the third, and one in the fourth. It should be noted that the auxiliary feature extractors of the second layer reuse any three of the four auxiliary feature extractors of the first layer; likewise, the third layer reuses any two of the first layer's four, and so on. Feature maps 4_0, 3_2, 2_3, 1_4, and 0_5 respectively correspond to the results of fusing, layer by layer, the feature maps extracted by the main feature extractor with those extracted by the auxiliary feature extractors, and feature map 0_5 is the final depth prediction map (the depth prediction image shown in FIG. 2). Fusion is performed by element-wise (layer) addition; the sum then passes through two 3×3 convolutions and one activation layer, and the result is sent to the layer above and fused with that layer's image features. With respect to FIG. 2, it can be understood simply as: the closer an object is to the camera, the darker its color; the farther away, the lighter.
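The two fusion primitives just described, layer addition and handing the result up to a larger-resolution layer, can be sketched in pure Python; nearest-neighbour 2× upsampling is an illustrative assumption here, since the patent does not name the upsampling method:

```python
def add_maps(a, b):
    """Element-wise (layer) addition of two equally sized feature maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling, so the fused result matches the
    spatial size of the layer above before being fused with its features."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

fused = add_maps([[1.0, 2.0]], [[3.0, 4.0]])  # layer addition
sent_up = upsample2x(fused)                    # passed to the layer above
```

In the real network the two 3×3 convolutions and the activation from the auxiliary-extractor structure would be applied between the addition and the upsampling.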
Please refer to FIG. 3, which is a schematic flowchart of a method for obtaining a depth map according to an embodiment of the present invention. As shown in FIG. 3, the method includes the following steps:
S101: acquire a single target image.
It is understandable that the execution subject of this embodiment may be a smartphone, a wearable device, an electronic device with a camera function, a personal computer, or a similar device. In this embodiment, a smartphone is taken as an example.
It should be pointed out that the target image may be downloaded from the network, received from another electronic device, or captured through a lens.
The target image may be a single RGB image.
S102: construct a neural network, where the neural network is used to perform multiple rounds of feature extraction and fusion on the target image to obtain a depth map of the target image.
Here, a depth map is an image that represents the distance from the camera of every object in the captured scene.
It should be pointed out that the neural network includes N layers, each comprising a cascaded main feature extractor, extraction-and-fusion module, and fusion output unit, where N is a positive integer greater than 1. "Cascaded" means that multiple components or functional modules are connected in series, with the output of one component or module serving as the input of the next.
It should also be noted that the more layers the neural network contains, the richer the extracted features; however, as the number of extractions grows, the feature maps become smaller and smaller, until further extraction yields no effective features. Moreover, more feature extractors mean more network parameters, which slows the network, raises hardware requirements, and increases cost. Conversely, with fewer feature extractors, model speed improves and cost falls, but model accuracy drops because fewer feature maps are extracted. The number of layers may therefore be determined according to the capability of the execution subject, set by system default, or chosen manually; no limitation is imposed here. A specific layering is chosen below for illustration.
In addition, before constructing the neural network, the method further includes: determining whether the size of the target image exceeds a threshold; and, if the size of the target image exceeds the threshold, preprocessing the target image to obtain a processed target image.
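The patent specifies neither the threshold nor the resizing rule, so the following is only a sketch under assumed values: a 224×224 limit (a common ResNet-style input size) and proportional downscaling, with any exact-size cropping left to the real pipeline:

```python
THRESHOLD = (224, 224)  # assumed input limit; not taken from the patent

def exceeds_threshold(size, threshold=THRESHOLD):
    """True when either dimension of (width, height) is over the limit."""
    return size[0] > threshold[0] or size[1] > threshold[1]

def rescale(size, threshold=THRESHOLD):
    """Proportionally downscale so the image fits within the threshold.
    A real preprocessing step would also crop to the exact target size."""
    if not exceeds_threshold(size, threshold):
        return size
    scale = min(threshold[0] / size[0], threshold[1] / size[1])
    return (round(size[0] * scale), round(size[1] * scale))
```

For instance, a 448×224 image would be scaled by 0.5 to 224×112, while an image already within the threshold is returned unchanged.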
It should also be pointed out that the main feature extractor is used to extract at least one of the following kinds of information from the target image: object textures, object contours, and edges between objects.
Specifically, the roles of the main feature extractor, the extraction-and-fusion module, and the fusion output unit of each layer are as follows:
The main feature extractor of the first layer is configured to perform feature extraction on the target image and to output the resulting feature map to the main feature extractor of the second layer and to the extraction-and-fusion module and the fusion output unit of the first layer. The extraction-and-fusion module of the first layer is configured to perform feature extraction on the feature map output by the main feature extractor of the first layer and to output the resulting feature map to the fusion output unit of the first layer and to the extraction-and-fusion module and the fusion output unit of the second layer.

The main feature extractor of the i-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (i-1)-th layer and to output the resulting feature map to the main feature extractor of the (i+1)-th layer and to the extraction-and-fusion module and the fusion output unit of the i-th layer, where i is an integer and 1 < i < N. The extraction-and-fusion module of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction-and-fusion module of the (i-1)-th layer and the feature map output by the main feature extractor of the i-th layer, and to output the resulting feature map to the fusion output unit of the i-th layer and to the extraction-and-fusion module and the fusion output unit of the (i+1)-th layer. The main feature extractor of the N-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer and to output the resulting feature map to the extraction-and-fusion module and the fusion output unit of the N-th layer.

The extraction-and-fusion module of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction-and-fusion module of the (N-1)-th layer and the feature map output by the main feature extractor of the N-th layer, and to output the resulting feature map to the fusion output unit of the N-th layer.

The fusion output unit of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the N-th layer, the feature map output by the extraction-and-fusion module of the N-th layer, and the feature map output by the extraction-and-fusion module of the (N-1)-th layer, and to output the resulting feature map to the fusion output unit of the (N-1)-th layer.

The fusion output unit of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction-and-fusion module of the i-th layer, the feature map output by the extraction-and-fusion module of the (i-1)-th layer, and the feature map output by the fusion output unit of the (i+1)-th layer, and to output the resulting feature map to the fusion output unit of the (i-1)-th layer.

The fusion output unit of the first layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction-and-fusion module of the first layer, and the feature map output by the fusion output unit of the second layer, to obtain the depth map of the target image.
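The dataflow above can be sanity-checked with a pure-Python sketch that tracks only feature-map sizes; N = 4 and a 256×256 input are assumed for illustration, with each main feature extractor halving the spatial size and each fusion output unit handing a 2× upsampled result to the layer above, as in the embodiment:

```python
def halve(size):
    """Each main feature extractor halves the height and width."""
    return (size[0] // 2, size[1] // 2)

def double(size):
    """Upsampling restores the resolution of the layer above."""
    return (size[0] * 2, size[1] * 2)

N = 4
target = (256, 256)  # assumed input size for the sketch

# Top-down pass: each layer's main feature extractor halves the previous map.
main_sizes, s = [], target
for _ in range(N):
    s = halve(s)
    main_sizes.append(s)

# Bottom-up pass: each fusion output unit works at its own layer's scale and
# hands an upsampled result to the fusion output unit of the layer above.
s = main_sizes[N - 1]
for i in range(N - 2, -1, -1):
    s = double(s)
    assert s == main_sizes[i]  # scales line up layer by layer

depth_map_size = s  # scale at which the first layer's unit emits the depth map
```

The sketch shows why the architecture is called multi-scale: the four layers operate at 128², 64², 32², and 16² for a 256² input, and the bottom-up fusion chain walks back up through exactly those scales.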
It should further be pointed out that the extraction-and-fusion module of the j-th layer includes N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N.
Specifically, the roles of the auxiliary feature extractors of each layer are as follows:
The first auxiliary feature extractor of the first layer is configured to perform feature extraction on the feature map output by the main feature extractor of the first layer and to output the resulting feature map to the second to N-th auxiliary feature extractors and the fusion output unit of the first layer and to the first auxiliary feature extractor of the second layer.

The k-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (k-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the (k+1)-th to N-th auxiliary feature extractors and the fusion output unit of the first layer and to the k-th auxiliary feature extractor of the second layer, where k is an integer and 1 < k < N.

The N-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (N-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the fusion output unit of the first layer and to the fusion output unit of the second layer.
It should further be pointed out that the first auxiliary feature extractor of the m-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain a first feature map; to receive a second feature map output by the first auxiliary feature extractor of the (m-1)-th layer; to fuse the first feature map with the second feature map to obtain a third feature map; and to output the third feature map to the second to n-th auxiliary feature extractors and the fusion output unit of the m-th layer and to the first auxiliary feature extractor of the (m+1)-th layer, where m is a positive integer with 1 < m < N-1, and n is an integer with n = N+1-m.

The x-th auxiliary feature extractor of the m-th layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first to (x-1)-th auxiliary feature extractors of the m-th layer, and to output the resulting feature map to the (x+1)-th to n-th auxiliary feature extractors and the fusion output unit of the m-th layer and to the x-th auxiliary feature extractor of the (m+1)-th layer, where x is an integer and 1 < x < n.
It should further be pointed out that the first auxiliary feature extractor of the (N-1)-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer to obtain a fourth feature map; to receive a fifth feature map output by the first auxiliary feature extractor of the (N-2)-th layer; to fuse the fourth feature map with the fifth feature map to obtain a sixth feature map; and to output the sixth feature map to the fusion output unit of the (N-1)-th layer.
In addition, it should be pointed out that "multi-scale" means that the fused images obtained at each layer have different sizes. For example, suppose there are 3 layers in total and M = N-1, where N is the number of layers and M is the number of auxiliary feature extractors; according to this formula, the number of auxiliary feature extractors is 2. Let the length and width of the target image be X and Y, respectively. Each pass through a main feature extractor scales the length and width to 1/2 of the input image. That is, the length and width of the first layer's original image are X/2 and Y/2, respectively (this "original image" being the image already processed by that layer's main feature extractor); likewise, the length and width of the second layer's original image are X/4 and Y/4, and those of the third layer's original image are X/8 and Y/8. Since the third layer is the last layer, the original image of size X/8 by Y/8 is the feature image of the third layer. The second layer has one auxiliary feature extractor, and the first layer has two; it can be understood that the number of auxiliary feature extractors decreases as the layer number increases. The auxiliary feature extractors of the first layer process the feature maps already extracted in the same layer. For example, suppose the first layer's auxiliary feature extractors are numbered 1 and 2. Auxiliary feature extractor 1 processes the original image to obtain feature map 1; the first layer's original image and feature map 1 are then input into auxiliary feature extractor 2 to obtain feature map 2. The first layer's feature map can then be obtained from feature map 2 and the second layer's feature map.
Since the second layer has only one auxiliary feature extractor, that extractor takes as input the second layer's original image and the first layer's feature map 1 to obtain feature map 3, and the second layer's feature map is then obtained from feature map 3, the third layer's original image, and feature map 2. It should be pointed out that the way image features are extracted differs from layer to layer. This embodiment is illustrated with one main feature extractor and four auxiliary feature extractors, divided into 5 layers and involving four rounds of feature extraction. As shown in Fig. 1, each feature extractor in Fig. 1 is numbered: 0_0, 1_0, 2_0, 3_0 and 4_0 identify the main feature extractors, and the remaining labels are all auxiliary feature extractors. It should be pointed out that, from 0_0 through 1_0, 2_0, 3_0 and 4_0, the length and width of the feature images are successively halved. For example, if the input corresponding to 0_0 has size 1×1, its output is 1/2 × 1/2; feeding this 1/2 × 1/2 map into 1_0, 1_0 outputs 1/4 × 1/4; scaling step by step in the same way, 4_0 outputs 1/32 × 1/32, because the image passes through 5 main feature extractors, each reducing it by a factor of 1/2, and multiplying the five factors of 1/2 together gives 1/32.
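The size bookkeeping above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (the function name and structure are ours, not part of the disclosed network): each main feature extractor halves the height and width, so five extractors yield a cumulative scale of (1/2)^5 = 1/32.

```python
def main_extractor_scales(num_extractors: int) -> list[float]:
    """Cumulative scale factor after each main feature extractor,
    assuming every pass halves the height and width."""
    scales = []
    factor = 1.0
    for _ in range(num_extractors):
        factor *= 0.5  # one main extractor pass: halve length and width
        scales.append(factor)
    return scales

scales = main_extractor_scales(5)
print(scales)  # [0.5, 0.25, 0.125, 0.0625, 0.03125], i.e. 1/2 ... 1/32

# For a target image of X x Y, the feature map after extractor i has size
# (X * scales[i]) x (Y * scales[i]); e.g. X = 256 gives 128, 64, 32, 16, 8.
```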
In addition, regarding image fusion: taking 1_2 as an example, its inputs are the feature map of 1_0, the feature map of 1_1 (both of which have the same length and width as the feature map of 1_2), and the feature map of 0_2 (the feature map of 0_2 is down-sampled once on its way to 1_2, halving its length and width). Except for the topmost layer 0_x (x = 0, 1, 2, 3, 4, 5), every intermediate feature extractor takes as input the output feature maps of all the feature extractors to its left in the same layer, together with the down-sampled feature map from the corresponding position in the layer above. In addition, it should be pointed out that the fusion is performed by channel-wise addition.
As a further example, the inputs of feature extractor 1_4 are (1_0, 1_1, 1_2, 1_3, the down-sampled 0_4, and the up-sampled 2_3), whereas a topmost extractor such as 0_5 takes (0_0, 0_1, 0_2, 0_3, 0_4, and the up-sampled 1_4) as input. In other words, the main feature extractors use ResNet to perform the first round of feature extraction, and the subsequent auxiliary feature extractors enrich the feature maps produced by the main extraction; the fused result is then output to the subsequent computations.
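The input wiring of the extractor grid can be made concrete with a small hypothetical helper. The function below follows the fuller 1_4 and 0_5 examples just given (the earlier 1_2 example does not mention an up-sampled term, so this sketch generalizes from the 1_4 case); `down(...)`/`up(...)` stand for the down- and up-sampling steps, and the `row_col` labels mirror the numbering in Fig. 1.

```python
def fusion_inputs(row: int, col: int, num_rows: int) -> list[str]:
    """List the feature maps fused by extractor row_col: all left
    neighbours in the same row, the down-sampled output of the row above
    at the same column, and the up-sampled output of the row below at the
    previous column (when those rows exist)."""
    inputs = [f"{row}_{c}" for c in range(col)]    # same-row outputs to the left
    if row > 0:
        inputs.append(f"down({row - 1}_{col})")    # from the layer above
    if row < num_rows - 1:
        inputs.append(f"up({row + 1}_{col - 1})")  # from the layer below
    return inputs

print(fusion_inputs(1, 4, 3))
# ['1_0', '1_1', '1_2', '1_3', 'down(0_4)', 'up(2_3)']
print(fusion_inputs(0, 5, 2))
# ['0_0', '0_1', '0_2', '0_3', '0_4', 'up(1_4)']
```

The two printed cases reproduce the 1_4 and 0_5 input lists stated in the text.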
In addition, it should be pointed out that down-sampling uses max pooling and up-sampling uses bilinear interpolation. Both methods are commonly used and are not described in detail here.
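For concreteness, the two resampling operations can be sketched on plain nested lists. This is only an illustration under stated assumptions (2×2 pooling window, 2× bilinear up-sampling with corner-aligned coordinate mapping); a real implementation would use a framework's pooling and interpolation primitives.

```python
def max_pool_2x2(img):
    """Halve height and width by taking the max of each 2x2 block."""
    h, w = len(img), len(img[0])
    return [[max(img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def upsample_bilinear_2x(img):
    """Double height and width by bilinear interpolation (corner-aligned)."""
    h, w = len(img), len(img[0])
    out_h, out_w = h * 2, w * 2
    out = []
    for i in range(out_h):
        # Map the output coordinate back into the input grid.
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0, y1, fy = int(y), min(int(y) + 1, h - 1), y - int(y)
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0, x1, fx = int(x), min(int(x) + 1, w - 1), x - int(x)
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

feature = [[1, 2], [3, 4]]
print(max_pool_2x2(feature))  # [[4]]
up = upsample_bilinear_2x(feature)
print(len(up), len(up[0]))    # 4 4
```

Note that max pooling discards three of every four values, while bilinear up-sampling preserves the corner values and interpolates between them.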
In addition, it should further be pointed out that the shallow feature extractors extract features such as position, shape, and size; it can be understood that, for example, the feature extractors of the first N/2 layers may all be regarded as shallow feature extractors. The deep feature extractors, by contrast, obtain their features by processing the feature samples of the layer above and of the current layer based on a preset feature matrix; it can be understood that, for example, the feature extractors of the last N/2 layers may all be regarded as deep feature extractors. For instance, the shallowly extracted feature maps may correspond to the feature maps extracted by the first layer, with the deeply extracted feature maps being those extracted by all remaining layers; alternatively, the shallowly extracted feature maps may correspond to those extracted by the first two layers, with the deeply extracted feature maps being those extracted by the remaining layers.
It can be seen that, through the technical solutions disclosed in the embodiments of the present invention, a single target image is obtained and a neural network is constructed, the neural network being configured to perform feature extraction and fusion on the target image multiple times to obtain a depth map of the target image. Using the method provided by the present invention, based on a single image captured by an ordinary camera, multi-scale, multi-level feature extractors arranged in the neural network extract the feature image of each layer, and the multiple feature images are then fused to obtain a multi-scale, multi-level depth image. This makes it convenient for users to use the depth image for three-dimensional modeling or simulation, and thus facilitates complex three-dimensional image processing based on a single image. At the same time, by obtaining the depth image from a single image, the present invention greatly reduces equipment cost.
Please refer to Fig. 4, which is a schematic structural diagram of image fusion provided by an embodiment of the present invention. As shown in Fig. 4, an embodiment of the present invention provides a depth map acquisition apparatus 200, where the apparatus 200 includes an acquisition unit 201 and a construction unit 202.
The acquisition unit 201 is configured to acquire a single target image.
The construction unit 202 is configured to construct a neural network, the neural network being configured to perform feature extraction and fusion on the target image multiple times to obtain a depth map of the target image.
The neural network includes N layers, each layer including a cascaded main feature extractor, an extraction and fusion module, and a fusion outputter, where N is a positive integer greater than 1.
The main feature extractor of the first layer is configured to perform feature extraction on the target image and to output the resulting feature map to the main feature extractor of the second layer and to the extraction and fusion module and fusion outputter of the first layer;
the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer and outputs the resulting feature map to the fusion outputter of the first layer and to the extraction and fusion module and fusion outputter of the second layer;
the main feature extractor of the i-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (i-1)-th layer and to output the resulting feature map to the main feature extractor of the (i+1)-th layer and to the extraction and fusion module and fusion outputter of the i-th layer, where i is an integer with 1 < i < N;
the extraction and fusion module of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)-th layer and the feature map output by the main feature extractor of the i-th layer, and to output the resulting feature map to the fusion outputter of the i-th layer and to the extraction and fusion module and fusion outputter of the (i+1)-th layer;
the main feature extractor of the N-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer and to output the resulting feature map to the extraction and fusion module and fusion outputter of the N-th layer;
the extraction and fusion module of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)-th layer and the feature map output by the main feature extractor of the N-th layer, and to output the resulting feature map to the fusion outputter of the N-th layer;
the fusion outputter of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the N-th layer, the feature map output by the extraction and fusion module of the N-th layer, and the feature map output by the extraction and fusion module of the (N-1)-th layer, and to output the resulting feature map to the fusion outputter of the (N-1)-th layer;
the fusion outputter of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)-th layer, and the feature map output by the fusion outputter of the (i+1)-th layer, and to output the resulting feature map to the fusion outputter of the (i-1)-th layer;
the fusion outputter of the first layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion outputter of the second layer, so as to obtain the depth map of the target image.
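The cascade described in the preceding paragraphs can be summarized in a highly simplified sketch. Everything here is an illustrative stand-in: `extract` replaces the real (ResNet-based) feature extraction with the identity, and `fuse` replaces channel-wise fusion with a plain sum, purely to show the wiring of main extractors, extraction and fusion modules, and fusion outputters for N = 3 layers.

```python
def extract(x):
    """Stand-in for a real feature extractor (identity here)."""
    return x

def fuse(*maps):
    """Stand-in for channel-wise fusion (plain sum here)."""
    return sum(maps)

def forward(target, n_layers=3):
    """Wire up the N-layer cascade: a top-down pass through the main
    extractors and extraction-and-fusion modules, then a bottom-up pass
    through the fusion outputters, ending at layer 1 (the depth map)."""
    main = [None] * n_layers
    module = [None] * n_layers
    main[0] = extract(target)                 # layer 1 main extractor
    module[0] = extract(main[0])              # layer 1 module
    for i in range(1, n_layers):              # top-down pass
        main[i] = extract(main[i - 1])
        module[i] = fuse(extract(module[i - 1]), extract(main[i]))
    # Layer N fusion outputter: main N, module N, module N-1.
    out = fuse(main[-1], module[-1], module[-2])
    # Intermediate fusion outputters: main i, module i, module i-1,
    # plus the output of the fusion outputter below.
    for i in range(n_layers - 2, 0, -1):      # bottom-up pass
        out = fuse(main[i], module[i], module[i - 1], out)
    # Layer 1 fusion outputter produces the depth map.
    return fuse(main[0], module[0], out)

print(forward(1.0))
```

With these stand-ins the numbers are meaningless, but the call order matches the data flow of the paragraphs above.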
The extraction and fusion module of the j-th layer includes N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N;
the first auxiliary feature extractor of the first layer is configured to perform feature extraction on the feature map output by the main feature extractor of the first layer and to output the resulting feature map to the second through N-th auxiliary feature extractors and the fusion outputter of the first layer, as well as to the first auxiliary feature extractor of the second layer;
the k-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (k-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the (k+1)-th through N-th auxiliary feature extractors and the fusion outputter of the first layer, as well as to the k-th auxiliary feature extractor of the second layer, where k is an integer with 1 < k < N;
the N-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (N-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the fusion outputters of the first and second layers.
The first auxiliary feature extractor of the m-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain a first feature map; to receive a second feature map output by the first auxiliary feature extractor of the (m-1)-th layer; to fuse the first feature map with the second feature map to obtain a third feature map; and to output the third feature map to the second through n-th auxiliary feature extractors and the fusion outputter of the m-th layer, as well as to the first auxiliary feature extractor of the (m+1)-th layer, where m is a positive integer with 1 < m < N-1, and n is an integer with n = N+1-m;
the x-th auxiliary feature extractor of the m-th layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (x-1)-th auxiliary feature extractors of the m-th layer, and to output the resulting feature map to the (x+1)-th through n-th auxiliary feature extractors and the fusion outputter of the m-th layer, as well as to the x-th auxiliary feature extractor of the (m+1)-th layer, where x is an integer with 1 < x < n.
The first auxiliary feature extractor of the (N-1)-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer to obtain a fourth feature map; to receive a fifth feature map output by the first auxiliary feature extractor of the (N-2)-th layer; to fuse the fourth feature map with the fifth feature map to obtain a sixth feature map; and to output the sixth feature map to the fusion outputter of the (N-1)-th layer.
The above units may be used to execute the method described in any of the foregoing embodiments; for a detailed description, refer to the description of the method in Embodiment 1, which is not repeated here.
Consistent with the embodiments shown in Fig. 3 and Fig. 4, please refer to Fig. 5, which is a schematic structural diagram of an electronic device 300 provided by an embodiment of the present application. As shown in the figure, the electronic device 300 includes an application processor 310, a memory 320, a communication interface 330, and one or more programs 321, where the one or more programs 321 are stored in the memory 320 and configured to be executed by the application processor 310. When the one or more programs 321 are run, the processor 310 performs the following operations:
acquiring a single target image;
constructing a neural network, the neural network being configured to perform feature extraction and fusion on the target image multiple times to obtain a depth map of the target image.
The main feature extractor of the first layer is configured to perform feature extraction on the target image and to output the resulting feature map to the main feature extractor of the second layer and to the extraction and fusion module and fusion outputter of the first layer;
the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer and outputs the resulting feature map to the fusion outputter of the first layer and to the extraction and fusion module and fusion outputter of the second layer;
the main feature extractor of the i-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (i-1)-th layer and to output the resulting feature map to the main feature extractor of the (i+1)-th layer and to the extraction and fusion module and fusion outputter of the i-th layer, where i is an integer with 1 < i < N;
the extraction and fusion module of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)-th layer and the feature map output by the main feature extractor of the i-th layer, and to output the resulting feature map to the fusion outputter of the i-th layer and to the extraction and fusion module and fusion outputter of the (i+1)-th layer;
the main feature extractor of the N-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer and to output the resulting feature map to the extraction and fusion module and fusion outputter of the N-th layer;
the extraction and fusion module of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)-th layer and the feature map output by the main feature extractor of the N-th layer, and to output the resulting feature map to the fusion outputter of the N-th layer;
the fusion outputter of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the N-th layer, the feature map output by the extraction and fusion module of the N-th layer, and the feature map output by the extraction and fusion module of the (N-1)-th layer, and to output the resulting feature map to the fusion outputter of the (N-1)-th layer;
the fusion outputter of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)-th layer, and the feature map output by the fusion outputter of the (i+1)-th layer, and to output the resulting feature map to the fusion outputter of the (i-1)-th layer;
the fusion outputter of the first layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion outputter of the second layer, so as to obtain the depth map of the target image.
The first auxiliary feature extractor of the first layer is configured to perform feature extraction on the feature map output by the main feature extractor of the first layer and to output the resulting feature map to the second through N-th auxiliary feature extractors and the fusion outputter of the first layer, as well as to the first auxiliary feature extractor of the second layer;
the k-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (k-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the (k+1)-th through N-th auxiliary feature extractors and the fusion outputter of the first layer, as well as to the k-th auxiliary feature extractor of the second layer, where k is an integer with 1 < k < N;
the N-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (N-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the fusion outputters of the first and second layers.
In another embodiment of the present invention, a storage medium is disclosed. The storage medium stores program code; when the program code is run, the method in the foregoing method embodiments is executed.
In another embodiment of the present invention, a computer program product is disclosed. The computer program product contains program code; when the program code is run, the method in the foregoing method embodiments is executed.

Claims (10)

  1. A method for acquiring a depth map, characterized in that the method comprises:
    acquiring a single target image;
    constructing a neural network, the neural network being configured to perform feature extraction and fusion on the target image multiple times to obtain a depth map of the target image.
  2. The method according to claim 1, characterized in that the neural network includes N layers, each layer including a cascaded main feature extractor, an extraction and fusion module, and a fusion outputter, where N is a positive integer greater than 1;
    the main feature extractor of the first layer is configured to perform feature extraction on the target image and to output the resulting feature map to the main feature extractor of the second layer and to the extraction and fusion module and fusion outputter of the first layer;
    the extraction and fusion module of the first layer performs feature extraction on the feature map output by the main feature extractor of the first layer and outputs the resulting feature map to the fusion outputter of the first layer and to the extraction and fusion module and fusion outputter of the second layer;
    the main feature extractor of the i-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (i-1)-th layer and to output the resulting feature map to the main feature extractor of the (i+1)-th layer and to the extraction and fusion module and fusion outputter of the i-th layer, where i is an integer with 1 < i < N;
    the extraction and fusion module of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (i-1)-th layer and the feature map output by the main feature extractor of the i-th layer, and to output the resulting feature map to the fusion outputter of the i-th layer and to the extraction and fusion module and fusion outputter of the (i+1)-th layer;
    the main feature extractor of the N-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer and to output the resulting feature map to the extraction and fusion module and fusion outputter of the N-th layer;
    the extraction and fusion module of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction and fusion module of the (N-1)-th layer and the feature map output by the main feature extractor of the N-th layer, and to output the resulting feature map to the fusion outputter of the N-th layer;
    the fusion outputter of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the N-th layer, the feature map output by the extraction and fusion module of the N-th layer, and the feature map output by the extraction and fusion module of the (N-1)-th layer, and to output the resulting feature map to the fusion outputter of the (N-1)-th layer;
    the fusion outputter of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction and fusion module of the i-th layer, the feature map output by the extraction and fusion module of the (i-1)-th layer, and the feature map output by the fusion outputter of the (i+1)-th layer, and to output the resulting feature map to the fusion outputter of the (i-1)-th layer;
    the fusion outputter of the first layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction and fusion module of the first layer, and the feature map output by the fusion outputter of the second layer, so as to obtain the depth map of the target image.
  3. The method according to claim 2, characterized in that the extraction and fusion module of the j-th layer includes N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N;
    the first auxiliary feature extractor of the first layer is configured to perform feature extraction on the feature map output by the main feature extractor of the first layer and to output the resulting feature map to the second through N-th auxiliary feature extractors and the fusion outputter of the first layer, as well as to the first auxiliary feature extractor of the second layer;
    the k-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (k-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the (k+1)-th through N-th auxiliary feature extractors and the fusion outputter of the first layer, as well as to the k-th auxiliary feature extractor of the second layer, where k is an integer with 1 < k < N;
    the N-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (N-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the fusion outputters of the first and second layers.
  4. The method according to claim 3, wherein:
    the first auxiliary feature extractor of the m-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain a first feature map; to receive a second feature map output by the first auxiliary feature extractor of the (m-1)-th layer; to fuse the first feature map with the second feature map to obtain a third feature map; and to output the third feature map to the second through n-th auxiliary feature extractors and the fusion output unit of the m-th layer, and to the first auxiliary feature extractor of the (m+1)-th layer, where m is a positive integer and 1 < m < N-1, and n is an integer with n = N+1-m;
    the x-th auxiliary feature extractor of the m-th layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (x-1)-th auxiliary feature extractors of the m-th layer, and to output the resulting feature map to the (x+1)-th through n-th auxiliary feature extractors and the fusion output unit of the m-th layer, and to the x-th auxiliary feature extractor of the (m+1)-th layer, where x is an integer and 1 < x < n.
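Claims 4 and 5 add the cross-layer link for the first auxiliary extractor: besides extracting from its own layer's main output, it fuses in the map handed down by the previous layer's first auxiliary extractor. A small sketch of that behaviour, with hypothetical `extract`/`fuse` stand-ins:

```python
import numpy as np

def first_aux(main_fmap, upstream_fmap, extract, fuse):
    """First auxiliary extractor of layer m: extract the "first feature
    map" from this layer's main output; if layer m-1 handed down a
    "second feature map", fuse the two into the "third feature map"."""
    first = extract(main_fmap)
    if upstream_fmap is None:            # layer 1 has no upstream layer
        return first
    return fuse(first, upstream_fmap)

# Hypothetical stand-ins for the unspecified operators:
extract = lambda f: 2.0 * f
fuse = lambda a, b: a + b

layer1 = first_aux(np.ones((2, 2)), None, extract, fuse)
layer2 = first_aux(np.ones((2, 2)), layer1, extract, fuse)
```

The `None` branch models layer 1, which per claim 3 has no upstream auxiliary extractor to receive from.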
  5. The method according to claim 3 or 4, wherein:
    the first auxiliary feature extractor of the (N-1)-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer to obtain a fourth feature map; to receive a fifth feature map output by the first auxiliary feature extractor of the (N-2)-th layer; to fuse the fourth feature map with the fifth feature map to obtain a sixth feature map; and to output the sixth feature map to the fusion output unit of the (N-1)-th layer.
  6. A depth map acquisition apparatus, wherein the apparatus comprises an acquisition unit and a construction unit;
    the acquisition unit is configured to acquire a single target image; the construction unit is configured to construct a neural network, and the neural network is configured to perform feature extraction and fusion on the target image multiple times to obtain a depth map of the target image.
  7. The apparatus according to claim 6, wherein the neural network comprises N layers, each layer comprising a cascaded main feature extractor, extraction-and-fusion module, and fusion output unit, where N is a positive integer greater than 1;
    the main feature extractor of the first layer is configured to perform feature extraction on the target image, and to output the resulting feature map to the main feature extractor of the second layer and to the extraction-and-fusion module and the fusion output unit of the first layer;
    the extraction-and-fusion module of the first layer is configured to perform feature extraction on the feature map output by the main feature extractor of the first layer, and to output the resulting feature map to the fusion output unit of the first layer and to the extraction-and-fusion module and the fusion output unit of the second layer;
    the main feature extractor of the i-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (i-1)-th layer, and to output the resulting feature map to the main feature extractor of the (i+1)-th layer and to the extraction-and-fusion module and the fusion output unit of the i-th layer, where i is an integer and 1 < i < N;
    the extraction-and-fusion module of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction-and-fusion module of the (i-1)-th layer and the feature map output by the main feature extractor of the i-th layer, and to output the resulting feature map to the fusion output unit of the i-th layer and to the extraction-and-fusion module and the fusion output unit of the (i+1)-th layer;
    the main feature extractor of the N-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer, and to output the resulting feature map to the extraction-and-fusion module and the fusion output unit of the N-th layer;
    the extraction-and-fusion module of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the extraction-and-fusion module of the (N-1)-th layer and the feature map output by the main feature extractor of the N-th layer, and to output the resulting feature map to the fusion output unit of the N-th layer;
    the fusion output unit of the N-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the N-th layer, the feature map output by the extraction-and-fusion module of the N-th layer, and the feature map output by the extraction-and-fusion module of the (N-1)-th layer, and to output the resulting feature map to the fusion output unit of the (N-1)-th layer;
    the fusion output unit of the i-th layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the i-th layer, the feature map output by the extraction-and-fusion module of the i-th layer, the feature map output by the extraction-and-fusion module of the (i-1)-th layer, and the feature map output by the fusion output unit of the (i+1)-th layer, and to output the resulting feature map to the fusion output unit of the (i-1)-th layer;
    the fusion output unit of the first layer is configured to perform feature extraction and fusion on the feature map output by the main feature extractor of the first layer, the feature map output by the extraction-and-fusion module of the first layer, and the feature map output by the fusion output unit of the second layer, to obtain the depth map of the target image.
  8. The apparatus according to claim 7, wherein the extraction-and-fusion module of the j-th layer comprises N+1-j auxiliary feature extractors, where j is an integer and 1 ≤ j ≤ N;
    the first auxiliary feature extractor of the first layer is configured to perform feature extraction on the feature map output by the main feature extractor of the first layer, and to output the resulting feature map to the second through N-th auxiliary feature extractors and the fusion output unit of the first layer, and to the first auxiliary feature extractor of the second layer;
    the k-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (k-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the (k+1)-th through N-th auxiliary feature extractors and the fusion output unit of the first layer, and to the k-th auxiliary feature extractor of the second layer, where k is an integer and 1 < k < N;
    the N-th auxiliary feature extractor of the first layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (N-1)-th auxiliary feature extractors of the first layer, and to output the resulting feature map to the fusion output units of the first layer and of the second layer.
  9. The apparatus according to claim 8, wherein:
    the first auxiliary feature extractor of the m-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the m-th layer to obtain a first feature map; to receive a second feature map output by the first auxiliary feature extractor of the (m-1)-th layer; to fuse the first feature map with the second feature map to obtain a third feature map; and to output the third feature map to the second through n-th auxiliary feature extractors and the fusion output unit of the m-th layer, and to the first auxiliary feature extractor of the (m+1)-th layer, where m is a positive integer and 1 < m < N-1, and n is an integer with n = N+1-m;
    the x-th auxiliary feature extractor of the m-th layer is configured to perform feature extraction and fusion on the feature maps output by the main feature extractor and the first through (x-1)-th auxiliary feature extractors of the m-th layer, and to output the resulting feature map to the (x+1)-th through n-th auxiliary feature extractors and the fusion output unit of the m-th layer, and to the x-th auxiliary feature extractor of the (m+1)-th layer, where x is an integer and 1 < x < n.
  10. The apparatus according to claim 8 or 9, wherein:
    the first auxiliary feature extractor of the (N-1)-th layer is configured to perform feature extraction on the feature map output by the main feature extractor of the (N-1)-th layer to obtain a fourth feature map; to receive a fifth feature map output by the first auxiliary feature extractor of the (N-2)-th layer; to fuse the fourth feature map with the fifth feature map to obtain a sixth feature map; and to output the sixth feature map to the fusion output unit of the (N-1)-th layer.
PCT/CN2019/121603 2019-05-07 2019-11-28 Method and apparatus for obtaining depth-of-field image WO2020224244A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910377551.2 2019-05-07
CN201910377551.2A CN110223334B (en) 2019-05-07 2019-05-07 Depth-of-field map acquisition method and device

Publications (1)

Publication Number Publication Date
WO2020224244A1 true WO2020224244A1 (en) 2020-11-12

Family

ID=67820707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121603 WO2020224244A1 (en) 2019-05-07 2019-11-28 Method and apparatus for obtaining depth-of-field image

Country Status (2)

Country Link
CN (1) CN110223334B (en)
WO (1) WO2020224244A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223334B (en) * 2019-05-07 2021-09-14 深圳云天励飞技术有限公司 Depth-of-field map acquisition method and device

Citations (4)

Publication number Priority date Publication date Assignee Title
US7983486B2 (en) * 2007-08-29 2011-07-19 Seiko Epson Corporation Method and apparatus for automatic image categorization using image texture
CN105488534A (en) * 2015-12-04 2016-04-13 中国科学院深圳先进技术研究院 Method, device and system for deeply analyzing traffic scene
CN106981080A (en) * 2017-02-24 2017-07-25 东华大学 Night unmanned vehicle scene depth method of estimation based on infrared image and radar data
CN110223334A (en) * 2019-05-07 2019-09-10 深圳云天励飞技术有限公司 A kind of depth of field picture capturing method and device

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US10699151B2 (en) * 2016-06-03 2020-06-30 Miovision Technologies Incorporated System and method for performing saliency detection using deep active contours
US9990728B2 (en) * 2016-09-09 2018-06-05 Adobe Systems Incorporated Planar region guided 3D geometry estimation from a single image
CN107563390A (en) * 2017-08-29 2018-01-09 苏州智萃电子科技有限公司 A kind of image-recognizing method and system
CN108335322B (en) * 2018-02-01 2021-02-12 深圳市商汤科技有限公司 Depth estimation method and apparatus, electronic device, program, and medium
CN109308483B (en) * 2018-07-11 2021-09-17 南京航空航天大学 Dual-source image feature extraction and fusion identification method based on convolutional neural network
CN109087349B (en) * 2018-07-18 2021-01-26 亮风台(上海)信息科技有限公司 Monocular depth estimation method, device, terminal and storage medium
CN109461177B (en) * 2018-09-29 2021-12-10 浙江科技学院 Monocular image depth prediction method based on neural network


Also Published As

Publication number Publication date
CN110223334B (en) 2021-09-14
CN110223334A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN109685819B (en) Three-dimensional medical image segmentation method based on feature enhancement
CN111160375B (en) Three-dimensional key point prediction and deep learning model training method, device and equipment
CN111179419B (en) Three-dimensional key point prediction and deep learning model training method, device and equipment
WO2021129181A1 (en) Portrait segmentation method, model training method and electronic device
CN110473137A (en) Image processing method and device
US20210398252A1 (en) Image denoising method and apparatus
CN111860398B (en) Remote sensing image target detection method and system and terminal equipment
CN106650615B (en) A kind of image processing method and terminal
CN111989689A (en) Method for identifying objects within an image and mobile device for performing the method
EP4047509A1 (en) Facial parsing method and related devices
CN111340077B (en) Attention mechanism-based disparity map acquisition method and device
CN110245621B (en) Face recognition device, image processing method, feature extraction model, and storage medium
CN111709268B (en) Human hand posture estimation method and device based on human hand structure guidance in depth image
Zhou et al. A lightweight hand gesture recognition in complex backgrounds
CN111797882A (en) Image classification method and device
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
WO2021179822A1 (en) Human body feature point detection method and apparatus, electronic device, and storage medium
CN112749576B (en) Image recognition method and device, computing equipment and computer storage medium
CN113869282A (en) Face recognition method, hyper-resolution model training method and related equipment
WO2020224244A1 (en) Method and apparatus for obtaining depth-of-field image
US20230153965A1 (en) Image processing method and related device
KR20180092453A (en) Face recognition method Using convolutional neural network and stereo image
CN114913339B (en) Training method and device for feature map extraction model
CN106845550B (en) Image identification method based on multiple templates

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19927779

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19927779

Country of ref document: EP

Kind code of ref document: A1