JP2023502793A

JP2023502793A - Method, device and storage medium for generating panoramic image with depth information

Info

Publication number: JP2023502793A
Application number: JP2022554963A
Authority: JP
Inventors: ウェンボーシー; シャオドング; チーフイパン; チェンリンリウ
Original assignee: Realsee Beijing Technology Co Ltd
Current assignee: Realsee Beijing Technology Co Ltd
Priority date: 2019-11-19
Filing date: 2020-11-11
Publication date: 2023-01-25
Anticipated expiration: 2040-11-11
Also published as: CN111105347B; CN111105347A; WO2021098567A1

Abstract

A method and an apparatus for generating a panoramic image with depth information are provided. The method may include: acquiring a two-dimensional image of a current scene based on spherical projection; forming an intermediate image by joining a preset number of the two-dimensional images end to end in the horizontal direction; determining depth information of the intermediate image using a trained neural network model for predicting image depth; and cropping the intermediate image based on its middle position in the horizontal direction to obtain an image whose length equals that of the two-dimensional image, determining the depth information of the cropped image, and taking the cropped image with depth information as the panoramic image of the current scene. [Selected drawing] FIG. 1

Description

TECHNICAL FIELD The present disclosure relates to the technical field of three-dimensional model reconstruction, and in particular to a method, an apparatus, and a storage medium for generating a panoramic image with depth information.

Three-dimensional model reconstruction plays an important role in fields such as industrial inspection, quality control, and machine vision. In three-dimensional reconstruction of indoor and outdoor scenes, depth data is needed to form point clouds for modeling, and point clouds captured by a sensor at different positions must be stitched together based on their distance information. However, acquiring depth data for three-dimensional reconstruction usually requires expensive dedicated depth sensors, such as structured-light sensors or lasers based on the time-of-flight principle. This cost is too high for large-scale industrial use.

An embodiment of the present disclosure provides a method for generating a panoramic image with depth information. The method includes: acquiring a two-dimensional image of a current scene based on spherical projection; forming an intermediate image by joining a preset number of the two-dimensional images end to end in the horizontal direction; determining depth information of the intermediate image using a trained neural network model for predicting image depth; and cropping the intermediate image based on its middle position in the horizontal direction to obtain an image whose length equals that of the two-dimensional image, determining the depth information of the cropped image, and taking the cropped image with depth information as the panoramic image of the current scene.

In one example of the present disclosure, the neural network model for predicting image depth is trained by: taking a pre-trained convolutional neural network as an initial backbone network; structurally adjusting the initial backbone network according to the U-Net network structure; and training the adjusted initial backbone network with a plurality of color three-dimensional images with depth information to generate the neural network model for predicting image depth.

In one example of the present disclosure, the pre-trained convolutional neural network is a DenseNet network pre-trained on ImageNet. Structurally adjusting the initial backbone network according to the U-Net network structure includes:
removing the fully connected layer of the DenseNet network;
based on the U-Net network structure, adding a plurality of upsampling layers after the last layer of the DenseNet network from which the fully connected layer has been removed, and setting a superposition layer corresponding to each of the upsampling layers, wherein the number of input channels of each upsampling layer is a preset multiple of its number of output channels, and wherein each added upsampling layer upsamples its input by a preset resolution multiple, superimposes the upsampling result on the output data of its corresponding superposition layer, performs at least one convolution operation on the superimposed result, and linearly rectifies the convolution result using a preset activation function; and
performing, on the output of the last of the upsampling layers, one convolution operation that outputs depth information and one convolution operation that outputs confidence information.

In one example of the present disclosure, the DenseNet network is a DenseNet-169 network and the plurality of upsampling layers is four upsampling layers. The number of input channels of each upsampling layer is twice its number of output channels. Each added upsampling layer upsamples its input to twice the resolution, and the at least one convolution operation may be two convolution operations.

In one example of the present disclosure, taking the four upsampling layers in order from first to last, the superposition layers corresponding to them are, in sequence, pool3_pool, pool2_pool, pool1, and conv1/relu.

In one example of the present disclosure, before the adjusted initial backbone network is trained with the plurality of color three-dimensional images with depth information, the method according to an embodiment of the present disclosure further includes:
preprocessing the plurality of color three-dimensional images with depth information, wherein, if during preprocessing at least one of the images is determined to contain a hole, that is, a pixel whose depth information cannot be determined, no hole-filling is performed on the hole.

In one example of the present disclosure, the adjusted initial backbone network is trained with the plurality of color three-dimensional images with depth information using a supervised learning method, wherein the loss function used for depth estimation is a function of the depth estimate and the confidence estimate that the neural network model produces for each pixel.

In one example of the present disclosure, after the two-dimensional image of the current scene based on spherical projection is acquired, the method according to an embodiment of the present disclosure further includes: in response to determining that the two-dimensional image contains a viewing-angle blind area, filling the viewing-angle blind area with black.

In one example of the present disclosure, the method according to an embodiment of the present disclosure further includes trimming the top and bottom edges of the intermediate image after the intermediate image has been formed by joining the preset number of the two-dimensional images end to end in the horizontal direction.

In one example of the present disclosure, trimming the top and bottom edges of the intermediate image includes trimming, from each of the top edge and the bottom edge of the intermediate image, a strip whose height is a preset ratio of the height of the intermediate image.

In one example of the present disclosure, the preset number is 3 and the preset ratio is 15%.

An embodiment of the present disclosure further provides an apparatus for generating a panoramic image with depth information. The apparatus includes: an acquisition unit for acquiring a two-dimensional image of a current scene based on spherical projection; a stitching unit for forming an intermediate image by joining a preset number of the two-dimensional images end to end in the horizontal direction; a processing unit for determining depth information of the intermediate image using a trained neural network model for predicting image depth; and a cropping unit for cropping the intermediate image based on its middle position in the horizontal direction to obtain an image whose length equals that of the two-dimensional image, determining the depth information of the cropped image, and taking the cropped image with depth information as the panoramic image of the current scene.

In one example of the present disclosure, the neural network model for predicting image depth is trained by:
taking a pre-trained convolutional neural network as an initial backbone network, structurally adjusting the initial backbone network according to the U-Net network structure, and training the adjusted initial backbone network with a plurality of color three-dimensional images with depth information to generate the neural network model for predicting image depth.

In one example of the present disclosure, the pre-trained convolutional neural network is a DenseNet-169 network pre-trained on ImageNet.

In one example of the present disclosure, the processing unit includes:
a removal subunit for removing the fully connected layer of the DenseNet-169 network;
an adding subunit for, based on the U-Net network structure, adding four upsampling layers after the last layer of the DenseNet-169 network from which the fully connected layer has been removed, and setting a superposition layer corresponding to each of the four upsampling layers, wherein the number of input channels of each upsampling layer is twice its number of output channels, and wherein each added upsampling layer upsamples its input to twice the resolution, superimposes the upsampling result on the output data of its corresponding superposition layer, performs two consecutive convolution operations on the superimposed result, and linearly rectifies the convolution result using a preset activation function; and
a computing subunit for performing, on the output of the last of the four upsampling layers, one convolution operation that outputs depth information and one convolution operation that outputs confidence information.

In one example of the present disclosure, taking the four upsampling layers in order from first to last, the superposition layers corresponding to them are, in sequence, pool3_pool, pool2_pool, pool1, and conv1/relu.

In one example of the present disclosure, the processing unit is further configured to:
preprocess the plurality of color three-dimensional images with depth information and, if during preprocessing at least one of the images is determined to contain a hole, that is, a pixel whose depth information cannot be determined, perform no hole-filling on the hole.

In one example of the present disclosure, the adjusted initial backbone network is trained with the plurality of color three-dimensional images with depth information using a supervised learning method, wherein the loss function used for depth estimation is a function of the depth estimate and the confidence estimate that the neural network model produces for each pixel.

In one example of the present disclosure, the acquisition unit is further configured to fill a viewing-angle blind area with black in response to determining that the two-dimensional image contains the viewing-angle blind area.

In one example of the present disclosure, the stitching unit is further configured to trim the top and bottom edges of the intermediate image.

In one example of the present disclosure, the stitching unit is further configured to trim, from each of the top edge and the bottom edge of the intermediate image, a strip whose height is a preset ratio of the height of the intermediate image.

In one example of the present disclosure, the preset number is 3 and the preset ratio is 15%.

An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing a computer program that includes instructions which, when executed by a processor of an electronic device, cause the processor to perform the above method for generating a panoramic image with depth information.

An embodiment of the present disclosure further provides an electronic device including a memory and a processor. The memory stores a computer program executable by the processor, and the processor, when executing the program, implements the method for generating a panoramic image with depth information described above.

With the technical solution according to the embodiments of the present disclosure, after a two-dimensional image of the current scene based on spherical projection is acquired, a plurality of the two-dimensional images are joined end to end in the horizontal direction to form an intermediate image. A trained neural network model for predicting image depth is then used to determine the depth information of the intermediate image. The intermediate image is cropped based on its middle position in the horizontal direction to obtain an image whose length equals that of the two-dimensional image. Because the depth information of the intermediate image has been determined, the image cropped from it carries the corresponding depth information and can serve as the panoramic image of the current scene with depth information. Applying this solution yields a panoramic image of the current scene with depth information without using a depth camera, which reduces cost. In addition, because depth information is determined for an intermediate image formed by joining a plurality of two-dimensional images, no information is missing where the ends of the two-dimensional image meet, so the determined depth information is more accurate.

The following drawings do not limit the scope of the present disclosure; they merely illustrate and explain it.
FIG. 1 is a flowchart of a method for generating a panoramic image with depth information according to some embodiments of the present disclosure.
FIG. 2 is a schematic diagram of an intermediate image formed by joining a plurality of two-dimensional images end to end according to some embodiments of the present disclosure.
FIG. 3 is a schematic diagram of the result of trimming the top and bottom edges of an intermediate image according to some embodiments of the present disclosure.
FIG. 4 is a schematic structural diagram of an apparatus for generating a panoramic image with depth information according to some embodiments of the present disclosure.
FIG. 5 is a schematic structural diagram of an apparatus for generating a panoramic image with depth information according to some other embodiments of the present disclosure.
FIG. 6 is a schematic structural diagram of an electronic device according to some embodiments of the present disclosure.

To make the objectives, technical solutions, and advantages of the present disclosure clearer, the technical solutions of the present disclosure are described in detail below with reference to the drawings and embodiments.

In embodiments of the present disclosure, an ordinary camera can be used to capture a two-dimensional image of the current scene. A plurality of two-dimensional images of the same current scene are joined end to end in the horizontal direction to form an intermediate image, and a trained neural network model for predicting image depth is used to determine the depth information of the intermediate image. The intermediate image is then cropped based on its middle position in the horizontal direction to obtain an image whose length equals that of the two-dimensional image of the current scene. Because the image cropped from the intermediate image carries the corresponding depth information, it can serve as the panoramic image of the current scene with depth information.
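The flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: images are plain lists of rows with one value per pixel, `predict_depth` is a hypothetical stand-in for the trained neural network model, and the names `crop_center` and `panorama_depth` are introduced here only for illustration.

```python
from typing import Callable, List

Image = List[List[float]]  # rows of single-channel pixel values, for brevity


def crop_center(image: Image, width: int) -> Image:
    """Keep `width` columns centered on the horizontal midpoint of `image`."""
    total = len(image[0])
    start = (total - width) // 2
    return [row[start:start + width] for row in image]


def panorama_depth(image: Image,
                   predict_depth: Callable[[Image], Image],
                   n: int = 3) -> Image:
    """Tile `image` n times, predict depth on the tiling, crop the middle."""
    intermediate = [row * n for row in image]   # n copies joined end to end
    depth = predict_depth(intermediate)         # hypothetical trained model
    return crop_center(depth, len(image[0]))    # same length as the input
```

With N = 3 the cropped span coincides with the middle copy, so each pixel near the left or right edge of the result was predicted with real image content on both sides.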

FIG. 1 is a flowchart of a method for generating a panoramic image with depth information according to some embodiments of the present disclosure. As shown in FIG. 1, the method mainly includes the following steps.

In step 101, a two-dimensional image of the current scene based on spherical projection is acquired.

In embodiments of the present disclosure, the two-dimensional image of the current scene based on spherical projection can be acquired from an image capture device. The two-dimensional image may carry only the RGB information of the panorama and no depth information, so the requirements on the image capture device are low: a scanning device with an RGB camera, such as an RGB fisheye camera, suffices, as does a mobile device with a camera, such as a mobile phone. Because the image capture device needs no depth camera, acquiring the panoramic image is inexpensive.

In embodiments of the present disclosure, the top and bottom edges of the acquired two-dimensional image of the current scene need not cover the complete viewing angle for the neural network model to infer the depth information of the two-dimensional image. It is enough that the image contains sufficient texture, line, and object information within its vertical viewing angle.

For example, when the complete viewing angle is not covered, the viewing-angle blind area in the two-dimensional image can be filled with black. According to an exemplary embodiment, after the two-dimensional image of the current scene based on spherical projection is acquired, it can further be determined whether the two-dimensional image contains a viewing-angle blind area. In response to determining that it does, the viewing-angle blind area in the two-dimensional image is filled with black.
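The black fill can be sketched as below. The disclosure does not say how the blind area is detected, so this sketch simply assumes a boolean mask is already available; `fill_blind_area`, `blind_mask`, and the tuple-based pixel type are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)
BLACK: Pixel = (0, 0, 0)


def fill_blind_area(image: List[List[Pixel]],
                    blind_mask: List[List[bool]]) -> List[List[Pixel]]:
    """Return a copy of `image` with every masked pixel replaced by black."""
    return [
        [BLACK if blind else pixel for pixel, blind in zip(row, mask_row)]
        for row, mask_row in zip(image, blind_mask)
    ]
```

In practice the mask would typically mark whole bands at the top and bottom of the panorama that the camera could not see.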

In step 102, an intermediate image is formed by joining a preset number of the two-dimensional images end to end in the horizontal direction.

In embodiments of the present disclosure, the preset number N is an integer greater than 1; for example, N may be 3.

A two-dimensional image based on spherical projection is one whose own content joins end to end. Forming an intermediate image by joining a preset number of the two-dimensional images end to end in the horizontal direction means connecting the tail of the first two-dimensional image to the head of the second, the tail of the second to the head of the third, and so on, while the head of the first two-dimensional image is not connected to the tail of the last. The preset number of two-dimensional images may be multiple identical images, that is, each is the two-dimensional image of the current scene based on spherical projection. For example, FIG. 2 shows a schematic diagram of an intermediate image formed from three two-dimensional images.
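A minimal sketch of this joining step, assuming the image is a list of rows and the same image is repeated N times; `tile_horizontally` is a hypothetical helper name, and a real implementation would more likely use an array library.

```python
from typing import List


def tile_horizontally(image: List[list], n: int = 3) -> List[list]:
    """Join n copies of `image` side by side (tail of one to head of the next)."""
    if n < 2:
        raise ValueError("the preset number N must be an integer greater than 1")
    return [list(row) * n for row in image]
```

For a single row `[10, 20, 30]` and N = 3, the result is `[10, 20, 30, 10, 20, 30, 10, 20, 30]`: at each internal seam the last column (30) sits directly next to the first column (10), which is the neighborhood a spherical panorama actually has, while the two outer ends of the intermediate image remain unconnected.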

In some examples, the ratio of the length (the horizontal extent) to the width (the vertical extent, also called the height) of a two-dimensional image based on spherical projection is 2:1. An intermediate image formed by joining N such two-dimensional images end to end in the horizontal direction therefore has a length-to-width ratio of 2N:1. For example, when N = 3, the aspect ratio of the intermediate image is 6:1.

In practical applications, the top and bottom edges of a two-dimensional image based on spherical projection may be distorted, and this distortion affects the subsequent training of the convolutional neural network and the depth estimation. To reduce this effect, in embodiments of the present disclosure the top and bottom edges of the intermediate image can also be trimmed after the intermediate image has been formed by joining the preset number of two-dimensional images end to end in the horizontal direction. For example, a strip of a certain ratio of the height can be trimmed from the top edge of the intermediate image, and a strip of a certain ratio of the height can be trimmed from the bottom edge. The trimming ratio for the top edge and that for the bottom edge may be the same or different.

In embodiments of the present disclosure, trimming the top and bottom edges of the intermediate image includes trimming, from each of the top edge and the bottom edge, a strip whose height is a preset ratio of the height of the intermediate image. The preset ratio only needs to ensure that the loss of texture, line, and object information in the two-dimensional image does not exceed a certain threshold; it may be 15% or another value. The threshold can be determined according to specific needs, and the present disclosure is not limited in this respect. For example, an intermediate image with an aspect ratio of 6:1 has an aspect ratio of 60:7 after 15% is trimmed from each of its top and bottom edges, since 70% of the height remains. FIG. 3 shows an example of an intermediate image whose top and bottom edges have been trimmed, in which the scribed (hatched) portions are the image portions that were trimmed away.
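The trimming and the 60:7 arithmetic can be checked with a short sketch. It assumes the image is a list of rows and uses exact fractions to avoid rounding; `trim_top_bottom` and `trimmed_aspect_ratio` are hypothetical helper names, and rounding the strip height down to whole rows is a choice made here, not something the text specifies.

```python
from fractions import Fraction
from typing import List, Optional


def trim_top_bottom(image: List[list], top_ratio: Fraction,
                    bottom_ratio: Optional[Fraction] = None) -> List[list]:
    """Drop a strip from the top and bottom, each a ratio of the image height."""
    if bottom_ratio is None:
        bottom_ratio = top_ratio          # the two ratios may also differ
    height = len(image)
    top = int(height * top_ratio)         # rows removed at the top edge
    bottom = int(height * bottom_ratio)   # rows removed at the bottom edge
    return image[top:height - bottom]


def trimmed_aspect_ratio(n: int, ratio: Fraction) -> Fraction:
    """Length:height of N joined 2:1 images after trimming `ratio` per edge."""
    return Fraction(2 * n) / (1 - 2 * ratio)
```

For N = 3 and a ratio of 15%, `trimmed_aspect_ratio(3, Fraction(15, 100))` evaluates to `Fraction(60, 7)`, matching the 60:7 figure: the length stays at 6 units while the height shrinks from 1 to 0.7.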

In step 103, the depth information of the intermediate image is determined using a trained neural network model for predicting image depth.

In embodiments of the present disclosure, the neural network model for predicting image depth can be obtained through training. For example, a large number of color three-dimensional images of real scenes with depth information may be collected in advance as training samples. The training samples require pixel-level alignment and cover a variety of indoor and outdoor scenes containing elements such as wall corners, cars, ceilings, floors, windows, and doors.

In general, color three-dimensional images of real scenes with depth information contain holes. In embodiments of the present disclosure, the large number of collected training samples (that is, the color three-dimensional images with depth information) can be preprocessed, for example by Gaussian filtering and resizing. Note that when the training samples are preprocessed, any holes in them are not filled. A hole here is a pixel in the image whose depth information cannot be determined. For such pixels, embodiments of the present disclosure keep the depth information unknown and do not obtain a depth value by prior estimation or any other method in order to fill the hole.
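The text states only that holes are left unfilled; it does not say how they are treated during training. One plausible reading, shown here purely as an assumption, is that hole pixels are skipped when an error is computed. The sketch marks unknown depth with `None` and computes a mean absolute error over the known pixels only; `masked_mean_abs_error` is a hypothetical name, and this is not the loss function of the disclosure.

```python
from typing import List, Optional


def masked_mean_abs_error(pred: List[List[float]],
                          target: List[List[Optional[float]]]) -> float:
    """Mean |pred - target| over pixels whose target depth is known."""
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t is None:          # hole: depth unknown, left unfilled
                continue
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("target contains no pixel with known depth")
    return total / count
```

Skipping holes rather than filling them keeps interpolated depth values, which may be wrong, from being treated as ground truth.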

After a large number of training samples have been collected, they are used to train a neural network model, which yields the neural network model for predicting image depth.

In embodiments of the present disclosure, the neural network model for predicting image depth can be trained as follows. A pre-trained convolutional neural network is taken as the initial backbone network, and the initial backbone network is structurally adjusted according to the U-Net network structure. The adjusted initial backbone network is then trained with a plurality of color three-dimensional images with depth information (that is, the training samples) to generate the neural network model for predicting image depth.

The pre-trained convolutional neural network may be a DenseNet network pre-trained on ImageNet. In some examples, the DenseNet network may be a DenseNet-169 network.

Structurally adjusting the initial backbone network according to the U-Net network structure can include the following steps.

The fully connected layer of the DenseNet network is removed. Based on the U-Net network structure, a plurality of upsampling layers are added after the last layer of the DenseNet network from which the fully connected layer has been removed, and a corresponding superposition layer is set for each of the upsampling layers. In some examples, when the DenseNet network is a DenseNet-169 network, four upsampling layers can be added after the last layer of the DenseNet-169 network.

Note that the pre-trained convolutional neural network may also be a model such as Google Inception, ResNet or VGG pre-trained on ImageNet. In that case, the way the upsampling layers are added based on the U-Net network structure and the way the superposition layer is set for each upsampling layer may differ; for example, the names of the added layers may change.

In embodiments of the present disclosure, each of the four added upsampling layers can, for example, correspond to one layer of the DenseNet-169 network. Suppose the four added upsampling layers are, from first to last, the first, second, third and fourth upsampling layers. The output of the last layer of the DenseNet-169 network with the fully connected layer removed is the input of the first upsampling layer; the output of the first upsampling layer is the input of the second; the output of the second is the input of the third; and the output of the third is the input of the fourth. The superposition layers corresponding to the first, second, third and fourth upsampling layers can be configured, in that order, as pool3_pool, pool2_pool, pool1 and conv1/relu.
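The chaining of the four upsampling layers and their superposition layers can be written down as a small wiring table. The sketch below is illustrative only: the names `encoder_output` and `up1` to `up4` are hypothetical, the halving of channels is the example multiple the description gives for the upsampling layers, and the starting value of 1664 channels is an assumption about the final feature map of a typical DenseNet-169 implementation, not a figure from the disclosure.

```python
from typing import Dict, List, Union

# Superposition (skip) layers in the order stated in the description.
SKIP_LAYERS = ["pool3_pool", "pool2_pool", "pool1", "conv1/relu"]

Step = Dict[str, Union[str, int]]


def decoder_plan(encoder_channels: int, skip_layers: List[str] = SKIP_LAYERS) -> List[Step]:
    """List, for each upsampling layer, its input, its superposition layer and its channels."""
    plan: List[Step] = []
    source, channels = "encoder_output", encoder_channels  # hypothetical name
    for index, skip in enumerate(skip_layers, start=1):
        name = f"up{index}"  # hypothetical layer name
        plan.append({
            "name": name,
            "input": source,            # output of the previous stage
            "skip": skip,               # layer whose output is superimposed
            "in_channels": channels,
            "out_channels": channels // 2,  # input channels = 2 x output channels
        })
        source, channels = name, channels // 2
    return plan


# 1664 is an assumed channel count for the encoder output.
plan = decoder_plan(1664)
for step in plan:
    print(step)
```

Each stage takes the previous stage's output as input, so only the first upsampling layer reads directly from the backbone.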

The number of input channels of each upsampling layer may be a preset multiple, for example twice, of its number of output channels. Each added upsampling layer upsamples its input by a preset resolution factor (for example, to twice the resolution), superimposes the upsampled result on the output data of the superposition layer corresponding to that upsampling layer, performs at least one convolution operation on the superimposed result, and rectifies the convolution result with a preset activation function (for example, the ReLU activation function). The at least one convolution operation may be, for example, two convolution operations: a first convolution is performed, a second convolution is then performed on the result of the first, and each may be a two-dimensional convolution with a 3×3 kernel. Here, upsampling the input of an upsampling layer to twice the resolution can mean expanding both the number of pixel rows and the number of pixel columns of the input to twice the original.
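The first two operations of an upsampling layer, doubling the rows and columns and then superimposing the result on the skip features, can be sketched in plain Python. Two points are assumptions rather than statements from the disclosure: pixels are repeated (nearest-neighbour interpolation), and superimposing is implemented as stacking channels, as is usual in U-Net style decoders. The convolutions and the activation are omitted.

```python
from typing import List

Channel = List[List[float]]   # one H x W feature map
FeatureMap = List[Channel]    # C channels, each H x W


def upsample2x(channel: Channel) -> Channel:
    """Double the number of rows and columns by repeating each pixel."""
    out: Channel = []
    for row in channel:
        wide = [value for value in row for _ in range(2)]  # double the columns
        out.append(wide)
        out.append(list(wide))                             # double the rows
    return out


def upsample_and_superimpose(x: FeatureMap, skip: FeatureMap) -> FeatureMap:
    """Upsample every channel of x by 2, then append the skip channels (assumed concatenation)."""
    up = [upsample2x(channel) for channel in x]
    if len(up[0]) != len(skip[0]) or len(up[0][0]) != len(skip[0][0]):
        raise ValueError("skip features must have twice the resolution of x")
    return up + skip


x = [[[1.0, 2.0], [3.0, 4.0]]]              # 1 channel, 2 x 2
skip = [[[0.0] * 4 for _ in range(4)]]      # 1 channel, 4 x 4
merged = upsample_and_superimpose(x, skip)  # 2 channels, 4 x 4
```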

On the output of the last of the upsampling layers, one convolution operation is performed to output depth information and one convolution operation is performed to output confidence information. Alternatively, a single convolution operation with two output channels is performed on the output of the last upsampling layer, the two channels outputting the confidence information and the depth information respectively. In some examples, the convolution operation may be a two-dimensional convolution with a 3×3 kernel.

The depth information D may be expressed in meters. The confidence information is the confidence in the predicted depth value of each pixel in the intermediate image. A high confidence value indicates that the predicted depth of the pixel is close to its actual depth; a low confidence value indicates that the predicted depth is not very close to the actual depth; and a confidence value of 0 indicates that the pixel is a hole whose depth value cannot be determined or predicted.

According to some illustrative examples, the adjusted initial backbone network can be trained on the plurality of color three-dimensional images with depth information using supervised learning, and the loss function used for depth estimation is a function of the depth estimate and the confidence estimate that the neural network model being trained produces for each pixel. In some embodiments, the loss function may be a combination of the following three functions.

Function 1, a function f1(x) of the depth estimate of each pixel x by the neural network model: the mask-filtered absolute value of the difference between the model's depth estimate for pixel x and the true depth value;
Function 2, a gradient function f2(x) of the depth estimate of each pixel x by the neural network model: the mask-filtered absolute value of the difference between the gradient of the model's depth estimate at pixel x and the gradient of the true depth value;
Function 3, a function f3(x) of the confidence estimate of each pixel x by the neural network model: the absolute value of the difference between the model's confidence estimate for pixel x and the true confidence value.

The true confidence value can be determined as follows. If the neural network model has no depth estimate for pixel x, the true confidence value is set to 0. If the neural network model has a depth estimate for pixel x, the true confidence value is given by: true confidence value = 1 - preset adjustment factor (for example, 0.02) × (depth estimate of pixel x by the neural network model - true depth value of pixel x).
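The rule for the true confidence value can be sketched as a small function. The formula as stated uses the plain difference between estimate and true depth; the sketch assumes the absolute difference and clamps the result to [0, 1], so that the value falls as the error grows and never leaves the range of a confidence. Both choices are assumptions, as is representing a missing estimate by `None`.

```python
from typing import Optional


def true_confidence(depth_estimate: Optional[float],
                    depth_true: float,
                    factor: float = 0.02) -> float:
    """Ground-truth confidence for one pixel; factor is the preset adjustment factor."""
    if depth_estimate is None:
        return 0.0  # no depth estimate for this pixel
    # Assumption: absolute error, so over- and under-estimates are treated alike.
    value = 1.0 - factor * abs(depth_estimate - depth_true)
    # Assumption: clamp so the target stays a valid confidence.
    return min(1.0, max(0.0, value))


print(true_confidence(3.0, 3.0))   # exact estimate
print(true_confidence(5.0, 3.0))   # 2 m error
print(true_confidence(None, 3.0))  # missing estimate
```

With the example factor of 0.02, an error of 2 m lowers the target confidence by 0.04, and an error of 50 m or more drives it to 0.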

In embodiments of the present disclosure, the weighted average of the three functions above is computed for each pixel of the image, these per-pixel results are summed and then averaged, and the resulting mean is taken as the value of the loss function. The true depth value mentioned above is the actual depth value of a pixel in the image.

For functions f1 and f2 above, mask filtering makes it possible to ignore the depth estimates where the true depth has holes. Function f3 above uses an L1 (absolute value) loss for confidence estimation. With this function, the true confidence value is set to 0 where the true depth has holes; for pixels whose depth estimate is far from the true depth value the confidence estimate should approach 0, and for pixels whose depth estimate is relatively close to the true depth value it should approach 1.
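The combination of f1, f2 and f3 into a single loss value can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: holes in the true depth are represented by `None`, the gradient is approximated by forward differences to the right and downward neighbours, and the three weights (equal by default) are hypothetical because the disclosure does not give them.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]
MaskedGrid = List[List[Optional[float]]]  # None marks a hole in the true depth


def depth_loss(pred_depth: Grid, pred_conf: Grid,
               true_depth: MaskedGrid, true_conf: Grid,
               weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Mean over all pixels of the weighted average of f1, f2 and f3."""
    rows, cols = len(pred_depth), len(pred_depth[0])
    w1, w2, w3 = weights
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            t = true_depth[r][c]
            f1 = f2 = 0.0
            if t is not None:  # mask filtering: ignore holes in the true depth
                f1 = abs(pred_depth[r][c] - t)
                # Forward differences (assumed gradient approximation).
                for dr, dc in ((0, 1), (1, 0)):
                    rr, cc = r + dr, c + dc
                    if rr < rows and cc < cols and true_depth[rr][cc] is not None:
                        g_pred = pred_depth[rr][cc] - pred_depth[r][c]
                        g_true = true_depth[rr][cc] - t
                        f2 += abs(g_pred - g_true)
            f3 = abs(pred_conf[r][c] - true_conf[r][c])  # L1 loss on confidence
            total += (w1 * f1 + w2 * f2 + w3 * f3) / (w1 + w2 + w3)
    return total / (rows * cols)


# A constant 1 m offset: f1 = 1 everywhere, gradients and confidences agree.
loss = depth_loss([[2.0, 2.0]], [[1.0, 1.0]], [[1.0, 1.0]], [[1.0, 1.0]])
```

Because f3 is not masked, a hole in the true depth still contributes to the loss through its confidence target of 0.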

In step 104, the intermediate image is cropped about its horizontal middle position to obtain an image whose length equals that of the two-dimensional image, the depth information of the cropped image is determined, and the cropped image with depth information is taken as the panoramic image of the current scene.

In embodiments of the present disclosure, once the depth information of the intermediate image has been determined with the trained neural network model for predicting image depth, an intermediate image carrying depth information and confidence information is obtained. Cropping the intermediate image about its horizontal middle position (for example, its midpoint) yields an image whose length equals that of the two-dimensional image of the current scene. For example, as shown in FIG. 2, when the intermediate image consists of three copies of the same two-dimensional image of the current scene joined end to end horizontally, two image regions, each as long as 50% of the length of the two-dimensional image, are determined to the left and to the right of the horizontal midpoint of the intermediate image. Cropping the intermediate image to these two regions gives an image as long as the two-dimensional image of the current scene, and this image corresponds to the second two-dimensional image in the intermediate image. The depth information of the cropped image can then be determined, and the cropped image with depth information can be taken as the panoramic image of the current scene.
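The cropping about the horizontal midpoint can be sketched on a row-major grid, which may hold colors, depths or confidences alike. The function name and the list-of-rows representation are illustrative assumptions.

```python
from typing import List, TypeVar

T = TypeVar("T")


def crop_middle(intermediate: List[List[T]], panorama_width: int) -> List[List[T]]:
    """Keep panorama_width columns centred on the horizontal midpoint."""
    total = len(intermediate[0])
    mid = total // 2                   # horizontal midpoint of the intermediate image
    left = mid - panorama_width // 2   # 50% of the panorama width to the left of it
    return [row[left:left + panorama_width] for row in intermediate]


# Three copies of the same panorama joined end to end horizontally.
panorama = [[0, 1, 2, 3],
            [4, 5, 6, 7]]
intermediate = [row * 3 for row in panorama]
cropped = crop_middle(intermediate, panorama_width=4)
```

With three copies, the kept columns are exactly those of the second copy, whose left and right borders had neighbouring image content when the network predicted depth.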

In practice, the depth information of the intermediate image determined with the trained neural network model for predicting image depth includes the depth information of every pixel of the intermediate image. The depth information of every pixel of the cropped image, whose length equals that of the two-dimensional image, can therefore be determined directly. Likewise, the confidence information of every pixel of the cropped image can be determined directly.

Depth estimates whose corresponding confidence estimate exceeds a preset confidence threshold (for example, 0.8) can be used as the reliable source of depth data in the panoramic image of the current scene. The threshold can be adjusted according to, for example, whether the final application needs more depth data or depth data of higher confidence.
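Selecting the reliable depth values amounts to a per-pixel comparison against the threshold. In this sketch, marking rejected pixels with `None` is an assumption made for illustration.

```python
from typing import List, Optional


def reliable_depth(depth: List[List[float]],
                   confidence: List[List[float]],
                   threshold: float = 0.8) -> List[List[Optional[float]]]:
    """Keep a depth value only where its confidence exceeds the threshold."""
    return [
        [d if c > threshold else None for d, c in zip(depth_row, conf_row)]
        for depth_row, conf_row in zip(depth, confidence)
    ]


filtered = reliable_depth([[1.5, 2.0, 3.2]], [[0.95, 0.80, 0.40]])
```

Lowering the threshold keeps more depth data at lower confidence; raising it keeps less data at higher confidence.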

In embodiments of the present disclosure, once the panoramic image of the current scene with depth information has been determined, the depth information can support high-precision pixel alignment, image stitching and similar operations in subsequent processing. The depth information can also be converted into a single-point point cloud for subsequent three-dimensional reconstruction of whole indoor and outdoor scenes, such as triangular meshing and texture mapping.
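The disclosure does not spell out how the depth information is converted into a point cloud. One common approach, sketched here purely as an assumption, treats the panorama as an equirectangular image covering the full sphere, reads each depth value as the distance from the capture point along the pixel's viewing ray, and uses a y-up axis convention. Pixels without a usable depth are marked `None` and skipped.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def panorama_to_points(depth: List[List[Optional[float]]]) -> List[Point]:
    """Convert an equirectangular depth map into 3D points around the capture point."""
    rows, cols = len(depth), len(depth[0])
    points: List[Point] = []
    for v in range(rows):
        # Latitude runs from +pi/2 at the top row to -pi/2 at the bottom row.
        lat = math.pi / 2 - (v + 0.5) / rows * math.pi
        for u in range(cols):
            d = depth[v][u]
            if d is None or d <= 0:
                continue  # hole or unreliable pixel
            # Longitude runs from -pi at the left edge to +pi at the right edge.
            lon = (u + 0.5) / cols * 2 * math.pi - math.pi
            x = d * math.cos(lat) * math.sin(lon)
            y = d * math.sin(lat)
            z = d * math.cos(lat) * math.cos(lon)
            points.append((x, y, z))
    return points


points = panorama_to_points([[2.0, None], [1.0, 3.0]])
```

If the upper and lower edges of the image were cropped before prediction, the latitude range would have to be narrowed accordingly.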

The method for generating a panoramic image with depth information according to embodiments of the present disclosure has been described in detail above. Embodiments of the present disclosure further provide an apparatus for generating a panoramic image with depth information, which is described in detail below with reference to FIG. 4.

FIG. 4 is a schematic structural diagram of an apparatus 400 for generating a panoramic image with depth information according to some embodiments of the present disclosure. As shown in FIG. 4, the apparatus 400 can include an acquisition unit 401, a stitching unit 402, a processing unit 403 and a cropping unit 404.

The acquisition unit 401 is configured to acquire a two-dimensional image of the current scene based on spherical projection.

The stitching unit 402 is configured to form an intermediate image by joining a preset number of the two-dimensional images end to end in the horizontal direction.

The processing unit 403 is configured to determine the depth information of the intermediate image using a trained neural network model for predicting image depth.

The cropping unit 404 is configured to crop the intermediate image about its horizontal middle position to obtain an image whose length equals that of the two-dimensional image, to determine the depth information of the cropped image, and to take the cropped image with depth information as the panoramic image of the current scene.

FIG. 5 is a schematic structural diagram of an apparatus 500 for generating a panoramic image with depth information according to some other embodiments of the present disclosure. As shown in FIG. 5, the apparatus 500 includes an acquisition unit 501, a stitching unit 502, a processing unit 503 and a cropping unit 504. In some embodiments, these units can be implemented on the basis of the acquisition unit 401, the stitching unit 402, the processing unit 403 and the cropping unit 404 described above, respectively.

In some embodiments, the neural network model for predicting image depth can be trained by taking a pre-trained convolutional neural network as the initial backbone network, structurally adjusting the initial backbone network according to the U-Net network structure, and training the adjusted initial backbone network on a plurality of color three-dimensional images with depth information to generate the neural network model for predicting image depth.

In some embodiments, the pre-trained convolutional neural network is a DenseNet network pre-trained on ImageNet. For example, the DenseNet network may be a DenseNet-169 network.

In some embodiments, the processing unit 503 can include a deletion subunit 5031, an addition subunit 5032 and a computation subunit 5033.

The deletion subunit 5031 is configured to remove the fully connected layer of the DenseNet network.

The addition subunit 5032 is configured, based on the U-Net network structure, to add a plurality of upsampling layers after the last layer of the DenseNet network from which the fully connected layer has been removed and to set a corresponding superposition layer for each of the upsampling layers. The number of input channels of each upsampling layer is a preset multiple of its number of output channels. Each added upsampling layer upsamples its input by a preset resolution factor, superimposes the upsampled result on the output data of the corresponding superposition layer, performs at least one convolution operation on the superimposed result, and rectifies the convolution result with a preset activation function.

In some examples, the DenseNet network is a DenseNet-169 network, and four upsampling layers can be added after its last layer. The number of input channels of each upsampling layer is twice its number of output channels. In some examples, each added upsampling layer upsamples its input to twice the resolution. In some examples, two convolution operations are performed on the superimposed result.

The computation subunit 5033 is configured to perform, on the output of the last of the upsampling layers, one convolution operation that outputs depth information and one convolution operation that outputs confidence information.

In some embodiments, the DenseNet network is a DenseNet-169 network. The superposition layers corresponding to the four upsampling layers, taken in order from the first upsampling layer to the last, may be pool3_pool, pool2_pool, pool1 and conv1/relu, respectively.

In some embodiments, the processing unit 503 is further configured to preprocess the plurality of color three-dimensional images with depth information and, if during preprocessing at least one of these images is determined to contain a hole, not to fill the hole, a hole being a pixel of the color three-dimensional image with depth information whose depth information cannot be determined.

In some embodiments, the processing unit 503 is further configured to use supervised learning when training the adjusted initial backbone network on the plurality of color three-dimensional images with depth information, the loss function used for depth estimation being a function of the depth estimate and the confidence estimate of each pixel by the neural network model.

In some embodiments, the acquisition unit 501 is further configured, in response to determining that the two-dimensional image contains a blind viewing-angle region, to fill that region of the two-dimensional image with black.

In some embodiments, the stitching unit 502 is further configured to crop the upper and lower edges of the intermediate image.

In some embodiments, when cropping the upper and lower edges of the intermediate image, the stitching unit 502 is configured to crop, from the upper edge and from the lower edge of the intermediate image, a strip whose height is a preset proportion of the height of the intermediate image.

In some embodiments, the preset number may be 3 and the preset proportion may be 15%.
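The edge cropping performed by the stitching unit can be sketched as removing the same number of rows from the top and from the bottom of the intermediate image. The function name, the list-of-rows representation and the rounding of the strip height to a whole number of rows are assumptions for illustration.

```python
from typing import List, TypeVar

T = TypeVar("T")


def trim_top_bottom(image: List[List[T]], ratio: float = 0.15) -> List[List[T]]:
    """Remove a strip of ratio * height rows from the top and from the bottom."""
    height = len(image)
    cut = round(height * ratio)  # strip height in rows (assumed rounding)
    return image[cut:height - cut]


# A 20-row image: 15% of the height is 3 rows at each edge.
image = [[row] * 4 for row in range(20)]
trimmed = trim_top_bottom(image)
```

With the example proportion of 15%, 70% of the original image height remains.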

Embodiments of the present disclosure further provide a non-transitory computer-readable storage medium storing a computer program that includes instructions which, when executed by a processor of an electronic device, cause the processor to perform the method for generating a panoramic image with depth information according to embodiments of the present disclosure.

Embodiments of the present disclosure further provide an electronic device. FIG. 6 is a schematic structural diagram of an electronic device 600 according to an embodiment of the present disclosure. As shown in FIG. 6, the electronic device 600 includes a memory 601 and a processor 602, which can be connected via a bus. The memory 601 stores a computer program executable by the processor 602; when the processor 602 executes the program, the method for generating a panoramic image with depth information according to embodiments of the present disclosure can be implemented.

The above are merely exemplary embodiments of the present disclosure and do not limit it. Any modification, equivalent replacement or improvement made without departing from the spirit and principles of the present disclosure falls within its scope of protection.

Claims

1. A method for generating a panoramic image with depth information, comprising:
acquiring a two-dimensional image of a current scene based on spherical projection;
forming an intermediate image by joining a preset number of the two-dimensional images end to end in the horizontal direction;
determining depth information of the intermediate image using a trained neural network model for predicting image depth; and
cropping the intermediate image about its horizontal middle position to obtain an image whose length equals that of the two-dimensional image, determining depth information of the cropped image, and taking the cropped image with depth information as the panoramic image of the current scene.

2. The method of claim 1, wherein the neural network model for predicting image depth is trained by:
taking a pre-trained convolutional neural network as an initial backbone network and structurally adjusting the initial backbone network according to the U-Net network structure; and
training the adjusted initial backbone network on a plurality of color three-dimensional images with depth information to generate the neural network model for predicting image depth.

3. The method of claim 2, wherein the pre-trained convolutional neural network is a DenseNet network pre-trained on ImageNet, and
structurally adjusting the initial backbone network according to the U-Net network structure comprises:
removing the fully connected layer of the DenseNet network;
based on the U-Net network structure, adding a plurality of upsampling layers after the last layer of the DenseNet network from which the fully connected layer has been removed, and setting a corresponding superposition layer for each of the plurality of upsampling layers, the number of input channels of each upsampling layer being a preset multiple of its number of output channels;
in each added upsampling layer, upsampling the input of the upsampling layer by a preset resolution factor, superimposing the upsampled result on the output data of the superposition layer corresponding to the upsampling layer, performing at least one convolution operation on the superimposed result, and rectifying the convolution result with a preset activation function; and
performing, on the output of the last of the plurality of upsampling layers, one convolution operation that outputs depth information and one convolution operation that outputs confidence information.

4. The method of claim 3, wherein the DenseNet network is a DenseNet-169 network and the plurality of upsampling layers are four upsampling layers;
the number of input channels of each upsampling layer is twice its number of output channels;
each added upsampling layer upsamples its input to twice the resolution; and
the at least one convolution operation is two convolution operations.

5. The method of claim 4, wherein the superposition layers corresponding to the four upsampling layers, taken in order from the first upsampling layer to the last, are pool3_pool, pool2_pool, pool1 and conv1/relu, respectively.

6. The method of claim 2, further comprising, before training the adjusted initial backbone network on the plurality of color three-dimensional images with depth information:
preprocessing the plurality of color three-dimensional images with depth information, wherein, if during the preprocessing at least one of the images is determined to contain a hole, the hole is not filled, a hole being a pixel of the color three-dimensional image with depth information whose depth information cannot be determined.

7. The method of claim 2, wherein supervised learning is used when training the adjusted initial backbone network on the plurality of color three-dimensional images with depth information, and
the loss function used for depth estimation is a function of the depth estimate and the confidence estimate of each pixel by the neural network model.

8. The method of any one of claims 1 to 7, further comprising, after acquiring the two-dimensional image of the current scene based on spherical projection, in response to determining that the two-dimensional image contains a blind viewing-angle region, filling the blind viewing-angle region with black.

9. The method of any one of claims 1 to 7, further comprising, after forming the intermediate image by joining the preset number of the two-dimensional images end to end in the horizontal direction, cropping the upper and lower edges of the intermediate image.

10. The method of claim 9, wherein cropping the upper and lower edges of the intermediate image comprises:
cropping, from the upper edge and from the lower edge of the intermediate image, a strip whose height is a preset proportion of the height of the intermediate image.

11. An apparatus for generating a panoramic image with depth information, comprising:
an acquisition unit configured to acquire a two-dimensional image of a current scene based on spherical projection;
a stitching unit configured to form an intermediate image by joining a preset number of the two-dimensional images end to end in the horizontal direction;
a processing unit configured to determine depth information of the intermediate image using a trained neural network model for predicting image depth; and
a cropping unit configured to crop the intermediate image about its horizontal middle position to obtain an image whose length equals that of the two-dimensional image, to determine depth information of the cropped image, and to take the cropped image with depth information as the panoramic image of the current scene.

12. The apparatus of claim 11, wherein the neural network model for predicting image depth is trained by:
taking a pre-trained convolutional neural network as an initial backbone network and structurally adjusting the initial backbone network according to the U-Net network structure; and
training the adjusted initial backbone network on a plurality of color three-dimensional images with depth information to generate the neural network model for predicting image depth.

13. The device of claim 12, wherein the pre-trained convolutional neural network is a DenseNet network pre-trained on ImageNet, and the processing unit comprises:
a deletion subunit for deleting the fully connected layer of the DenseNet network;
an addition subunit for adding, based on the U-Net network structure, a plurality of upsampling layers after the last layer of the DenseNet network from which the fully connected layer has been deleted, and setting up a superposition layer corresponding to each of the plurality of upsampling layers, wherein the number of input channels of each upsampling layer is a preset multiple of its number of output channels, and each added upsampling layer upsamples its input information by a preset resolution factor, superimposes the upsampling result on the output data of the convolution layer corresponding to that upsampling layer, performs at least one convolution operation on the superposition result, and applies linear rectification to the result of the convolution operation using a preset activation function; and
a computation subunit for performing, on the output of the last of the plurality of upsampling layers, one convolution operation that outputs depth information and one convolution operation that outputs confidence information.

14. The device of claim 13, wherein:
the DenseNet network is a DenseNet-169 network, and the plurality of upsampling layers is four upsampling layers;
the number of input channels of each upsampling layer is twice its number of output channels;
each added upsampling layer upsamples its input information to twice the resolution; and
the at least one convolution operation is two convolution operations.
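The channel and resolution arithmetic implied by these numbers can be worked through in a few lines of Python. Two values below are assumptions not stated in the claims: a standard DenseNet-169 with its classifier removed outputs 1664 feature channels at 1/32 of the input resolution. The sketch also ignores any extra channels contributed by the superposition with encoder features.

```python
def decoder_plan(encoder_channels=1664, encoder_stride=32,
                 num_upsampling_layers=4, channel_ratio=2, scale=2):
    """List (in_channels, out_channels, output_stride) per upsampling layer.

    encoder_channels / encoder_stride describe a standard DenseNet-169
    feature extractor (assumed values, not taken from the claims).
    """
    plan = []
    in_ch, stride = encoder_channels, encoder_stride
    for _ in range(num_upsampling_layers):
        out_ch = in_ch // channel_ratio  # input channels = 2 x output channels
        stride //= scale                 # each layer doubles the resolution
        plan.append((in_ch, out_ch, stride))
        in_ch = out_ch
    return plan

plan = decoder_plan()
for in_ch, out_ch, stride in plan:
    print(f"{in_ch:5d} -> {out_ch:4d} channels, 1/{stride} of input resolution")
```

Under these assumptions the last upsampling layer emits 104 channels at half the input resolution, and the two final convolutions for depth and confidence would operate on that feature map.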

15. The device of claim 12, wherein the processing unit is further configured to pre-process the plurality of color three-dimensional images with depth information and, when it is determined during the pre-processing that at least one of the color three-dimensional images with depth information contains a hole, a hole being a pixel for which depth information cannot be determined, to leave the hole unfilled.

16. The device according to any one of claims 11 to 15, wherein the acquisition unit is further configured to fill a visual-angle blind area with black in response to determining that there is a visual-angle blind area in the two-dimensional image.

17. The device according to any one of claims 11 to 15, wherein the stitching unit is further configured to trim the upper and lower edges of the intermediate image.

18. The device according to any one of claims 11 to 15, wherein the stitching unit is further configured to crop, from each of the upper edge and the lower edge of the intermediate image, a strip whose height is a preset percentage of the height of the intermediate image.

19. A non-transitory computer-readable storage medium having stored thereon a computer program comprising instructions which, when executed by a processor of an electronic device, cause the processor to perform the method of any one of claims 1 to 10.

20. An electronic device, comprising:
a processor; and
a memory for storing a computer program executable by the processor,
wherein the computer program, when executed by the processor, causes the processor to perform the method according to any one of claims 1 to 10.