WO2023073971A1 - Three-dimensional reconstruction device, three-dimensional reconstruction method and program - Google Patents

Three-dimensional reconstruction device, three-dimensional reconstruction method and program

Info

Publication number
WO2023073971A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
dimensional
dimensional reconstruction
coordinates
indoor space
Prior art date
Application number
PCT/JP2021/040164
Other languages
French (fr)
Japanese (ja)
Inventor
みずき 田端
潤一郎 玉松
亮 田中
陽祐 竹内
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to JP2023556075A (JPWO2023073971A1)
Priority to PCT/JP2021/040164 (WO2023073971A1)
Publication of WO2023073971A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/10 Geometric effects
    • G06T 15/20 Perspective computation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery


Abstract

This three-dimensional reconstruction device (1) is provided with: an input unit (11) that inputs a panoramic image (21) of a target indoor space and three-dimensional coordinates (22) interpolated so as to connect the corners of the photographic subject shown in a drawing of the indoor space; an image encoding unit (12) that extracts image feature amounts from the panoramic image (21); a shape encoding unit (13) that extracts three-dimensional shape feature amounts from the interpolated three-dimensional coordinates (22); a concatenation operation unit (14) that generates feature amounts by concatenating the image feature amounts and the three-dimensional shape feature amounts; an image coordinate decoding unit (15) that generates boundary image coordinates by decoding the concatenated feature amounts; and a three-dimensional reconstruction unit (16) that performs three-dimensional reconstruction of the indoor space on the basis of the boundary image coordinates.

Description

Three-dimensional reconstruction device, three-dimensional reconstruction method, and program
 The present disclosure relates to a three-dimensional reconstruction device, a three-dimensional reconstruction method, and a program that perform three-dimensional reconstruction of an indoor space.
 Conventionally, technologies for three-dimensional reconstruction from panoramic images of indoor spaces such as rooms include HorizonNet, which uses a deep neural network. Three-dimensional reconstruction refers to reconstructing the original three-dimensional structure from a projected image of a solid object or from a planar image. A neural network is a mathematical model, reproduced on a computer, of the nerve cells (neurons) in the human brain and their connections, i.e., the neural circuit network, applied to operations such as machine learning. A neural network forms a multi-layered network consisting of an input layer, one or more hidden layers (intermediate layers), and an output layer. A deep neural network is a neural network adapted to deep learning, with a network four or more layers deep. Non-Patent Document 1 describes a method of estimating a three-dimensional room layout from a single panoramic image using HorizonNet. FIG. 5 reproduces a photograph disclosed in Non-Patent Document 1. As shown in FIG. 5, existing technologies such as HorizonNet use only a panoramic image 21 as input information and learn the corners and boundaries of wall surfaces from the image alone. Three-dimensional reconstruction 23 is then performed by detecting the boundary B and estimating the three-dimensional coordinates. Non-Patent Document 2 outlines PointNet, a deep neural network that can take measured point clouds directly as input.
 However, conventional boundary detection using only panoramic images is based on image feature amounts (edges, corners, etc.), and there are many indoor spaces where boundary detection from image feature amounts is difficult because the numerous objects present in the space occlude the boundaries. Applying the conventional technology to such indoor spaces lowers the boundary detection accuracy and therefore the reconstruction accuracy of the three-dimensional image.
 An object of the present invention, made in view of such circumstances, is to enable highly accurate three-dimensional reconstruction of an indoor space even when boundary detection from image feature amounts is difficult, such as when many objects are present in the indoor space.
 To solve the above problems, a three-dimensional reconstruction device according to one embodiment is a three-dimensional reconstruction device that performs three-dimensional reconstruction of an indoor space, comprising: an input unit that inputs a panoramic image of a target indoor space and three-dimensional coordinates interpolated so as to connect the corners of the photographic subject described in a drawing of the indoor space; an image encoding unit that extracts an image feature amount from the panoramic image; a shape encoding unit that extracts a three-dimensional shape feature amount from the interpolated three-dimensional coordinates; a concatenation operation unit that generates a feature amount by concatenating the image feature amount and the three-dimensional shape feature amount; an image coordinate decoding unit that generates boundary image coordinates by decoding the concatenated feature amount; and a three-dimensional reconstruction unit that performs three-dimensional reconstruction of the indoor space based on the boundary image coordinates.
 To solve the above problems, a three-dimensional reconstruction method according to one embodiment is a three-dimensional reconstruction method for performing three-dimensional reconstruction of an indoor space, comprising, by a three-dimensional reconstruction device: a step of inputting a panoramic image of a target indoor space and three-dimensional coordinates interpolated so as to connect the corners of the photographic subject described in a drawing of the indoor space; a step of extracting an image feature amount from the panoramic image; a step of extracting a three-dimensional shape feature amount from the interpolated three-dimensional coordinates; a step of generating a feature amount by concatenating the image feature amount and the three-dimensional shape feature amount; a step of generating boundary image coordinates by decoding the concatenated feature amount; and a step of performing three-dimensional reconstruction of the indoor space based on the boundary image coordinates.
 To solve the above problems, a program according to one embodiment causes a computer to function as the above three-dimensional reconstruction device.
 According to the present disclosure, even for spaces where boundary detection from image feature amounts is difficult, such as when many objects are present in the indoor space, using a panoramic image together with three-dimensional coordinates such as a point cloud makes highly accurate boundary detection possible, and therefore highly accurate three-dimensional reconstruction of the indoor space possible.
 FIG. 1 is a block diagram showing a configuration example of a three-dimensional reconstruction device according to one embodiment.
 FIG. 2 is a flowchart showing an example of a three-dimensional reconstruction method executed by the three-dimensional reconstruction device according to one embodiment.
 FIG. 3 is a flowchart explaining the procedure by which the shape encoding unit extracts a three-dimensional shape feature amount.
 FIG. 4 is a block diagram showing a schematic configuration of a computer functioning as the three-dimensional reconstruction device.
 FIG. 5 is a diagram showing a conventional procedure for three-dimensional reconstruction from a panoramic image.
 Hereinafter, embodiments for carrying out the present invention will be described in detail with reference to the drawings.
 FIG. 1 is a block diagram showing a configuration example of a three-dimensional reconstruction device according to one embodiment. The three-dimensional reconstruction device 1 shown in FIG. 1 includes an input unit 11, an image encoding unit 12, a shape encoding unit 13, a concatenation operation unit 14, an image coordinate decoding unit 15, and a three-dimensional reconstruction unit 16. The three-dimensional reconstruction device 1 performs three-dimensional reconstruction of an indoor space.
 In this embodiment, a panoramic image 21 of the indoor space and a drawing of that indoor space are used together to realize highly accurate boundary detection even for spaces where it is difficult to obtain a sufficient image feature amount a. To this end, one panoramic image 21 of the indoor space and the three-dimensional coordinates 22 (with an arbitrary point as the origin) interpolated so as to connect the corners of the photographic subject described in the drawing of the indoor space must be prepared in advance as input data.
 The input unit 11 inputs a panoramic image 21 of the target indoor space and three-dimensional coordinates 22 interpolated so as to connect the corners of the photographic subject described in the drawing of the indoor space. The input unit 11 must be given one panoramic image and the corresponding three-dimensional coordinates 22 (obtained from the dimensions described in the drawing of the indoor space, with an arbitrary point as the origin). The input unit 11 outputs the panoramic image 21 of the target indoor space to the image encoding unit 12 and outputs the interpolated three-dimensional coordinates 22 to the shape encoding unit 13.
 The image encoding unit 12 extracts the image feature amount a, represented as a matrix, from the panoramic image 21. The image encoding unit 12 may extract the image feature amount a from the panoramic image 21 using HorizonNet. The image encoding unit 12 outputs the image feature amount a to the concatenation operation unit 14.
 HorizonNet is a neural network that addresses the task of estimating a three-dimensional room layout from a single panoramic image. The HorizonNet feature extractor receives a single panoramic image 21 and extracts a plurality of feature amounts represented as matrices. It then trains on a feature map obtained by concatenating these feature amounts and generates boundary image coordinates representing the boundary positions between floor and wall, between ceiling and wall, and between wall and wall. Finally, post-processing is applied to reconstruct the three-dimensional room layout from the image coordinates of the boundaries.
 The shape encoding unit 13 extracts the three-dimensional shape feature amount b, represented as a matrix, from the interpolated three-dimensional coordinates 22. The shape encoding unit 13 may extract the three-dimensional shape feature amount b from the interpolated three-dimensional coordinates 22 using PointNet. The shape encoding unit 13 outputs the three-dimensional shape feature amount b to the concatenation operation unit 14. The procedure by which the shape encoding unit 13 extracts the three-dimensional shape feature amount b is described later with reference to FIG. 3.
 The concatenation operation unit 14 generates the feature amount c by concatenating the image feature amount a extracted from the panoramic image 21 and the three-dimensional shape feature amount b extracted from the interpolated three-dimensional coordinates 22, each represented as a matrix. The concatenation operation unit 14 outputs the concatenated feature amount c to the image coordinate decoding unit 15.
 The image coordinate decoding unit 15 generates the boundary image coordinates d by decoding the concatenated feature amount c. The image coordinate decoding unit 15 may decode the concatenated feature amount c using HorizonNet.
 The three-dimensional reconstruction unit 16 performs three-dimensional reconstruction of the indoor space based on the boundary image coordinates d.
 FIG. 2 is a flowchart showing an example of the three-dimensional reconstruction method executed by the three-dimensional reconstruction device according to one embodiment.
<Step S01>
 In step S01, the input unit 11 inputs the panoramic image 21. The panoramic image 21, prepared before this step is executed, has the vertical direction of the image aligned with the zenith direction.
<Step S02>
 In step S02, the input unit 11 inputs the three-dimensional coordinates 22 interpolated so as to connect the corners of the photographic subject of the panoramic image 21. Before this step is executed, the three-dimensional coordinates of the photographic subject of the panoramic image 21 (the three-dimensional coordinates of the corners, obtained by setting the origin of the coordinate system at one arbitrary corner) are prepared. The three-dimensional coordinates 22 are these coordinates interpolated so as to connect the corners. Whereas the pre-interpolation data holds the three-dimensional coordinates of a point cloud consisting only of the corner points, the three-dimensional coordinates 22 also hold the three-dimensional coordinates of each point in the point cloud along the interpolating line segments connecting the corners.
<Step S03>
 In step S03, the image encoding unit 12 extracts the image feature amount a, represented as a matrix, from the panoramic image 21. The above-mentioned HorizonNet or the like may be used to extract the image feature amount a.
<Step S04>
 In step S04, the shape encoding unit 13 extracts the three-dimensional shape feature amount b, represented as a matrix, from the three-dimensional coordinates 22. The shape encoding unit 13 may extract the three-dimensional shape feature amount b from the three-dimensional coordinates 22 using PointNet. PointNet is a deep neural network that can take measured point clouds directly as input. FIG. 3 is a flowchart explaining the procedure by which the shape encoding unit 13 extracts the three-dimensional shape feature amount using PointNet. This procedure is described below, divided into steps S041 to S044.
 In step S041, the T-net takes the point cloud as input and outputs a 3x3 affine transformation matrix. The point cloud is represented as a set of three-dimensional points {Pi | i = 1, ..., n}, where each point Pi has (x, y, z) coordinates. The T-net is a network that takes a point cloud as input and outputs an affine transformation matrix. An affine transformation performs scaling, rotation, translation, and the like collectively using a matrix. As shown in FIG. 3, the internal structure of the T-net is similar to that of the shape encoding unit 13.
 In step S042, a matrix multiplier multiplies the input point cloud by the affine transformation matrix output by the T-net (matrix multiply). The result is an n x 3 matrix. This operation makes it possible to remove the effects of translation and rotation of the point cloud.
 As mentioned above, a neural network is a mathematical model, reproduced on a computer, of the nerve cells (neurons) in the human brain and their connections, i.e., the neural circuit network, applied to operations such as machine learning. A "perceptron" is the smallest unit of a neural network: a function that outputs one value for multiple inputs. A "perceptron" outputs one signal for n input signals and is expressed by the following equation (1), where y is the output signal, f() is the "activation function", x_i is the i-th input signal (i = 0, 1, ..., n-1), w_i is a variable representing the weight applied to the i-th input signal, and b is a variable called the bias.

   y = f(w_0 * x_0 + w_1 * x_1 + ... + w_{n-1} * x_{n-1} + b)   (1)
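 Equation (1) translates directly into code; a minimal sketch with made-up weights and inputs:

    import numpy as np

    def perceptron(x, w, b, f=np.tanh):
        """y = f(w_0*x_0 + ... + w_{n-1}*x_{n-1} + b), i.e. equation (1)."""
        return f(np.dot(w, x) + b)

    x = np.array([0.5, -1.0, 2.0])   # n = 3 input signals (illustrative)
    w = np.array([0.1, 0.4, -0.2])   # weights w_i
    print(perceptron(x, w, b=0.05))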
 In step S043, the n x 3 matrix obtained in step S042 is input to a multi-layer perceptron (denoted mlp in FIG. 3). "mlp(64, 128, 256)" in FIG. 3 is a multi-layer perceptron whose layers have output sizes 64, 128, and 256. A multi-layer perceptron is constructed, for example, by connecting fully-connected layers and activation functions. A fully-connected layer combines the numerical inputs arriving at one node from multiple nodes (modeled neurons) into a single value by a linear transformation. An activation function non-linearly transforms an input value into another value when passing output from one node to the next. In this embodiment, ReLU is used as the activation function. ReLU (Rectified Linear Unit) is a function whose output is always 0 when the input is 0 or less, and equal to the input when the input is greater than 0.
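 The ReLU definition above, as a one-line worked example:

    import numpy as np

    def relu(x):
        """Output 0 for inputs of 0 or less, the input itself otherwise."""
        return np.maximum(0, x)

    print(relu(np.array([-1.5, 0.0, 2.0])))  # [0. 0. 2.]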
 In step S044, the three-dimensional shape feature amount is obtained by max pooling. Max pooling selects and keeps the maximum value among the output values in each range. This operation makes it possible to remove the influence of the order of the points in the point cloud.
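 Putting steps S041 to S044 together, the shape encoding unit can be sketched as a PointNet-style encoder. This is a simplified illustration only, assuming the shared per-point MLP is realized with 1x1 convolutions; details of the published PointNet (batch normalization, the T-net's identity initialization, its exact layer sizes) are omitted.

    import torch
    import torch.nn as nn

    def shared_mlp(sizes, c_in=3):
        """Shared per-point MLP (mlp(64, 128, 256) in FIG. 3), built from
        1x1 convolutions so the same weights apply to every point."""
        layers = []
        for c_out in sizes:
            layers += [nn.Conv1d(c_in, c_out, 1), nn.ReLU()]
            c_in = c_out
        return nn.Sequential(*layers)

    class ToyShapeEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.tnet_mlp = shared_mlp((64, 128, 256))
            self.tnet_fc = nn.Linear(256, 9)   # predicts a 3x3 matrix
            self.mlp = shared_mlp((64, 128, 256))

        def forward(self, pts):                # pts: (B, n, 3)
            x = pts.transpose(1, 2)            # (B, 3, n)
            # S041: T-net outputs a 3x3 affine transformation matrix
            t = self.tnet_fc(self.tnet_mlp(x).max(dim=2).values).view(-1, 3, 3)
            # S042: matrix multiply of the point cloud by that matrix -> n x 3
            x = torch.bmm(pts, t).transpose(1, 2)
            # S043: shared multi-layer perceptron applied to each point
            x = self.mlp(x)                    # (B, 256, n)
            # S044: max pooling over points gives an order-invariant feature
            return x.max(dim=2).values         # (B, 256)

    b = ToyShapeEncoder()(torch.randn(1, 200, 3))
    print(b.shape)                             # torch.Size([1, 256])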
<Step S05>
 In step S05, the concatenation operation unit 14 concatenates the image feature amount a and the three-dimensional shape feature amount b to generate the concatenated feature amount c. Methods for combining feature amounts include simply concatenating the matrices, taking the element-wise sum or element-wise product of the matrices, using a bilinear model, and the like.
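 The combination methods named in step S05 look like this in code; the feature sizes are assumed to already agree where an operation requires it.

    import torch

    a = torch.randn(1, 256)   # image feature amount (illustrative size)
    b = torch.randn(1, 256)   # three-dimensional shape feature amount

    c_concat = torch.cat([a, b], dim=1)   # simple matrix concatenation -> (1, 512)
    c_sum = a + b                         # element-wise sum
    c_prod = a * b                        # element-wise product
    # bilinear model: outer product of the two feature vectors, flattened
    c_bilinear = torch.einsum('bi,bj->bij', a, b).flatten(1)   # (1, 65536)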
<Step S06>
 In step S06, the image coordinate decoding unit 15 decodes the concatenated feature amount c to generate the boundary image coordinates d. A HorizonNet decoder or the like may be used to decode the concatenated feature amount c.
<Step S07>
 In step S07, the three-dimensional reconstruction unit 16 performs three-dimensional reconstruction based on the decoded boundary image coordinates d. The three-dimensional reconstruction unit 16 performs processing such as determining the wall surfaces by principal component analysis based on the Manhattan world assumption. The Manhattan world assumption is the hypothesis that man-made structures in three-dimensional space, such as ceilings and walls, have three dominant, mutually orthogonal axes, and that the surfaces composing them, such as ceilings and walls, are arranged perpendicular or parallel to these three axes.
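 One way to picture the wall-surface determination in step S07: principal component analysis of the reconstructed boundary points yields dominant directions, which the Manhattan world assumption lets one treat as the room's three orthogonal axes. A rough sketch under that assumption, not the disclosed post-processing:

    import numpy as np

    # Hypothetical 3D boundary points recovered from the image coordinates d
    pts = np.random.rand(500, 3)

    centered = pts - pts.mean(axis=0)
    # Principal component analysis via SVD of the centered point matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt                    # rows: dominant, mutually orthogonal directions

    # Under the Manhattan world assumption, wall planes are perpendicular or
    # parallel to these axes; expressing the points in this frame lets
    # axis-aligned planes (walls, floor, ceiling) be fitted per coordinate.
    coords = centered @ axes.T
    print(axes.shape, coords.shape)   # (3, 3) (500, 3)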
 As described above, to achieve the object of the present invention of realizing highly accurate boundary detection even in an indoor space where it is difficult to obtain a sufficient image feature amount a, the panoramic image 21 of the indoor space is used together with the drawing of the indoor space.
 In the present disclosure, highly accurate boundary detection with the aid of the drawing is achieved by (i) preparing, as input data, the three-dimensional coordinates 22 interpolated between the corners of the photographic subject of the panoramic image 21 described in the drawing, (ii) adding the shape encoding unit 13, which extracts the three-dimensional shape feature amount (point cloud feature amount) b from the three-dimensional coordinates 22, and (iii) performing boundary estimation that takes three-dimensional information into account by decoding the feature amount c obtained by concatenating the image feature amount a and the three-dimensional shape feature amount b. This makes possible boundary detection that considers the three-dimensional shape feature amount b in addition to the image feature amount a, improving three-dimensional reconstruction accuracy.
 Using the three-dimensional information described in drawings and the like in addition to the panoramic image 21 enables boundary detection that better matches the drawing information. Highly accurate boundary detection in turn enables highly accurate three-dimensional reconstruction. Furthermore, highly accurate three-dimensional reconstruction reduces the image reprojection error, leading to improved visibility.
 Also, if an inspection image of a structure is used as the panoramic image 21, three-dimensional data containing deterioration information is obtained, so structural calculations can be performed and a quantitative soundness evaluation becomes possible.
 Therefore, according to the three-dimensional reconstruction device 1 of the present disclosure, even for spaces where boundary detection from the image feature amount a is difficult, such as when many objects are present in the indoor space, using the panoramic image 21 together with three-dimensional coordinates 22 such as a point cloud makes highly accurate boundary detection possible, and therefore highly accurate three-dimensional reconstruction possible.
 More specifically, according to the three-dimensional reconstruction device 1 of the present disclosure: (i) whereas the reconstruction accuracy of conventional three-dimensional reconstruction from panoramic images depends on the image feature amount, the present disclosure combines a panoramic image with three-dimensional coordinates such as a point cloud, realizing highly accurate boundary detection even for spaces where sufficient image feature amounts are difficult to obtain; (ii) highly accurate detection of the boundaries between wall surfaces becomes possible without requiring the camera position and orientation at the time the panoramic image was captured; and (iii) highly accurate three-dimensional image reconstruction becomes possible even for indoor spaces whose wall surfaces are occluded by their contents.
 The input unit 11, image encoding unit 12, shape encoding unit 13, concatenation operation unit 14, image coordinate decoding unit 15, and three-dimensional reconstruction unit 16 of the above three-dimensional reconstruction device 1 form part of a control device (controller). The control device may be configured by dedicated hardware such as an ASIC (Application Specific Integrated Circuit) or FPGA (Field-Programmable Gate Array), by a processor, or by both.
 A computer capable of executing program instructions can also be used to realize the functions of the three-dimensional reconstruction device 1. FIG. 4 is a block diagram showing a schematic configuration of a computer functioning as the three-dimensional reconstruction device. Here, the computer functioning as the three-dimensional reconstruction device 1 may be a general-purpose computer, a dedicated computer, a workstation, a PC (Personal Computer), an electronic notepad, or the like. The program instructions may be program code, code segments, or the like for executing the required tasks.
 As shown in FIG. 4, the computer 100 includes a processor 110; a ROM (Read Only Memory) 120, a RAM (Random Access Memory) 130, and a storage 140 as storage units; an input unit 150; an output unit 160; and a communication interface (I/F) 170. These components are communicatively connected to one another via a bus 180.
 The ROM 120 stores various programs and various data. The RAM 130 temporarily stores programs or data as a work area. The storage 140 is configured by an HDD (Hard Disk Drive) or SSD (Solid State Drive) and stores various programs, including an operating system, and various data. In the present disclosure, the program according to the present disclosure is stored in the ROM 120 or the storage 140.
 The processor 110 is specifically a CPU (Central Processing Unit), MPU (Micro Processing Unit), GPU (Graphics Processing Unit), DSP (Digital Signal Processor), SoC (System on a Chip), or the like, and may be configured by a plurality of processors of the same or different types. The processor 110 reads a program from the ROM 120 or the storage 140 and executes it using the RAM 130 as a work area, thereby controlling each of the above components and performing various arithmetic processing. At least part of this processing may be realized by hardware.
 The program may be recorded on a recording medium readable by the three-dimensional reconstruction device 1. Using such a recording medium, the program can be installed in the three-dimensional reconstruction device 1. The recording medium on which the program is recorded may be a non-transitory recording medium. The non-transitory recording medium is not particularly limited, but may be, for example, a CD-ROM, a DVD-ROM, a USB (Universal Serial Bus) memory, or the like. The program may also be downloaded from an external device via a network.
 Regarding the above embodiments, the following appendices are further disclosed.
(Appendix 1)
A three-dimensional reconstruction device that performs three-dimensional reconstruction of an indoor space,
memory;
a controller connected to the memory;
with
The controller is
Input a panoramic image of the target indoor space and three-dimensional coordinates interpolated so as to connect the corners of the shooting target described in the drawing of the indoor space,
extracting an image feature amount from the panoramic image;
extracting a three-dimensional shape feature amount from the interpolated three-dimensional coordinates;
generating a feature amount by connecting the image feature amount and the three-dimensional shape feature amount;
generating boundary image coordinates by decoding the concatenated features;
A three-dimensional reconstruction device that performs three-dimensional reconstruction of the indoor space based on the image coordinates of the boundary.
(Appendix 2)
The controller is
3. The three-dimensional reconstruction device according to claim 1, wherein image feature amounts are extracted from the panorama image using HorizonNet.
(Appendix 3)
The controller is
3. The three-dimensional reconstruction apparatus according to item 1 or 2, wherein a three-dimensional shape feature amount is extracted from the interpolated three-dimensional coordinates using PointNet.
(Appendix 4)
The controller is
4. The three-dimensional reconstruction device according to any one of additional items 1 to 3, wherein the concatenated feature quantity is decoded using HorizonNet.
(Appendix 5)
A three-dimensional reconstruction method for three-dimensional reconstruction of an indoor space, comprising:
With a three-dimensional reconstruction device,
a step of inputting a panoramic image of a target indoor space and three-dimensional coordinates interpolated so as to connect the corners of the shooting target described in the drawing of the indoor space;
a step of extracting an image feature quantity from the panoramic image;
a step of extracting a three-dimensional shape feature quantity from the interpolated three-dimensional coordinates;
generating a feature amount by connecting the image feature amount and the three-dimensional shape feature amount;
generating image coordinates of boundaries by decoding the concatenated features;
performing three-dimensional reconstruction of the indoor space based on the image coordinates of the boundary;
A three-dimensional reconstruction method comprising:
(Appendix 6)
A non-temporary storage medium storing a computer-executable program, the non-temporary storage storing a program that causes the computer to function as the three-dimensional reconstruction device according to any one of appendices 1 to 4. medium.
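By way of non-limiting illustration, the processing flow of Appendices 1 to 5 can be sketched as follows in Python (PyTorch). This is a minimal sketch only: the class name ThreeDReconstructionPipeline, the layer choices, and all tensor dimensions are assumptions of this illustration and are not part of the disclosure, and simple convolutional stand-ins are used in place of the actual HorizonNet encoder/decoder and PointNet shape encoder.

import torch
import torch.nn as nn

class ThreeDReconstructionPipeline(nn.Module):
    """Minimal sketch of the encode-concatenate-decode flow (hypothetical)."""

    def __init__(self, img_feat_dim: int = 256, shape_feat_dim: int = 256, width: int = 1024):
        super().__init__()
        # Stand-in for the HorizonNet image encoder (cf. Appendix 2).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, img_feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, width)),  # collapse height, keep image columns
        )
        # Stand-in for the PointNet shape encoder (cf. Appendix 3):
        # a shared per-point MLP; max pooling over points follows in forward().
        self.shape_encoder = nn.Sequential(
            nn.Conv1d(3, 64, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(64, shape_feat_dim, kernel_size=1),
        )
        # Stand-in for the HorizonNet decoder (cf. Appendix 4): maps the
        # concatenated features to per-column boundary image coordinates.
        self.decoder = nn.Conv1d(img_feat_dim + shape_feat_dim, 2, kernel_size=1)

    def forward(self, panorama: torch.Tensor, interp_coords: torch.Tensor) -> torch.Tensor:
        # panorama: (B, 3, H, W) panoramic image; interp_coords: (B, 3, N)
        # 3D coordinates interpolated between the corners taken from the drawing.
        img_feat = self.image_encoder(panorama).squeeze(2)   # (B, C1, width)
        pt_feat = self.shape_encoder(interp_coords)          # (B, C2, N)
        global_shape = pt_feat.max(dim=2).values             # (B, C2) global shape feature
        # Broadcast the shape feature along the image columns, then concatenate.
        shape_cols = global_shape.unsqueeze(2).expand(-1, -1, img_feat.size(2))
        fused = torch.cat([img_feat, shape_cols], dim=1)     # (B, C1 + C2, width)
        return self.decoder(fused)                           # (B, 2, width) boundary coords

In this sketch, a PointNet-style global max pooling reduces the interpolated coordinates to a single shape vector, which is broadcast along the panorama's image columns before concatenation with the image feature; this mirrors the column-wise representation from which HorizonNet decodes ceiling and floor boundary positions.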
Although the above embodiments have been described as representative examples, it will be apparent to those skilled in the art that many modifications and substitutions can be made within the spirit and scope of the present disclosure. Therefore, the present invention should not be construed as being limited by the above embodiments, and various modifications and changes are possible without departing from the scope of the claims. For example, a plurality of the configuration blocks shown in the configuration diagrams of the embodiments may be combined into one, or a single configuration block may be divided.
1   three-dimensional reconstruction device
11  input unit
12  image encoding unit
13  shape encoding unit
14  concatenation operation unit
15  image coordinate decoding unit
16  three-dimensional reconstruction unit
21  panoramic image
22  three-dimensional coordinates (interpolated three-dimensional coordinates)
23  three-dimensional reconstruction
100 computer
110 processor
120 ROM
130 RAM
140 storage
150 input unit
160 output unit
170 communication interface (I/F)
180 bus

Claims (6)

  1.  A three-dimensional reconstruction device that performs three-dimensional reconstruction of an indoor space, comprising:
     an input unit for inputting a panoramic image of a target indoor space and three-dimensional coordinates interpolated so as to connect the corners of the object to be photographed described in a drawing of the indoor space;
     an image encoding unit that extracts an image feature amount from the panoramic image;
     a shape encoding unit that extracts a three-dimensional shape feature amount from the interpolated three-dimensional coordinates;
     a concatenation operation unit that generates a feature amount by concatenating the image feature amount and the three-dimensional shape feature amount;
     an image coordinate decoding unit that generates image coordinates of a boundary by decoding the concatenated feature amount; and
     a three-dimensional reconstruction unit that performs three-dimensional reconstruction of the indoor space based on the image coordinates of the boundary.
  2.  The three-dimensional reconstruction device according to claim 1, wherein the image encoding unit extracts the image feature amount from the panoramic image using HorizonNet.
  3.  The three-dimensional reconstruction device according to claim 1 or 2, wherein the shape encoding unit extracts the three-dimensional shape feature amount from the interpolated three-dimensional coordinates using PointNet.
  4.  The three-dimensional reconstruction device according to any one of claims 1 to 3, wherein the image coordinate decoding unit decodes the concatenated feature amount using HorizonNet.
  5.  A three-dimensional reconstruction method for performing three-dimensional reconstruction of an indoor space, the method comprising, by a three-dimensional reconstruction device:
     inputting a panoramic image of a target indoor space and three-dimensional coordinates interpolated so as to connect the corners of the object to be photographed described in a drawing of the indoor space;
     extracting an image feature amount from the panoramic image;
     extracting a three-dimensional shape feature amount from the interpolated three-dimensional coordinates;
     generating a feature amount by concatenating the image feature amount and the three-dimensional shape feature amount;
     generating image coordinates of a boundary by decoding the concatenated feature amount; and
     performing three-dimensional reconstruction of the indoor space based on the image coordinates of the boundary.
  6.  A program for causing a computer to function as the three-dimensional reconstruction device according to any one of claims 1 to 4.
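As a non-limiting illustration of the input recited in claim 1, the "three-dimensional coordinates interpolated so as to connect the corners" can, for example, be generated by linearly sampling points along the segments joining successive corner coordinates read from the drawing. The following is a minimal Python (NumPy) sketch under that assumption; the function name interpolate_corners, the per-edge sample count, and the closed-polygon traversal are illustrative choices of this sketch, not part of the claims.

import numpy as np

def interpolate_corners(corners: np.ndarray, points_per_edge: int = 64) -> np.ndarray:
    """Linearly interpolate 3D points so as to connect successive corners."""
    segments = []
    for i in range(len(corners)):
        start = corners[i]
        end = corners[(i + 1) % len(corners)]  # wrap around to close the polygon
        t = np.linspace(0.0, 1.0, points_per_edge, endpoint=False)[:, None]
        segments.append(start + t * (end - start))
    return np.concatenate(segments, axis=0)  # shape: (K * points_per_edge, 3)

# Example: four wall corners of a rectangular room, as read from a drawing.
corners = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0],
                    [4.0, 3.0, 0.0], [0.0, 3.0, 0.0]])
dense = interpolate_corners(corners)

The resulting dense point set would correspond to the interpolated three-dimensional coordinates (reference numeral 22) supplied to the shape encoding unit (reference numeral 13).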
PCT/JP2021/040164 2021-10-29 2021-10-29 Three-dimensional reconstruction device, three-dimensional reconstruction method and program WO2023073971A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023556075A JPWO2023073971A1 (en) 2021-10-29 2021-10-29
PCT/JP2021/040164 WO2023073971A1 (en) 2021-10-29 2021-10-29 Three-dimensional reconstruction device, three-dimensional reconstruction method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/040164 WO2023073971A1 (en) 2021-10-29 2021-10-29 Three-dimensional reconstruction device, three-dimensional reconstruction method and program

Publications (1)

Publication Number Publication Date
WO2023073971A1 (en) 2023-05-04

Family

ID=86157671

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/040164 WO2023073971A1 (en) 2021-10-29 2021-10-29 Three-dimensional reconstruction device, three-dimensional reconstruction method and program

Country Status (2)

Country Link
JP (1) JPWO2023073971A1 (en)
WO (1) WO2023073971A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017203709A1 (en) * 2016-05-27 2017-11-30 楽天株式会社 Three-dimensional model generation system, three-dimensional model generation method, and program
US20200302686A1 (en) * 2019-03-18 2020-09-24 Geomagical Labs, Inc. System and method for virtual modeling of indoor scenes from imagery

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zioulis, Nikolaos; Alvarez, Federico; Zarpalas, Dimitrios; Daras, Petros: "Single-shot cuboids: Geodesics-based end-to-end Manhattan aligned layout estimation from spherical panoramas", Image and Vision Computing, Elsevier, vol. 110, 18 March 2021, ISSN: 0262-8856, DOI: 10.1016/j.imavis.2021.104160 *

Also Published As

Publication number Publication date
JPWO2023073971A1 (en) 2023-05-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21962510

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023556075

Country of ref document: JP