CN113034675A - Scene model construction method, intelligent terminal and computer readable storage medium - Google Patents
Scene model construction method, intelligent terminal and computer readable storage medium
- Publication number
- CN113034675A (application number CN202110325406.7A)
- Authority
- CN
- China
- Prior art keywords
- voxel
- scene
- scene model
- nth
- neighborhood
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Abstract
The invention discloses a scene model construction method, an intelligent terminal and a computer-readable storage medium. The method comprises the following steps: when the Nth original depth image of the same scene is obtained, performing scene fusion on the Nth original scene model according to the Nth original depth image to obtain an Nth intermediate scene model; extracting neighborhood features of each voxel in the Nth intermediate scene model according to a preset extraction rule; calculating a voxel prediction value for each voxel according to its neighborhood features; and, for each voxel whose value in the Nth original scene model is not an observed voxel value, updating the Nth original scene model according to the voxel prediction value corresponding to the voxel to obtain the (N+1)th original scene model. By predicting voxel values from the neighborhood features of voxels, the method reduces holes in the subsequent model and improves the completeness of the model.
Description
Technical Field
The present invention relates to the field of scene model construction, and in particular, to a scene model construction method, an intelligent terminal, and a computer-readable storage medium.
Background
RGB is a color model, and an RGB image is an image with three color channels: red, green, and blue. In 3D computer graphics, a depth map (Depth Map) is an image or image channel containing information about the distance between the surfaces of scene objects and a viewpoint; each pixel value in the depth map represents the distance from the camera to an object. The RGB image and the depth map are usually registered to each other, so a one-to-one correspondence exists between their pixels. Because the depth map encodes the distance between each object and the camera, a scene can be reconstructed by combining the RGB image with the depth map.
Current scene reconstruction methods mainly move a depth camera around a scene to acquire depth maps from different viewpoints. Each depth map is then converted into a three-dimensional point cloud using the camera intrinsic parameters and the pixel depth values, and a normal vector is computed for each point. An Iterative Closest Point (ICP) algorithm, for example, iteratively minimizes the point-to-plane distance to compute the pose transformation between two frames, thereby solving for the camera pose of the current frame. The point cloud is fused into a Truncated Signed Distance Function (TSDF) model according to the current camera pose; the TSDF model surface, point cloud, and normal vectors under the current camera view angle are obtained by a reprojection algorithm, and ICP iteration against the next frame of data then solves for the next pose.
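The point-to-plane error that the ICP step minimizes can be written as follows; this is the standard formulation of point-to-plane ICP, stated here for clarity rather than quoted from the patent:

```latex
E(\mathbf{R}, \mathbf{t}) = \sum_{i} \big( (\mathbf{R}\,\mathbf{p}_i + \mathbf{t} - \mathbf{q}_i) \cdot \mathbf{n}_i \big)^2
```

where $\mathbf{p}_i$ are points of the current frame, $\mathbf{q}_i$ are their closest points on the target surface, $\mathbf{n}_i$ are the normals at those points, and $(\mathbf{R}, \mathbf{t})$ is the pose transformation being solved for.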
However, current depth cameras mainly measure object distances using Time of Flight (TOF), binocular stereo, or structured light techniques, so occluded areas often cannot be reconstructed because no depth data can be collected for them. Objects in a shooting scene frequently occlude one another, so in practice it is very difficult to guarantee that a depth camera scans every point on the indoor objects, and models reconstructed by current three-dimensional reconstruction methods therefore contain many holes. Taking an augmented reality system as an example, three-dimensional reconstruction for augmented reality usually scans a local region of the scene quickly, obtains a local three-dimensional model, and then uses that model for virtual-real rendering. Because the resulting model is usually incomplete, regions that were never reconstructed are inevitably observed as the camera moves, and at that point the augmented reality application cannot render a correct virtual-real occlusion relationship.
Disclosure of Invention
The invention mainly aims to provide a scene model construction method, an intelligent terminal and a computer-readable storage medium, so as to solve the problem that models produced by existing three-dimensional reconstruction techniques are prone to holes and therefore cannot accurately reflect the relationships between objects.
In order to achieve the above object, the present invention provides a scene model construction method, including the steps of:
when the Nth original depth image of the same scene is obtained, performing scene fusion on the Nth original scene model according to the Nth original depth image to obtain an Nth intermediate scene model, wherein N is a natural number less than or equal to the total number of original depth images, and, when N is equal to 1, the first original scene model is a preset blank scene model;
extracting neighborhood characteristics of each voxel in the Nth intermediate scene model according to a preset extraction rule;
calculating a voxel prediction value corresponding to each voxel according to the neighborhood characteristics;
and, for each voxel, when the voxel value corresponding to the voxel in the Nth original scene model is not an observed voxel value, updating the Nth original scene model according to the voxel prediction value corresponding to the voxel to obtain the (N+1)th original scene model.
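The four claimed steps can be sketched as a single per-frame update loop. The function names and the dictionary-based voxel representation below are illustrative assumptions, not part of the patent; the fusion, feature-extraction, and prediction steps are passed in as callables:

```python
def update_scene_model(model, depth_image, fuse, extract_features, predict):
    """One iteration of the claimed method. `model` maps voxel coordinates to
    voxel values (None = not yet observed); `fuse`, `extract_features`, and
    `predict` stand in for steps 1-3 of the claim."""
    # Step 1: fuse the Nth depth image into the Nth original scene model.
    intermediate = fuse(model, depth_image)
    # Steps 2-4: for voxels without an observed value, predict a value
    # from the voxel's neighborhood features and write it back.
    updated = dict(intermediate)
    for voxel, value in intermediate.items():
        if value is None:  # non-observed voxel value
            features = extract_features(intermediate, voxel)
            updated[voxel] = predict(features)
    return updated
```

The returned dictionary plays the role of the (N+1)th original scene model, ready for fusion with the next depth image.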
Optionally, in the scene model construction method, the step of, when the Nth original depth image of the same scene is obtained, performing scene fusion on the Nth original scene model according to the Nth original depth image to obtain the Nth intermediate scene model specifically includes:
when the Nth original depth image of the same scene is obtained, filtering the Nth original depth image to generate an Nth noise-reduced depth image;
calculating point clouds corresponding to all pixel points in the Nth noise reduction depth image according to camera internal parameters corresponding to the Nth original depth image to obtain a plurality of point clouds;
for each pixel in the Nth noise-reduced depth image, determining the normal vector of the point cloud point corresponding to the pixel according to the neighborhood point cloud corresponding to the pixel, wherein the neighborhood point cloud corresponds to the neighborhood pixels of the pixel;
and carrying out scene fusion on each voxel in the Nth scene model according to the normal vector of each point cloud to obtain an Nth intermediate scene model.
Optionally, in the method for constructing a scene model, for each voxel, a neighborhood characteristic corresponding to the voxel includes a scene characteristic of a neighborhood voxel of the voxel, where the neighborhood voxel is a voxel corresponding to a neighborhood pixel of a pixel point corresponding to the voxel.
Optionally, the scene model construction method, wherein the extracting, according to a preset extraction rule, neighborhood features of each voxel in the nth intermediate scene model specifically includes:
extracting scene features of each voxel in the Nth intermediate scene model according to a preset extraction rule;
for each voxel, screening the neighborhood voxels according to a preset screening rule and the coordinates of the neighborhood voxels corresponding to the voxel to obtain a target voxel;
and taking the scene characteristic corresponding to the target voxel as the neighborhood characteristic corresponding to the voxel.
Optionally, the method for constructing a scene model, where the calculating a voxel prediction value corresponding to each voxel according to the neighborhood characteristics specifically includes:
for each voxel, inputting the neighborhood characteristics corresponding to the voxel into a trained structured random forest model, and predicting the voxel value through the structured random forest model according to the input neighborhood characteristics to obtain a plurality of initial predicted values corresponding to the voxel;
calculating an error value corresponding to each initial predicted value according to a preset error loss function;
determining several intermediate predicted values from the initial predicted values according to the error values;
and calculating a voxel prediction value corresponding to the voxel according to the intermediate prediction value.
Optionally, the scene model construction method, wherein the training process of the structured random forest model includes:
acquiring training depth images of different scenes;
according to a preset sampling rule, screening pixel points in the training depth image to obtain training pixel points;
for each training pixel point, taking the voxel value corresponding to the pixel point as label data and the neighborhood voxel values corresponding to the pixel point as training data;
inputting the training data into a preset structured random forest model, and calculating a corresponding training predicted value according to the training data through the structured random forest model;
and according to the training predicted value and the label data, carrying out parameter adjustment on the structured random forest model until the structured random forest model is converged.
Optionally, in the method for constructing a scene model, the structured random forest model includes a plurality of decision trees; the method for calculating the training prediction value comprises the following steps of inputting the training data into a preset structured random forest model, and calculating the corresponding training prediction value according to the training data through the structured random forest model, wherein the method specifically comprises the following steps:
generating a plurality of training subsets according to the label data and the training data, wherein the number of the training subsets is the same as that of the decision trees;
for each decision tree, inputting the training data in its training subset into the decision tree, each node of the decision tree performing dimensionality reduction and clustering on the training data to obtain principal component values corresponding to the training data;
and determining the child node corresponding to the training data according to the principal component values until a leaf node is reached, thereby obtaining the prediction training value corresponding to the decision tree.
Optionally, the method for constructing a scene model, where the calculating a voxel prediction value corresponding to the voxel according to the intermediate prediction value specifically includes:
for each intermediate predicted value, determining a weight value corresponding to the intermediate predicted value according to the error value corresponding to it;
and calculating the voxel predicted value corresponding to the voxel according to the weight value corresponding to each intermediate predicted value.
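As an illustration of this weighting step, a minimal sketch using inverse-error weights — the patent does not fix the exact weight formula, so this is one plausible choice, with `eps` added only to avoid division by zero:

```python
def fuse_predictions(intermediate_values, error_values, eps=1e-9):
    """Combine per-tree intermediate predictions into one voxel value,
    weighting each prediction by the inverse of its error value."""
    weights = [1.0 / (e + eps) for e in error_values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, intermediate_values)) / total
```

Predictions with small error values dominate the result, while equally erroneous predictions are simply averaged.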
In addition, to achieve the above object, the present invention further provides an intelligent terminal, wherein the intelligent terminal includes: a memory, a processor and a scene model builder stored on the memory and executable on the processor, the scene model builder when executed by the processor implementing the steps of the scene model building method as described above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a scene model construction program, and the scene model construction program, when executed by a processor, implements the steps of the scene model construction method as described above.
The invention provides a scene model construction method, an intelligent terminal and a computer-readable storage medium. A voxel prediction value is calculated for each voxel based on its neighborhood features; the voxel prediction values are then fused with the Nth original scene model to obtain the (N+1)th original scene model. The original scene model is continuously updated and optimized during scanning, and when the voxel value of a voxel has not been directly observed, it is predicted from the voxel's neighborhood features, thereby filling hole areas in the initial model, reducing the number of holes, and increasing the completeness of the model.
Drawings
FIG. 1 is a flow chart of a preferred embodiment provided by the scene model construction method of the present invention;
FIG. 2 is a schematic diagram of a TSDF model according to a preferred embodiment of the present invention;
FIG. 3 is a view showing a scene model obtained by voxel prediction according to a preferred embodiment of the present invention;
fig. 4 is a schematic operating environment diagram of an intelligent terminal according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
The method for constructing the scene model according to the preferred embodiment of the invention can be executed by an intelligent terminal, and the intelligent terminal comprises a smart phone, a virtual reality technology terminal and other terminals. In this embodiment, a smart phone is taken as an example to describe a scene model building process. As shown in fig. 1, the scene model construction method includes the following steps:
step S100, when an Nth original depth image aiming at the same scene is obtained, scene fusion is carried out on an Nth original scene model according to the Nth original depth image, and an Nth intermediate scene model is obtained.
Specifically, the depth image may be captured by a depth camera, a binocular camera, or the like. Taking a depth camera as an example, the camera obtains the distance between objects in the environment and itself through time-of-flight technology, laser scanning technology, and the like, thereby producing a depth image. In operation, the depth camera usually shoots the same scene from multiple angles to obtain multiple depth images.
In this embodiment, taking a certain indoor scene as an example, when a plurality of depth images are captured in the indoor scene, the depth images are taken as original depth images, and the original depth images are sequentially acquired. And when the Nth original depth image is obtained, carrying out scene fusion on the corresponding Nth original scene model according to the Nth original depth image to obtain an Nth intermediate scene model.
In this embodiment, N is a natural number less than or equal to the total number of original depth images, and this embodiment will be described by taking the case where N is equal to 1 as an example. When N is equal to 1, the first original scene model is a preset blank scene model. The scene model adopted in this embodiment is a TSDF model, each voxel in the TSDF model corresponds to a pixel point of the original depth image, and the scene model corresponding to the original depth image is completed by calculating a voxel value of each voxel. When N is equal to 1, the TSDF model corresponding to the first original depth image is a scene model in which a voxel value of each voxel is empty. As shown in fig. 2, when N is greater than one, the nth original scene model may also be a scene model in which voxel values of partial voxels are already known.
Further, in the process of fusing the first original depth image with the first original scene model, the fusion may be implemented by the KinectFusion algorithm, the Kintinuous algorithm, the ElasticReconstruction offline scene model construction algorithm, and the like. The fusion process comprises the following steps:
filtering the Nth original depth image to generate an Nth noise reduction depth image;
calculating point clouds corresponding to all pixel points in the Nth noise reduction depth image according to camera internal parameters corresponding to the Nth original depth image to obtain a plurality of point clouds;
for each pixel in the Nth noise-reduced depth image, determining the normal vector of the point cloud point corresponding to the pixel according to its neighborhood point cloud;
and carrying out scene fusion on each voxel in the Nth scene model according to the normal vector of each point cloud to obtain an Nth intermediate scene model.
Specifically, each pixel in the first original depth image is filtered; the chosen filtering method may be bilateral filtering, smoothing filtering, and the like. Bilateral filtering is a non-linear filter that combines two Gaussian kernels: a kernel over the spatial Euclidean distance between pixels and a kernel over the difference of pixel depth values. When a pixel lies in a flat interior region where depth values change little, the depth difference is close to 0 and the weight of the depth-difference kernel approaches 1; the Euclidean distance kernel then dominates the bilateral filter, i.e., the original image is Gaussian-blurred. When a pixel lies in an edge region where depth values change sharply, the weight of the depth-difference kernel grows, ensuring that the geometric edge regions of objects in the image are not blurred even though the spatial Euclidean distance kernel weight is small. This embodiment adopts fast bilateral filtering, which is fast and simple and can run on a weak augmented reality terminal. Taking a pixel point p in the first original depth image as an example, the fast bilateral filtering of this pixel is:

$$d_0(p) = \frac{1}{S} \sum_{q \in N} c(\|p - q\|^2)\, s(\|d(q) - d(p)\|^2)\, d(q)$$

where $d(p)$ denotes the depth value of pixel p in the depth map; N denotes the set of pixels around p whose values may influence it, also called the neighborhood points of p; S is the number of pixels in the set N; and $d_0(p)$ is the filtered depth value of pixel p. $c(\|p - q\|^2)$ measures the geometric proximity between pixel p and its neighborhood point q, and $s(\|d(q) - d(p)\|^2)$ measures the similarity between the depth values of pixel p and its neighborhood point.
Based on the filtering processing, the first original depth image can be denoised to obtain a first denoised depth image.
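The bilateral filter described above can be sketched directly with two Gaussian kernels. The `sigma_s` and `sigma_r` values below are illustrative assumptions; this brute-force version omits the acceleration that makes the "fast" variant suitable for weak terminals:

```python
import numpy as np

def bilateral_filter_depth(depth, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Bilateral filter on a depth image: a spatial Gaussian c(.) times a
    depth-difference Gaussian s(.), so flat regions are smoothed while
    edges with large depth jumps are preserved."""
    h, w = depth.shape
    out = np.zeros_like(depth)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            patch = depth[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # c(.): spatial Euclidean distance kernel.
            spatial = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
            # s(.): depth-difference kernel; near zero across depth edges.
            range_w = np.exp(-((patch - depth[y, x]) ** 2) / (2 * sigma_r ** 2))
            weights = spatial * range_w
            out[y, x] = np.sum(weights * patch) / np.sum(weights)
    return out
```

On a flat region the output equals the input, while a sharp depth step (much larger than `sigma_r`) passes through untouched.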
The depth map provides the z-coordinate in the camera coordinate system, i.e., the distance in space between the camera and the object corresponding to each pixel. After the first noise-reduced depth image is obtained, a point cloud map in the camera's local coordinate system is computed from it using the intrinsic parameters of the depth camera. For a pixel with coordinates (u, v), the conversion formula is:

$$\mathbf{v}(u, v) = D_k(u, v)\, K_d^{-1}\, [u, v, 1]^T$$

where $K_d$ is the intrinsic parameter matrix of the depth camera, $[\cdot]^T$ denotes the transpose, $D_k(u, v)$ is the depth value of the pixel, and $\mathbf{v}(u, v)$ is the spatial coordinate corresponding to the pixel; each spatial coordinate corresponds to one point of the point cloud. Converting the coordinates of every pixel in the first noise-reduced depth image yields the full point cloud.
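The back-projection step can be sketched as follows; the intrinsic matrix values used in testing are illustrative, not taken from the patent:

```python
import numpy as np

def depth_to_point_cloud(depth, K):
    """Back-project every pixel (u, v) with depth D(u, v) to the 3D point
    p = D(u, v) * K^{-1} [u, v, 1]^T in the camera coordinate system."""
    h, w = depth.shape
    K_inv = np.linalg.inv(K)
    vv, uu = np.mgrid[0:h, 0:w]
    # Homogeneous pixel coordinates (u, v, 1), one row per pixel.
    pixels = np.stack([uu, vv, np.ones_like(uu)], axis=-1).reshape(-1, 3)
    rays = pixels @ K_inv.T          # unit-depth rays K^{-1} [u, v, 1]^T
    points = rays * depth.reshape(-1, 1)
    return points.reshape(h, w, 3)   # indexed as [v, u]
```

With a pinhole matrix whose principal point is at pixel (2, 2), that pixel at depth 1 maps to (0, 0, 1) on the optical axis.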
After the spatial coordinates are obtained, a normal vector is computed for each pixel from the three-dimensional coordinates of its neighborhood pixels, i.e., the pixels directly adjacent to it in the vertical and horizontal directions of the captured image. For a pixel with coordinates (u, v) in the first noise-reduced depth image, the normal vector formula adopted in this embodiment is:

$$\mathbf{n}(u, v) = \mathrm{normalize}\big( (\mathbf{v}(u+1, v) - \mathbf{v}(u, v)) \times (\mathbf{v}(u, v+1) - \mathbf{v}(u, v)) \big)$$

where normalize denotes normalization to unit length, $\mathbf{v}(u, v)$ is the spatial vector of the point cloud point corresponding to the pixel, and $\mathbf{v}(u+1, v)$ and $\mathbf{v}(u, v+1)$ are the spatial vectors of the point cloud points corresponding to its neighborhood pixels; the resulting normal vector of the point cloud point has length one. Further, if the three-dimensional coordinates corresponding to a pixel, or to its adjacent pixels, are invalid, the normal vector of that pixel is regarded as an outlier and removed.
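A sketch of this cross-product normal estimation over the point cloud map; border pixels and invalid-depth handling are omitted for brevity:

```python
import numpy as np

def estimate_normals(points):
    """Normal at (v, u) from the cross product of the vectors to the
    right and lower neighbors, normalized to unit length."""
    h, w, _ = points.shape
    normals = np.zeros_like(points)
    for v in range(h - 1):
        for u in range(w - 1):
            du = points[v, u + 1] - points[v, u]   # v(u+1, v) - v(u, v)
            dv = points[v + 1, u] - points[v, u]   # v(u, v+1) - v(u, v)
            n = np.cross(du, dv)
            norm = np.linalg.norm(n)
            if norm > 1e-12:                        # degenerate -> outlier
                normals[v, u] = n / norm
    return normals
```

For a point cloud lying on the plane z = 1, every interior normal comes out as (0, 0, 1), as expected.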
Then, based on the normal vector of each point cloud point, scene fusion is carried out by means of the KinectFusion algorithm or the like. This process is prior art and is not described here.
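For reference, the standard TSDF running-average update that such fusion performs per voxel — the common KinectFusion formulation, which the patent defers to prior art rather than specifying:

```python
def tsdf_update(tsdf, weight, observed_sdf, obs_weight=1.0, max_weight=100.0):
    """Blend a newly observed truncated signed distance into the stored
    voxel value as a weighted running average, capping the total weight."""
    new_tsdf = (tsdf * weight + observed_sdf * obs_weight) / (weight + obs_weight)
    new_weight = min(weight + obs_weight, max_weight)
    return new_tsdf, new_weight
```

Repeated observations thus converge the stored distance toward the measured surface while older measurements are gradually down-weighted by the cap.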
Step S200: extracting the neighborhood features of each voxel in the Nth intermediate scene model according to a preset extraction rule.
Specifically, an extraction rule is preset, and the extraction rule is mainly used for extracting the neighborhood characteristics of each voxel according to the attributes of the neighborhood voxels of the voxel, including distance characteristics, voxel value characteristics and the like. For each voxel, the neighborhood characteristic corresponding to the voxel comprises the scene characteristic of the neighborhood voxel of the voxel, and the neighborhood voxel is a voxel corresponding to a neighborhood pixel of the pixel point corresponding to the voxel.
In addition, in order to reduce unnecessary computation and increase computation speed, the neighborhood features extracted according to the extraction rule in this embodiment are features for performing certain screening on neighborhood voxels. Firstly, extracting scene characteristics of each voxel in the Nth intermediate scene model according to a preset extraction rule, wherein the scene characteristics refer to characteristics related to the scene model, such as a voxel value and a distance characteristic corresponding to the voxel. And then, aiming at each voxel, screening the neighborhood voxels according to a preset screening rule and the coordinates of the neighborhood voxels corresponding to the voxel to obtain a target voxel. The screening rule is mainly used for eliminating voxels with poor correlation with the voxel value of the voxel in the neighborhood. In this embodiment, the target voxel is a fixed-size cube, the x-axis of which coincides with the normal vector direction of the voxel, and the z-axis of which coincides with the z-axis direction of the world coordinate system. And finally, taking the scene characteristic corresponding to the target voxel as the neighborhood characteristic corresponding to the voxel.
Further, since the voxels whose values most need subsequent updating are those located on the scene surface, this embodiment preferably does not compute neighborhood features for every voxel. Instead, only the voxels corresponding to the vertices of the vertex map obtained by re-projecting the first original scene model into the virtual camera are selected, and only these voxels need to be processed when voxel prediction values are later computed, which reduces the amount of calculation and improves computational efficiency.
Step S300: calculating a voxel prediction value for each voxel according to the neighborhood features.
Specifically, for each voxel, a voxel prediction value corresponding to the voxel is calculated according to a neighborhood characteristic corresponding to the voxel. The voxel prediction value is a numerical value for predicting the voxel value of the voxel according to the neighborhood characteristic corresponding to the voxel.
In this embodiment, the prediction mode may be implemented by deep learning, machine learning, or the like. In this embodiment, a structured random forest model is used to calculate a voxel prediction value. The specific process comprises the following steps:
and A10, inputting the neighborhood characteristics corresponding to each voxel into the trained structured random forest model, and predicting the voxel value through the structured random forest model according to the input neighborhood characteristics to obtain a plurality of initial predicted values corresponding to the voxel.
Specifically, for each voxel (which may also be a voxel corresponding to a vertex), the neighborhood features of the voxel are input into the trained structured random forest model. Each decision tree in the structured random forest model calculates an initial predicted value from the input neighborhood features. Since a structured random forest model typically comprises several decision trees, several initial predicted values are obtained.
Further, when the structured random forest model is trained, a large number of training depth images are obtained first, and training pixel points in the training depth images are screened. And then aiming at each training pixel point, taking a voxel value corresponding to the pixel point as label data, taking a neighborhood voxel value corresponding to the pixel point as training data, inputting the training data into a preset structured random forest model, and calculating a corresponding training predicted value according to the training data through the structured random forest model.
Because each decision tree predicts the voxel value of a voxel from the scene features of that voxel's neighborhood voxels, two adjacent points in the vertex image usually contribute similarly to the prediction; it is therefore unnecessary to predict a voxel value for the voxel corresponding to every pixel of every training depth image. Training data helpful for prediction can be selected by a preset sampling rule. In this embodiment, pixels whose normal vectors point away from the depth camera and pixels whose normal vectors coincide with the positive z-axis of the world coordinate system are taken as elimination points, and the scene coordinates of the voxels corresponding to these elimination points are removed from the neighborhood features of the voxels, yielding the training data.
When the decision trees are trained, each piece of label data and its corresponding training data form a data pair. For each decision tree, a number of data pairs are selected from all of the data pairs as the training subset corresponding to that decision tree, so that a plurality of training subsets are obtained.
The present embodiment describes the process of training a decision tree by taking one piece of training data as an example. For each decision tree, the training data is input into the root node and classified layer by layer until a leaf node is reached; at each node the classification is performed according to the numerical values in the training data so that the features arriving at the same child node, that is, the neighborhood features, are as similar as possible. When classifying, this embodiment first randomly samples the dimensions of the three-dimensional training data, thereby reducing its dimensionality, and then divides the data into two temporary classes by clustering; the dimensionality reduction and clustering yield principal component values for the training data. The training data is then assigned to one of the two child nodes according to the sign of its largest principal component value. At the final leaf node, a predicted training value is output. Then, according to the predicted training value and the label data corresponding to the training data, the parameters of each node of the decision tree are adjusted until every decision tree has been trained and the structured random forest model converges, that is, training is complete.
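One node split of the kind described above can be sketched as follows: randomly sample a few feature dimensions, take the first principal component of the sub-sampled data, and route each sample to a child node by the sign of its projection. This is an illustrative reading of the patent's wording under stated assumptions (random dimension choice, PCA via SVD), not its exact procedure; `split_node` is a name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def split_node(features, n_sub=3):
    """Split the samples at one tree node.

    features : (n_samples, n_dims) neighborhood-feature matrix.
    Randomly sub-samples n_sub dimensions, computes the first principal
    component of the centered sub-matrix, and routes each sample by the
    sign of its projection. Returns (left_idx, right_idx, dims, direction).
    """
    dims = rng.choice(features.shape[1],
                      size=min(n_sub, features.shape[1]), replace=False)
    sub = features[:, dims]
    centered = sub - sub.mean(axis=0)
    # First principal component from the SVD of the centered sub-matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    left = np.flatnonzero(proj < 0)   # negative principal component value
    right = np.flatnonzero(proj >= 0)  # non-negative value
    return left, right, dims, vt[0]
```

Two well-separated clusters are sent to different children, which is exactly the "features of the child nodes are as similar as possible" behavior the text describes.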
And A20, calculating an error value corresponding to each initial predicted value according to a preset error loss function.
Specifically, since the calculation results of the decision trees differ, and some results deviate considerably from the actual geometric structure, an error value corresponding to each initial predicted value is calculated through a preset error loss function, so that the initial predicted values with large errors can be removed. In this embodiment, the three-dimensional point cloud corresponding to the first original depth image is calculated by re-projection, the three-dimensional points falling in the neighborhood voxels are found, and the following error loss function is adopted:

E = Σ<sub>p∈P</sub> |y(p)|

wherein E is the error value, P denotes the three-dimensional point cloud falling in the neighborhood voxels, p is a point of that point cloud, and y(p) is the initial predicted value at p.
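Since the re-projected points lie on the observed surface, the predicted signed distance there should be zero, so the magnitude of the prediction at each point counts as error. A minimal sketch, assuming a hypothetical callable `predict_tsdf` that maps (M, 3) points to (M,) TSDF values (the callable and function name are not from the patent):

```python
import numpy as np

def prediction_error(points, predict_tsdf):
    """Error value of one initial prediction, following the reconstructed
    loss E = sum over p in P of |y(p)|.

    points       : (M, 3) re-projected 3D points falling in the
                   neighborhood voxels.
    predict_tsdf : hypothetical callable returning the initial predicted
                   TSDF value y(p) for each point.
    """
    return float(np.sum(np.abs(predict_tsdf(np.asarray(points)))))
```

A tree whose prediction is nearly zero at every observed surface point gets a small error value and therefore survives the screening in the next step.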
A30, determining a plurality of intermediate predicted values in the initial predicted values according to the error value.
Specifically, the intermediate predicted values among the initial predicted values are determined according to the error values. One way is to sort the initial predicted values by the size of their error values to obtain an error sequence and then, according to a preset selection number (for example, 2), take that many initial predicted values in order as the intermediate predicted values. Alternatively, a preset error threshold may be used, and every initial predicted value whose error value is smaller than the threshold is taken as an intermediate predicted value.
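Both selection strategies from the text can be sketched in one helper. The function name and argument names are introduced here for illustration; exactly one of `k` / `threshold` is expected.

```python
import numpy as np

def select_intermediate(initial_values, errors, k=None, threshold=None):
    """Select intermediate predicted values as described above.

    k         : keep the k predictions with the smallest error values
                (the preset selection number).
    threshold : keep every prediction whose error value is below the
                preset error threshold.
    Returns (selected_values, selected_errors).
    """
    initial_values = np.asarray(initial_values, dtype=float)
    errors = np.asarray(errors, dtype=float)
    if k is not None:
        order = np.argsort(errors)[:k]  # smallest errors first
        return initial_values[order], errors[order]
    keep = errors < threshold
    return initial_values[keep], errors[keep]
```

The selected errors are kept alongside the values because the fusion step that follows weights each intermediate prediction by its error.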
And A40, calculating a voxel predicted value corresponding to the voxel according to the intermediate predicted value.
Specifically, after the intermediate predicted values are obtained, their number may be greater than one, so the corresponding voxel predicted value needs to be calculated from them, for example by a weighted average. In this embodiment, the weight applied to each intermediate predicted value is derived from the error value calculated above. The weighted average formula adopted in this embodiment is:
TSDF<sub>global</sub> = Σ<sub>i∈M</sub> w<sub>i</sub> · TSDF<sub>i</sub>;

wherein TSDF<sub>global</sub> is the voxel predicted value, M is the set of intermediate predicted values, TSDF<sub>i</sub> is the i-th intermediate predicted value, and w<sub>i</sub> = exp(-α·E<sub>i</sub>) is the weight corresponding to TSDF<sub>i</sub>, E<sub>i</sub> being the error value of the i-th intermediate predicted value. When the weighting parameter α equals zero, the formula reduces to a naive unweighted average; as α approaches infinity, the predicted values most consistent with the observed geometry receive the highest weight. In this embodiment, α is set to 100.
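The fusion formula can be sketched directly. One assumption is made beyond the text: the weights are normalized to sum to one, so that α = 0 yields a plain average of the intermediate values (the patent states only the exponential form w_i = exp(-αE_i)).

```python
import numpy as np

def fuse_predictions(tsdf_values, errors, alpha=100.0):
    """Weighted fusion of intermediate TSDF predictions.

    tsdf_values : intermediate predicted values TSDF_i.
    errors      : their error values E_i.
    alpha       : weighting parameter (100 in the embodiment).

    w_i = exp(-alpha * E_i), normalized here so the weights sum to one
    (normalization is an assumption, not stated in the patent).
    """
    tsdf_values = np.asarray(tsdf_values, dtype=float)
    w = np.exp(-alpha * np.asarray(errors, dtype=float))
    w = w / w.sum()
    return float(w @ tsdf_values)
```

With α = 0 the result is the plain mean; with a large α the lowest-error prediction dominates, matching the limiting behavior described above.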
Step S400, aiming at each voxel, when the voxel value corresponding to the voxel in the Nth original scene model is a non-voxel observation value, updating the Nth original scene model according to the voxel prediction value corresponding to the voxel to obtain an (N +1) th original scene model.
Specifically, after the voxel predicted value corresponding to each voxel is obtained, for each voxel, when the voxel value corresponding to that voxel in the Nth original scene model is a non-voxel observation value, the Nth original scene model is updated according to the voxel predicted value corresponding to the voxel, so as to obtain the (N+1)th original scene model. A non-voxel observation value is the complement of a voxel observation value: a voxel observation value is a voxel value determined directly from the depth image, rather than null, 1, -1 or another invalid value, or a voxel predicted value. That is, when the voxel value of a voxel is a voxel observation value, the voxel value is left unchanged; when it is a non-voxel observation value, the currently obtained voxel predicted value for that voxel is used as its voxel value to update the Nth original scene model, yielding the (N+1)th original scene model.
Each voxel is labeled with a label value, denoted P, which is used to judge the type of the voxel value corresponding to the voxel.
And when the voxel value of the voxel is null, +1 or-1, the label value P corresponding to the voxel is assigned to be 0.
When the voxel value of the voxel is a numerical value determined according to the three-dimensional point cloud, the label value P is assigned to be 1.
When the voxel value of the voxel is a value obtained by the foregoing prediction method, the label value P is assigned to be -1.
In the updating process, the label value corresponding to each voxel in the Nth original scene model is obtained. When the label value is 0 or -1, the predicted voxel value is used as the updated voxel value and the label value is set to -1; when the label value is 1, the voxel value is left unchanged; and when a voxel observation value exists for a voxel whose label value is -1 or 0, the voxel observation value is used as the voxel value and the label value is updated to 1. In this way, the predicted regions in the three-dimensional scene model are progressively eliminated, the model is refined with the real geometric data of the scene, and an accurate estimate of the scene surface is achieved. At the same time, the camera pose is estimated from the normal vector of each pixel point, completing the reconstruction of the scene model.
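The label-driven update rule above can be sketched for a single voxel. The function name and the `(value, label)` return convention are introduced here; the three branches follow the text's description of P = 1 / 0 / -1.

```python
def update_voxel(voxel_value, label, predicted_value, observed_value=None):
    """Label-driven update of one voxel, as read from the text.

    label = 1        : voxel observation value, never overwritten.
    label = 0 or -1  : take the new prediction and mark P = -1, unless a
                       fresh observation exists, in which case the real
                       observation wins and P becomes 1.
    Returns (new_value, new_label).
    """
    if observed_value is not None and label in (0, -1):
        return observed_value, 1   # real geometry replaces the prediction
    if label == 1:
        return voxel_value, 1      # keep the observed value unchanged
    return predicted_value, -1     # fill the hole with the predicted value
```

Applied over all voxels each frame, predictions fill holes immediately but are always superseded once the sensor actually observes the region, which is how "the predicted area in the scene three-dimensional model is eliminated".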
Subsequently, vertices and triangular patches can be extracted from the updated scene model by the Marching Cubes algorithm or the like, so as to obtain the reconstructed model.
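To illustrate how surface vertices fall out of a TSDF grid, the sketch below implements only the vertex-interpolation step of Marching Cubes along a single axis; a full implementation (for example `skimage.measure.marching_cubes`) also emits the triangular patches. The function name and the single-axis simplification are assumptions of this sketch.

```python
import numpy as np

def zero_crossings_z(tsdf, spacing=1.0):
    """Find, along the z axis only, grid edges where the TSDF changes sign
    and place a vertex on each edge by linear interpolation — the vertex
    step of Marching Cubes, simplified to one axis for illustration.

    tsdf : (X, Y, Z) array of signed distance values.
    Returns an (M, 3) array of vertex coordinates in grid units * spacing.
    """
    a, b = tsdf[:, :, :-1], tsdf[:, :, 1:]          # edge endpoints
    ix, iy, iz = np.nonzero(np.sign(a) != np.sign(b))
    # Linear interpolation parameter where the TSDF crosses zero.
    t = a[ix, iy, iz] / (a[ix, iy, iz] - b[ix, iy, iz])
    return np.stack([ix, iy, iz + t], axis=1) * spacing
```

An edge running from TSDF -1 to +1 yields a vertex at its midpoint, i.e. exactly on the estimated scene surface.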
As shown in fig. 3, the left image is the three-dimensional model obtained without the scene model construction method of this embodiment, and the right image is the three-dimensional model obtained with it; the dashed boxes mark hole regions. It is evident that the scene model construction method of this embodiment greatly reduces the number and extent of holes and achieves high accuracy.
Further, as shown in fig. 4, based on the above scene model construction method, the present invention also provides an intelligent terminal, which includes a processor 10, a memory 20, and a display 30. Fig. 4 shows only some of the components of the smart terminal, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 20 may be an internal storage unit of the intelligent terminal in some embodiments, such as a hard disk or a memory of the intelligent terminal. The memory 20 may also be an external storage device of the intelligent terminal in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, or the like provided on the intelligent terminal. Further, the memory 20 may include both an internal storage unit and an external storage device of the intelligent terminal. The memory 20 is used for storing application software installed on the intelligent terminal and various kinds of data, such as the program code of the installed applications, and may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a scene model building program 40, and the scene model building program 40 can be executed by the processor 10 so as to implement the scene model construction method of the present application.
The processor 10 may be a Central Processing Unit (CPU), a microprocessor or other data Processing chip in some embodiments, and is configured to run program codes stored in the memory 20 or process data, for example, execute the scene model building method, and the like.
The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the intelligent terminal and for displaying a visual user interface. The components 10-30 of the intelligent terminal communicate with each other via a system bus.
In one embodiment, when processor 10 executes scene model building program 40 in memory 20, the following steps are implemented:
when an Nth original depth image aiming at the same scene is obtained, carrying out scene fusion on an Nth original scene model according to the Nth original depth image to obtain an Nth intermediate scene model, wherein N is a natural number which is less than or equal to the total number of the original depth images, and when N is equal to 1, the first original scene model is a preset blank scene model;
extracting neighborhood characteristics of each voxel in the Nth intermediate scene model according to a preset extraction rule;
calculating a voxel prediction value corresponding to each voxel according to the neighborhood characteristics;
and for each voxel, when the voxel value corresponding to the voxel in the Nth original scene model is a non-voxel observation value, updating the Nth original scene model according to the voxel prediction value corresponding to the voxel to obtain an (N +1) th original scene model.
The present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a scene model construction program, which when executed by a processor implements the steps of the scene model construction method as described above.
Of course, it will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware (such as a processor, a controller, etc.), and the program can be stored in a computer readable storage medium, and the program can include the processes of the method embodiments described above when executed. The computer readable storage medium may be a memory, a magnetic disk, an optical disk, etc.
It is to be understood that the invention is not limited to the above-described embodiments, and that modifications and variations may be made by persons skilled in the art in light of the above teachings, and all such modifications and variations are intended to fall within the scope of the invention as defined in the appended claims.
Claims (10)
1. A scene model construction method is characterized by comprising the following steps:
when an Nth original depth image aiming at the same scene is obtained, carrying out scene fusion on an Nth original scene model according to the Nth original depth image to obtain an Nth intermediate scene model, wherein N is a natural number which is less than or equal to the total number of the original depth images, and when N is equal to 1, the first original scene model is a preset blank scene model;
extracting neighborhood characteristics of each voxel in the Nth intermediate scene model according to a preset extraction rule;
calculating a voxel prediction value corresponding to each voxel according to the neighborhood characteristics;
and for each voxel, when the voxel value corresponding to the voxel in the Nth original scene model is a non-voxel observation value, updating the Nth original scene model according to the voxel prediction value corresponding to the voxel to obtain an (N +1) th original scene model.
2. The method for constructing a scene model according to claim 1, wherein when an nth original depth image for a same scene is obtained, scene fusion is performed on the nth original scene model according to the nth original depth image to obtain an nth intermediate scene model, specifically including:
when an Nth original depth image aiming at the same scene is obtained, filtering the Nth original depth image to generate an Nth noise reduction depth image;
calculating point clouds corresponding to all pixel points in the Nth noise reduction depth image according to camera internal parameters corresponding to the Nth original depth image to obtain a plurality of point clouds;
for each pixel point in the Nth noise reduction depth image, determining a normal vector of the point cloud corresponding to the pixel point according to the neighborhood point clouds corresponding to the pixel point, wherein the neighborhood point clouds correspond to neighborhood pixels of the pixel point;
and carrying out scene fusion on each voxel in the Nth scene model according to the normal vector of each point cloud to obtain an Nth intermediate scene model.
3. The method according to claim 1, wherein for each of the voxels, the neighborhood feature corresponding to the voxel comprises a scene feature of a neighborhood voxel of the voxel, and the neighborhood voxel is a voxel corresponding to a neighborhood pixel of a pixel point corresponding to the voxel.
4. The method for constructing a scene model according to claim 1, wherein the extracting neighborhood characteristics of each voxel in the nth intermediate scene model according to a preset extraction rule specifically includes:
extracting scene features of each voxel in the Nth intermediate scene model according to a preset extraction rule;
aiming at each voxel, screening neighborhood voxels according to a preset screening rule and coordinates of the neighborhood voxels corresponding to the voxel to obtain a target voxel;
and taking the scene characteristic corresponding to the target voxel as the neighborhood characteristic corresponding to the voxel.
5. The method for constructing a scene model according to claim 1, wherein the calculating a voxel prediction value corresponding to each voxel according to the neighborhood characteristics specifically includes:
for each voxel, inputting the neighborhood characteristics corresponding to the voxel into a trained structured random forest model, and predicting the voxel value through the structured random forest model according to the input neighborhood characteristics to obtain a plurality of initial predicted values corresponding to the voxel;
calculating an error value corresponding to each initial predicted value according to a preset error loss function;
determining a number of intermediate predicted values of the initial predicted values according to the error value;
and calculating a voxel prediction value corresponding to the voxel according to the intermediate prediction value.
6. The method for constructing a scene model according to claim 5, wherein the training process of the structured random forest model comprises:
acquiring training depth images of different scenes;
according to a preset sampling rule, screening pixel points in the training depth image to obtain training pixel points;
aiming at each training pixel point, taking a voxel value corresponding to the pixel point as label data, and taking a neighborhood voxel value corresponding to the pixel point as training data;
inputting the training data into a preset structured random forest model, and calculating a corresponding training predicted value according to the training data through the structured random forest model;
and according to the training predicted value and the label data, carrying out parameter adjustment on the structured random forest model until the structured random forest model is converged.
7. The method of constructing a scene model according to claim 6, wherein the structured random forest model comprises a number of decision trees; and the inputting the training data into a preset structured random forest model and calculating a corresponding training predicted value according to the training data through the structured random forest model specifically comprises:
generating a plurality of training subsets according to the label data and the training data, wherein the number of the training subsets is the same as that of the decision trees;
inputting training data in the training subset into each decision tree, and performing dimensionality reduction and clustering on the training data at each node of the decision tree to obtain a principal component value corresponding to the training data;
and determining child nodes corresponding to the training data according to the principal component values until leaf nodes are reached, so as to obtain the predicted training value corresponding to the decision tree.
8. The method for constructing a scene model according to claim 5, wherein the calculating a voxel prediction value corresponding to the voxel according to the intermediate prediction value specifically includes:
aiming at each intermediate predicted value, determining a weight value corresponding to the intermediate predicted value according to an error value corresponding to the intermediate predicted value;
and calculating the voxel predicted value corresponding to the voxel according to the weight value corresponding to each intermediate predicted value.
9. An intelligent terminal, characterized in that, intelligent terminal includes: memory, a processor and a scene model builder stored on the memory and executable on the processor, the scene model builder when executed by the processor implementing the steps of the scene model building method as claimed in any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a scene model construction program which, when executed by a processor, implements the steps of the scene model construction method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110325406.7A CN113034675A (en) | 2021-03-26 | 2021-03-26 | Scene model construction method, intelligent terminal and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113034675A true CN113034675A (en) | 2021-06-25 |
Family
ID=76474178
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110325406.7A Pending CN113034675A (en) | 2021-03-26 | 2021-03-26 | Scene model construction method, intelligent terminal and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113034675A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104517289A (en) * | 2014-12-12 | 2015-04-15 | 浙江大学 | Indoor scene positioning method based on hybrid camera |
CN106803267A (en) * | 2017-01-10 | 2017-06-06 | 西安电子科技大学 | Indoor scene three-dimensional rebuilding method based on Kinect |
CN107292965A (en) * | 2017-08-03 | 2017-10-24 | 北京航空航天大学青岛研究院 | A kind of mutual occlusion processing method based on depth image data stream |
CN109215117A (en) * | 2018-09-12 | 2019-01-15 | 北京航空航天大学青岛研究院 | Flowers three-dimensional rebuilding method based on ORB and U-net |
CN110223383A (en) * | 2019-06-17 | 2019-09-10 | 重庆大学 | A kind of plant three-dimensional reconstruction method and system based on depth map repairing |
CN110827295A (en) * | 2019-10-31 | 2020-02-21 | 北京航空航天大学青岛研究院 | Three-dimensional semantic segmentation method based on coupling of voxel model and color information |
CN110874864A (en) * | 2019-10-25 | 2020-03-10 | 深圳奥比中光科技有限公司 | Method, device, electronic equipment and system for obtaining three-dimensional model of object |
Non-Patent Citations (2)
Title |
---|
强孙源等: "基于二值随机森林的非均匀光照QR码重构算法", 包装工程, vol. 40, no. 11, 30 June 2019 (2019-06-30), pages 232 - 237 * |
闫利;陈长海;费亮;张奕戈;: "密集点云的数字表面模型自动生成方法", 遥感信息, no. 05, 15 October 2017 (2017-10-15), pages 1 - 7 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114638954A (en) * | 2022-02-22 | 2022-06-17 | 深圳元戎启行科技有限公司 | Point cloud segmentation model training method, point cloud data segmentation method and related device |
CN114638954B (en) * | 2022-02-22 | 2024-04-19 | 深圳元戎启行科技有限公司 | Training method of point cloud segmentation model, point cloud data segmentation method and related device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Guerry et al. | Snapnet-r: Consistent 3d multi-view semantic labeling for robotics | |
US11494915B2 (en) | Image processing system, image processing method, and program | |
US9417700B2 (en) | Gesture recognition systems and related methods | |
CN110084304B (en) | Target detection method based on synthetic data set | |
US8610712B2 (en) | Object selection in stereo image pairs | |
EP2080167B1 (en) | System and method for recovering three-dimensional particle systems from two-dimensional images | |
CN110945565A (en) | Dense visual SLAM using probabilistic bin maps | |
CN112991413A (en) | Self-supervision depth estimation method and system | |
EP3326156B1 (en) | Consistent tessellation via topology-aware surface tracking | |
JP2006520055A (en) | Invariant viewpoint detection and identification of 3D objects from 2D images | |
EP3756163B1 (en) | Methods, devices, and computer program products for gradient based depth reconstructions with robust statistics | |
US20220415030A1 (en) | AR-Assisted Synthetic Data Generation for Training Machine Learning Models | |
CN112465021B (en) | Pose track estimation method based on image frame interpolation method | |
CN112767478B (en) | Appearance guidance-based six-degree-of-freedom pose estimation method | |
KR102223484B1 (en) | System and method for 3D model generation of cut slopes without vegetation | |
CN114581571A (en) | Monocular human body reconstruction method and device based on IMU and forward deformation field | |
US11138812B1 (en) | Image processing for updating a model of an environment | |
CN114170290A (en) | Image processing method and related equipment | |
CN113436251B (en) | Pose estimation system and method based on improved YOLO6D algorithm | |
CN113034675A (en) | Scene model construction method, intelligent terminal and computer readable storage medium | |
KR20230083212A (en) | Apparatus and method for estimating object posture | |
CN114241013B (en) | Object anchoring method, anchoring system and storage medium | |
CN115984583B (en) | Data processing method, apparatus, computer device, storage medium, and program product | |
Shao et al. | Appearance-based tracking and recognition using the 3D trilinear tensor | |
CN117392721A (en) | Face key point detection method based on multi-scale feature and offset prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||