CN115375836A - Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering - Google Patents

Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering

Info

Publication number
CN115375836A
CN115375836A (application CN202210910035.3A)
Authority
CN
China
Prior art keywords
determining
confidence
observation
fusion
reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210910035.3A
Other languages
Chinese (zh)
Inventor
贺飏
张双力
丛林
王成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yixian Advanced Technology Co ltd
Original Assignee
Hangzhou Yixian Advanced Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Hangzhou Yixian Advanced Technology Co ltd filed Critical Hangzhou Yixian Advanced Technology Co ltd
Priority to CN202210910035.3A
Publication of CN115375836A
Legal status: Pending

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
                • G06T 19/00 Manipulating 3D models or images for computer graphics
                    • G06T 19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
                • G06T 5/00 Image enhancement or restoration
                    • G06T 5/20 using local operators
                    • G06T 5/50 using two or more images, e.g. averaging or subtraction
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10 Image acquisition modality
                        • G06T 2207/10028 Range image; Depth image; 3D point clouds
                    • G06T 2207/20 Special algorithmic details
                        • G06T 2207/20212 Image combination
                            • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering. The method comprises: acquiring images; determining the multivariate confidences of each depth observation to be fused from the RGB information, pose information and corresponding depth map of each image frame, and merging the multivariate confidences into a merged confidence; for each observation, performing fusion according to its merged confidence; determining reconstruction points from the fusion result; and performing three-dimensional reconstruction based on the reconstruction points. The method and system solve the problem in the related art of low reconstruction accuracy for three-dimensional reconstruction of AR large-scene navigation, and improve the accuracy of AR large-scene three-dimensional reconstruction.

Description

Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering
Technical Field
The application relates to the technical field of three-dimensional reconstruction, in particular to a point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering.
Background
Dense scene reconstruction has long been a core problem of three-dimensional vision and plays an important role in applications such as Augmented Reality (AR). In AR applications, a real and immersive virtual-real fusion experience requires correctly handling the occlusion relationship between the real scene and virtual AR objects, correctly rendering shadows and similar effects, and reasonably placing virtual content and enabling its interaction with the real scene. Achieving these effects requires a real-time and accurate three-dimensional reconstruction of the scene.
Currently common three-dimensional reconstruction schemes such as KinectFusion and BundleFusion rely heavily on depth measurements provided by depth sensors. However, depth sensors remain uncommon because they are expensive and power-hungry, and are usually fitted only on a few high-end mobile devices. Real-time three-dimensional reconstruction from monocular multi-view images therefore has a very large application prospect: it can be used directly on existing smart devices without adding any sensor.
Multi-View Stereo (MVS) is a basic task in the field of computer vision. Its goal is to derive three-dimensional information about objects in the real environment from images captured by cameras together with the camera parameters. The basic principle is that images taken from different angles share common observed regions, and reasonable analysis and exploitation of the 2D correlation between different images is the basis of three-dimensional reconstruction: the 3D coordinates of a point in space can be recovered from 2D-2D correspondences by triangulation. The mainstream scheme recovers a depth map for each frame from a series of images and their corresponding poses, and then obtains a dense point cloud three-dimensional model from all the depth maps through a point cloud fusion algorithm. Depth map fusion approaches include voxel fusion (typically via the TSDF, truncated signed distance function) and point cloud fusion.
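To make this back-projection step concrete, the following Python sketch (illustrative only; the function name, the pinhole intrinsics convention and the toy values are assumptions, not code from this application) lifts a depth map with known intrinsics and pose into a 3D point cloud, which is the raw material that the depth map fusion step then operates on.

```python
import numpy as np

def backproject_depth_map(depth, K, T_w_cam):
    """Back-project a depth map into a 3D point cloud in world coordinates.

    depth   : (H, W) array of per-pixel depth values (0 marks invalid pixels)
    K       : (3, 3) pinhole intrinsic matrix
    T_w_cam : (4, 4) camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    # Pixel rays scaled by depth give camera-frame coordinates.
    pix = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=0)
    X_cam = (np.linalg.inv(K) @ pix) * depth[valid]
    # Homogeneous transform into the world frame.
    X_cam_h = np.vstack([X_cam, np.ones((1, X_cam.shape[1]))])
    return (T_w_cam @ X_cam_h)[:3].T          # (N, 3) point cloud

# Toy usage: a 4x4 depth map of a plane two metres in front of the camera.
K = np.array([[500.0, 0.0, 2.0], [0.0, 500.0, 2.0], [0.0, 0.0, 1.0]])
points = backproject_depth_map(np.full((4, 4), 2.0), K, np.eye(4))
print(points.shape)   # (16, 3)
```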
The TSDF is a surface reconstruction algorithm that uses structured point cloud data and represents the surface parametrically. Its core is to map the point cloud data into a predefined three-dimensional volume and to represent the region near the surface of the real scene with a truncated signed distance function: each voxel stores an implicit function value F, which is fitted incrementally through TSDF values and weights. After all voxels near the point cloud have been fitted, the voxels where F is 0 form the surface point cloud of the scene model. However, for three-dimensional reconstruction applied to large-scale AR experiences, the scene contains many irregular and complex elements (such as winding roads, vegetation cover, and artistic sculptures), so the volume that needs to be constructed is very large and discrete. In addition, since the depth of field of outdoor scenes often varies greatly, it is difficult to process the depth value of every pixel with a single reasonable truncation parameter.
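For comparison with the point cloud fusion adopted later, the following is a hedged sketch of one TSDF integration step as generically described above (a textbook formulation rather than any specific system; the flat voxel arrays, the +1 per-frame weighting and the truncation value are assumptions). It also makes the memory issue visible: the voxel arrays must cover the entire scene volume regardless of how sparsely it is observed.

```python
import numpy as np

def tsdf_integrate(tsdf, weight, voxel_centers, depth, K, T_cam_w, trunc=0.1):
    """One depth-map integration step of a plain TSDF volume (sketch).

    tsdf, weight  : flat (N,) arrays holding the running F values and fusion weights
    voxel_centers : (N, 3) voxel centres in world coordinates
    depth         : (H, W) depth map of the current frame
    """
    # Transform all voxel centres into the camera frame and project them.
    X_h = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    X_cam = (T_cam_w @ X_h.T)[:3]
    z = X_cam[2]
    uv = K @ X_cam
    z_safe = np.where(z > 1e-6, z, 1.0)          # avoid dividing by zero behind the camera
    u = np.round(uv[0] / z_safe).astype(int)
    v = np.round(uv[1] / z_safe).astype(int)
    H, W = depth.shape
    visible = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Signed distance between the measured surface and each visible voxel, along the ray.
    sdf = depth[v[visible], u[visible]] - z[visible]
    keep = sdf > -trunc                          # drop voxels far behind the surface
    idx = np.flatnonzero(visible)[keep]
    f = np.clip(sdf[keep], -trunc, trunc) / trunc
    # Weighted running average of the truncated signed distance (weight grows by 1 per frame).
    tsdf[idx] = (tsdf[idx] * weight[idx] + f) / (weight[idx] + 1.0)
    weight[idx] += 1.0
    return tsdf, weight
```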
The open-source algorithm libraries colmap and openMVS both adopt a point cloud fusion technique whose core is to average the 3D points obtained by back-projecting all depth maps that satisfy the reprojection-error check. Fig. 1 is a schematic diagram of the point cloud fusion of a classical open-source algorithm library according to the related art. As shown in Fig. 1, this method often leaves a large reconstruction error along the observation ray. For general scenarios it can recover the scene reasonably well, but with depth maps that have a large depth range, more noise, or dynamic objects, it often produces unsatisfactory results.
Table 1
[Table 1: comparison of mainstream commercial three-dimensional reconstruction software; table image not reproduced]
As can be seen from Table 1, the mainstream commercial three-dimensional reconstruction software on the market is mainly oriented to aerial photography and object-level/indoor reconstruction. In those settings the reconstructed objects have a reasonable depth range, a planned camera acquisition trajectory, and a controllable number of image frames. However, three-dimensional reconstruction of AR large scenes (hundreds of metres, kilometres, or even larger) typically involves images with a large depth of field and non-uniform local acquisition, all of which pose great challenges to reconstruction accuracy.
In terms of reconstruction accuracy: in a general depth fusion algorithm, the depth maps recovered visually contain many discrete points, the fused model has high uncertainty and often contains flying points, so it cannot be directly used for mesh reconstruction. Existing probabilistic fusion methods consider the difference in depth uncertainty caused by a 1-pixel matching error, but they are used to fuse multiple depth maps into a single high-confidence depth map and cannot directly reconstruct a dense point cloud. Methods based on implicit spatial representation via deep learning offer good reconstruction completeness and accuracy, but current academic frontier work such as NeuralRecon is mainly developed for indoor scenes; for outdoor scenes with larger depth they perform poorly and depend heavily on GPU computing power and video memory.
For the problem of low reconstruction accuracy in the related art for three-dimensional reconstruction of AR large-scene navigation, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the application provides a point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering, and at least solves the problem that in the related technology, the reconstruction precision is low for the three-dimensional reconstruction of AR large scene navigation.
In a first aspect, an embodiment of the present application provides a point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering, where the method includes:
acquiring images, determining the multivariate confidences of the depth observations to be fused according to the RGB information, pose information and corresponding depth map of each image frame, and merging the multivariate confidences to obtain a merged confidence;
for each observation, performing fusion according to the corresponding merging confidence; determining a reconstruction point according to the fusion result; and performing three-dimensional reconstruction based on the reconstruction point.
In some embodiments, the process of performing fusion and determining reconstruction points according to the fusion result includes: fusing the observations on each group of common-view frames onto the reference frame along the observation direction through Bayesian filtering, and determining reconstruction points according to the convergence of the fused point cloud distribution.
In some of these embodiments, the process of determining the merging confidence includes:
determining the geometric confidence of the depth observation to be fused according to the pose information of the image reference frame and the common-view frame; determining texture matching confidence of depth observation to be fused according to the RGB information of the image and the pose information of the reference frame and the common-view frame; determining semantic confidence of depth observation to be fused according to semantic information solved based on image RGB information;
and merging the geometric confidence, the texture matching confidence and the semantic confidence to obtain a merged confidence.
In some of these embodiments, before determining the confidence of the multivariate to be fused depth observations, the method comprises:
traversing pixels of a depth image of a reference frame to obtain a first pixel, and determining the coordinate of a target point of the first pixel under a coordinate system of the reference frame through back projection of the depth value of the first pixel to obtain a first coordinate;
determining the coordinate of the target point under a world coordinate system through pose transformation to obtain a second coordinate; and selecting one unoperated common-view frame of the reference frames, and determining the observation value according to the second coordinate.
In some of these embodiments, the process of determining the observed value includes:
determining the pixel position of the target point on the common-view frame through pose transformation and a projection equation according to the second coordinate to obtain a second pixel;
determining the coordinate of a target point of the second pixel in a world coordinate system through back projection of the depth value of the second pixel to obtain a third coordinate;
and determining the coordinates of the projection of the position corresponding to the third coordinate on the observation vector of the reference frame according to the first coordinate, the third coordinate and the optical center of the reference frame, and taking the result as the observation value.
In some of these embodiments, the process of determining the confidence of the multivariate of depth observations to be fused comprises:
determining the texture matching confidence of the depth observation to be fused and expressing it with a matching standard deviation;
determining a transformation matrix of relative poses according to pose information of the reference frame and the common-view frame, and determining the distance between the optical center of the reference frame and the optical center of the common-view frame according to the transformation matrix to obtain the optical center distance;
determining geometric confidence coefficient according to the square of the observation value, the focal length of the camera internal reference of the reference frame, the optical center distance and the matching standard deviation, and expressing the geometric confidence coefficient by using the geometric standard deviation;
and querying a semantic category label of a pixel to which each observation belongs, and determining the semantic confidence of the observation, wherein the standard deviation corresponding to the observation of the dynamic object is infinite.
In some embodiments, the determining the reconstruction point according to the convergence of the fused point cloud distribution comprises:
for Gaussian distribution, determining whether the standard deviation after fusion is smaller than a preset threshold value, and if so, determining convergence; if the convergence is achieved, determining a reconstruction point under a world coordinate system according to the average value of Gaussian distribution, and marking a merging position corresponding to the common-view frame;
and if the convergence is not achieved, continuously selecting one unoperated common-view frame of the reference frames, determining the observed value and fusing the observed value until all the common-view frames under the reference frames are operated.
In a second aspect, an embodiment of the present application provides a point cloud fusion three-dimensional reconstruction system based on multivariate confidence filtering, where the system includes:
the determining module is used for acquiring images, determining the multivariate confidence coefficients of the depth observation to be fused according to the RGB information, the pose information and the corresponding depth map of each frame of image, and merging the multivariate confidence coefficients to obtain a merged confidence coefficient;
the fusion module is used for executing fusion according to the corresponding fusion confidence coefficient of each observation; determining a reconstruction point according to the fusion result; and performing three-dimensional reconstruction based on the reconstruction point.
In a third aspect, an embodiment of the present application provides an electronic apparatus, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to execute the point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering.
In a fourth aspect, the present application provides a storage medium, in which a computer program is stored, where the computer program is configured to execute the point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering when the computer program runs.
Compared with the prior art, for the problem of low reconstruction accuracy in three-dimensional reconstruction for AR large-scene navigation, the embodiments of the present application acquire images, determine the multivariate confidences of the depth observations to be fused according to the RGB information, pose information and corresponding depth map of each image frame, and merge the multivariate confidences to obtain a merged confidence; for each observation, fusion is performed according to its corresponding merged confidence; reconstruction points are determined according to the fusion result, and three-dimensional reconstruction is performed based on the reconstruction points. Because the influence of confidence is taken into account, reconstruction accuracy is better guaranteed, the problem of low reconstruction accuracy for three-dimensional reconstruction of AR large-scene navigation in the related art is solved, and the accuracy of AR large-scene three-dimensional reconstruction is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of a classical open source algorithm library point cloud fusion according to the related art;
FIG. 2 is a schematic diagram of an application environment of a point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering according to a first embodiment of the present application;
FIG. 4 is a schematic illustration of the preceding steps of determining a confidence level of a depth observation to be fused according to a second embodiment of the application;
FIG. 5 is a geometric schematic of a method of parameterizing 3D point fusion into 1D observations on rays according to a second embodiment of the present application;
FIG. 6 is a schematic illustration of a process of determining a merge confidence according to a third embodiment of the present application;
FIG. 7 is a schematic diagram of a transmission process of a point cloud fused three-dimensional reconstruction model according to a third embodiment of the present application;
FIG. 8 is a schematic diagram of a process of determining a reconstruction point according to convergence of a fused point cloud distribution according to a fourth embodiment of the present application;
fig. 9 is a schematic diagram of a three-dimensional reconstruction effect of a colmap AR large scene according to the related art;
FIG. 10 is a schematic diagram of the three-dimensional reconstruction effect of the AR large scene according to the embodiment of the application;
fig. 11 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless otherwise defined, technical or scientific terms referred to herein should have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The use of the terms "a" and "an" and "the" and similar referents in the context of describing the invention (including a single reference) are to be construed in a non-limiting sense as indicating either the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "and/or" describes the association relationship of the associated object, indicating that there may be three relationships, for example, "a and/or B" may indicate: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering provided by the application can be applied to an application environment shown in fig. 2, fig. 2 is an application environment schematic diagram of the point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering according to the embodiment of the application, and as shown in fig. 2, a terminal 202 and a server 204 communicate through a network. The server 204 acquires the images through the terminal 202, the server 204 determines the confidence coefficients of the multiple elements of the depth observation to be fused according to the RGB information, the pose information and the corresponding depth map of each frame of image, and the confidence coefficients of the multiple elements are merged to obtain a merged confidence coefficient; for each observation, the server 204 performs fusion according to its corresponding merge confidence; and determining a reconstruction point according to the fusion result, and performing three-dimensional reconstruction based on the reconstruction point. The terminal 202 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 204 may be implemented by an independent server or a server cluster composed of a plurality of servers.
The application provides a point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering, which can perform high-precision point cloud fusion of an AR large scene, and fig. 3 is a schematic diagram of the point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering according to the first embodiment of the application, and as shown in fig. 3, the process comprises the following steps:
step S301, acquiring images, determining multivariate confidence coefficients of depth observation to be fused according to RGB information, pose information and corresponding depth maps of each frame of image, and merging the multivariate confidence coefficients to obtain a merged confidence coefficient;
For example, the input is the pose information of each image frame and its corresponding depth map, and the output is the scene point cloud reconstruction. In the intermediate process, the following three confidences are computed: first, the geometric confidence of the depth observation to be fused is solved from the pose information of the reference frame and the common-view frame; second, the texture matching confidence of the depth observation to be fused is solved from the RGB image information and the pose information of the reference frame and the common-view frame; third, the semantic confidence is obtained from semantic information computed from the RGB image. For each observation to be fused, the algorithm merges these confidences into a merged confidence, so that each observation carries a depth value and a confidence (or uncertainty) value;
step S302, for each observation, executing fusion according to the corresponding merging confidence; determining a reconstruction point according to the fusion result; performing three-dimensional reconstruction based on the reconstruction point;
Optionally, the observations on each group of common-view frames may be fused onto the reference frame along the observation direction through Bayesian filtering, and reconstruction points are determined according to the convergence of the fused point cloud distribution. The final fusion result represents each fused 3D point by a probability distribution, and the convergence of the distribution indicates which points are high-quality reconstruction points and which are discrete outliers, thereby guaranteeing reconstruction accuracy.
Through steps S301 to S302, compared with the low reconstruction accuracy of AR large-scene navigation three-dimensional reconstruction in the related art, the embodiment of the present application acquires images, determines the multivariate confidences of the depth observations to be fused according to the RGB information, pose information and corresponding depth map of each image frame, and merges them into a merged confidence; for each observation, fusion is performed according to its corresponding merged confidence; reconstruction points are determined according to the fusion result, and three-dimensional reconstruction is performed based on the reconstruction points. Because the influence of confidence is taken into account, reconstruction accuracy is better guaranteed, the problem of low reconstruction accuracy for three-dimensional reconstruction of AR large-scene navigation in the related art is solved, and the accuracy of AR large-scene three-dimensional reconstruction is improved.
In some embodiments, before determining the multivariate confidences of the depth observation to be fused, the method further includes selecting a fusion prior, selecting a common-view frame, and determining the observation value of the point to be fused. Fig. 4 is a schematic diagram of the steps preceding the determination of the confidence of the depth observation to be fused according to the second embodiment of the present application. As shown in Fig. 4, the process includes the following steps:
step S401, traversing pixels of a depth map of a reference frame to obtain a first pixel, and determining coordinates of a target point of the first pixel in a coordinate system of the reference frame through back projection of the depth value of the first pixel to obtain first coordinates;
For example, a pixel (u, v) of the depth map of the reference frame ref is traversed, the depth value depth corresponding to the pixel (u, v) is read, and the depth value is back-projected to a 3D point X_ref in the coordinate system of the reference frame according to Equation 1:
X_ref = π^{-1}(u, v, depth)    (Equation 1)
Step S402, determining the coordinate of the target point in a world coordinate system through pose transformation to obtain a second coordinate;
For example, the point is transformed into the world coordinate system through the camera pose T_{w-ref} according to Equation 2, giving the 3D point X_w:
X_w = T_{w-ref} · X_ref = T_{w-ref} · π^{-1}(u, v, depth)    (Equation 2)
Step S403, selecting one unoperated common-view frame of the reference frames, and determining the pixel position of the target point on the common-view frame through pose transformation and a projection equation according to the second coordinate to obtain a second pixel;
For example, an unoperated common-view frame src of the reference frame ref is selected, and the following operations of steps S403 to S405 are performed. The 3D point X_w is transformed into the coordinate system of the common-view frame src through the pose T_{src-w}, and its pixel position (u', v') on the common-view frame is obtained through the projection equation according to Equation 3:
(u', v') = π(T_{src-w} · X_w)    (Equation 3)
step S404, determining the coordinate of the target point of the second pixel in the world coordinate system through the depth value back projection of the second pixel, and obtaining a third coordinate;
For example, according to Equation 4, the depth value d_src of the pixel in the common-view frame is read at the pixel position (u', v'):
d_src = depthMap_src(u', v')    (Equation 4)
According to Equation 5, the depth value d_src is back-projected into the world coordinate system to obtain the 3D point X_src:
X_src = T_{w-src} · π^{-1}(u', v', d_src)    (Equation 5)
Step S405, determining the coordinates of the projection of the position corresponding to the third coordinate on the observation vector of the reference frame according to the first coordinate, the third coordinate and the optical center of the reference frame, and taking the result as the observation value;
For example, the observation vector on the current reference frame can be represented by (X_ref - C_ref), where C_ref denotes the optical centre of the reference frame; the observation vector induced by the current common-view frame can be represented by (X_src - C_ref). The angle between the two vectors is solved according to Equation 6:
cos(θ) = ((X_ref - C_ref) · (X_src - C_ref)) / (||X_ref - C_ref|| · ||X_src - C_ref||)    (Equation 6)
Finally, the 3D position computed from the common-view frame is projected onto the observation vector (X_ref - C_ref) of the reference frame according to Equation 7, and the result is taken as the posterior observation d_obs:
d_obs = ||X_src - C_ref|| · cos(θ)    (Equation 7)
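The following Python sketch strings Equations 1 to 7 together for a single reference pixel (the helper names, the 3×3 intrinsics convention and the nearest-pixel depth lookup are assumptions made for illustration): it back-projects the reference pixel, re-observes the point through one common-view frame, and projects the resulting 3D position back onto the reference observation ray to obtain the posterior observation d_obs.

```python
import numpy as np

def backproject(K, u, v, d):
    """pi^{-1}: pixel (u, v) with depth d -> 3D point in the camera frame."""
    return (np.linalg.inv(K) @ np.array([u, v, 1.0])) * d

def project(K, X_cam):
    """pi: 3D point in the camera frame -> pixel coordinates (u, v)."""
    uv = K @ X_cam
    return uv[0] / uv[2], uv[1] / uv[2]

def transform(T, X):
    """Apply a 4x4 rigid transform to a 3D point."""
    return (T @ np.append(X, 1.0))[:3]

def observation_on_ray(u, v, depth_ref, K_ref, K_src, T_w_ref, T_w_src, depth_map_src):
    """Posterior observation d_obs of one reference pixel through one common-view frame."""
    # Equations 1-2: back-project the reference pixel and express it in world coordinates.
    X_ref = backproject(K_ref, u, v, depth_ref)
    X_w = transform(T_w_ref, X_ref)
    # Equation 3: project the world point into the common-view frame.
    u_s, v_s = project(K_src, transform(np.linalg.inv(T_w_src), X_w))
    ui, vi = int(round(u_s)), int(round(v_s))
    if not (0 <= vi < depth_map_src.shape[0] and 0 <= ui < depth_map_src.shape[1]):
        return None                                 # the point falls outside the common-view frame
    # Equations 4-5: read the depth observed by the common-view frame and back-project it.
    d_src = depth_map_src[vi, ui]
    X_src = transform(T_w_src, backproject(K_src, u_s, v_s, d_src))
    # Equations 6-7: project X_src onto the observation ray of the reference frame.
    C_ref = T_w_ref[:3, 3]                          # optical centre of the reference frame
    ray = X_w - C_ref                               # (X_ref - C_ref) expressed in world coordinates
    vec = X_src - C_ref
    cos_theta = ray @ vec / (np.linalg.norm(ray) * np.linalg.norm(vec))
    return np.linalg.norm(vec) * cos_theta          # posterior observation d_obs
```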
In some embodiments, Fig. 5 is a geometric schematic diagram of the method for parameterizing 3D point fusion into a 1D observation value on a ray according to the second embodiment of the present application. As shown in Fig. 5, the embodiment of the present application models the position of the 3D point to be fused as a distribution on the observation vector of the reference frame, so that a three-dimensional distribution problem is transformed into a one-dimensional one, which greatly simplifies the computational complexity and improves efficiency. Optionally, the distribution can be expressed as a Gaussian, so that it is represented by two parameters, the mean and the variance, see Equation 8:
N(x | μ, σ²)    (Equation 8)
Choosing a reasonable error distribution hypothesis such as the Gaussian distribution greatly accelerates the computation; of course, other distributions may be used instead, such as a Gaussian mixture model, a Beta-Gaussian mixture model, or a Uniform-Gaussian mixture model.
Specifically, fig. 6 is a schematic diagram of a process for determining merging confidence according to a third embodiment of the present application, and as shown in fig. 6, the process includes the following steps:
step S601, determining texture matching confidence of depth observation to be fused according to RGB information of the image and pose information of the reference frame and the common-view frame;
the confidence of the texture matching can be realized in different modes, even a deep learning method, as long as the local texture of the pixel can be characterized more effectively to provide the matching capability, for example, for the confidence of the texture matching, the matching standard deviation e is used d To express, solve for e d
A pair of matching points x, x' on the reference frame and the common view frame conform to epipolar geometric constraint and can be represented by a basic matrix F according to formula 9;
x F x' =0 formula 9
Solving epipolar line 1 on the reference frame through the basis matrix according to formula 10;
l=F T x' equation 10
The epipolar line can be decomposed into its direction components, and the matching point on the reference frame must lie on the epipolar line; this is the epipolar geometric constraint. If the texture gradient direction at the matching point is consistent with the epipolar line direction, good matching accuracy can be achieved; however, once these two directions are perpendicular, the matching accuracy is poor. According to Equations 11 and 12, the gradients G_x and G_y of the image I in the x and y directions can be computed with the Sobel operator:
G_x = K_x ∗ I,  K_x = [[-1, 0, +1], [-2, 0, +2], [-1, 0, +1]]    (Equation 11)
G_y = K_y ∗ I,  K_y = [[-1, -2, -1], [0, 0, 0], [+1, +2, +1]]    (Equation 12)
calculating the magnitude G of the gradient from the gradients Gx and Gy in the x and y directions according to equation 13;
G = √(G_x² + G_y²)    (Equation 13)
The direction angles of the epipolar line and of the gradient are computed according to Equations 14 and 15:
α = atan(l_y / l_x)    (Equation 14)
β = atan(G_y / G_x)    (Equation 15)
The texture matching standard deviation e_d of the observation is then defined from the gradient direction and the epipolar direction according to Equation 16. When the gradient direction and the epipolar direction are roughly consistent, the standard deviation stays at a low value, and the larger the gradient magnitude G is, the smaller the standard deviation; when the gradient direction deviates towards being perpendicular to the epipolar direction, a larger value is assigned, set here to 10, representing a matching error of 10 pixels:
e_d = e_d(α, β, G)    (Equation 16; the formula image is not reproduced: e_d decreases with increasing G when the gradient and epipolar directions agree, and is set to 10 when they are nearly perpendicular)
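As an illustration of Equations 9 to 16, one possible way to turn the gradient and epipolar directions into a matching standard deviation is sketched below (the Sobel kernels via scipy, the 60° threshold and the 1/G falloff are assumptions; the application only fixes the qualitative behaviour and the 10-pixel cap):

```python
import numpy as np
from scipy.ndimage import convolve

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def texture_matching_std(gray, l_x, l_y, u, v, perp_threshold_deg=60.0):
    """Matching standard deviation e_d at pixel (u, v) of the reference frame.

    gray     : (H, W) grayscale image
    l_x, l_y : direction components of the epipolar line l = F^T x' (Equation 10)
    """
    Gx = convolve(gray.astype(float), SOBEL_X)      # Equation 11
    Gy = convolve(gray.astype(float), SOBEL_Y)      # Equation 12
    G = np.hypot(Gx[v, u], Gy[v, u])                # Equation 13, gradient magnitude
    alpha = np.arctan2(l_y, l_x)                    # Equation 14, epipolar direction
    beta = np.arctan2(Gy[v, u], Gx[v, u])           # Equation 15, gradient direction
    # Angle between the two directions, folded into [0, 90] degrees.
    diff = np.degrees(abs(alpha - beta)) % 180.0
    diff = min(diff, 180.0 - diff)
    if diff > perp_threshold_deg or G < 1e-6:
        return 10.0                                 # nearly perpendicular: 10-pixel matching error
    return 1.0 / G                                  # aligned: stronger texture, smaller error
```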
Step S602, determining the geometric confidence of the depth observation to be fused according to the pose information of the image reference frame and the common-view frame. For example, the geometric confidence is expressed by the geometric standard deviation e_z, which is solved as follows.
According to formula 17, for the reference frame and the common-view frame, the relative pose can be solved through their poses;
T_{ref-src} = T_{ref-w} · (T_{src-w})^{-1}    (Equation 17)
According to Equation 18, a 3-dimensional vector can be extracted from the last column of the 4 × 4 transformation matrix of the relative pose, and its modulus gives the baseline length b; the baseline length is also the distance between the optical centres of the reference frame and the common-view frame:
b = ||T_{ref-src}[0:3, 3]||    (Equation 18)
The geometric standard deviation e_z is solved according to Equation 19, where f is the focal length of the camera intrinsics of the reference frame:
e_z = (d_obs² / (b · f)) · e_d    (Equation 19)
from equation 19, the square of the geometric standard deviation (d) to the observed distance obs Square of (d) is positive, indicating that the farther away the point is, the worse the reconstruction accuracy is; the geometric standard deviation is inversely related to the base line length b, and the larger the base line is, the better the 3D position can be triangulated; the geometric standard deviation is negatively correlated with the focal length f of the reference frame camera internal reference, which shows that the higher the resolution is, the finer the image can be reconstructed; finally, the geometric standard deviation and the matching standard deviation e calculated in the last step d Positive correlation, so that formula 19 fuses the texture matching confidence and the geometric confidence together;
Step S603, determining the semantic confidence of the depth observation to be fused according to semantic information computed from the image RGB information. Semantic segmentation is classification at the pixel level: pixels belonging to the same class are grouped into one class, so semantic segmentation understands the image from the pixel level. For example, pixels belonging to people form one class, pixels belonging to a wall surface form another class, and elevator pixels form yet another class. Current semantic segmentation techniques based on deep learning can segment scenes with high accuracy and controllable computational cost;
For example, the semantic confidence is expressed through the finally fused standard deviation e_f, which is solved as follows:
A semantic map is computed for the RGB image of the common-view frame. The solution of the semantic map does not depend on a specific method, and any current state-of-the-art scheme is applicable; optionally, it can be computed with the classical deep learning FCN (Fully Convolutional Networks). The computed semantic map has the same resolution as the image, and the semantic class label of a pixel can be queried at the pixel position (u', v') according to Equation 20:
label = semanticMap(u', v')    (Equation 20)
The semantic category label is an integer value which is bound with unique semantic information; all semantic class labels can be divided into two types, namely static object type and dynamic object type; such as: people, vehicles and airplanes belong to dynamic objects; walls, floors, buildings, etc. belong to static objects;
According to Equation 21, a semantic confidence value can be assigned to each observation by querying the semantic class label of the pixel it belongs to: observations belonging to static objects inherit the previously computed standard deviation, while observations belonging to dynamic objects are assigned an infinite standard deviation and are not fused:
e_f = e_z, if label ∈ static objects;  e_f = ∞, if label ∈ dynamic objects    (Equation 21)
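A minimal sketch of Equations 20 and 21 (the label sets and the integer-to-name lookup are illustrative; the application only distinguishes static from dynamic classes):

```python
import numpy as np

DYNAMIC_LABELS = {"person", "car", "airplane"}    # illustrative dynamic classes
STATIC_LABELS = {"wall", "floor", "building"}     # illustrative static classes

def semantic_std(semantic_map, id_to_name, u, v, e_z):
    """Final fused standard deviation e_f of the observation at pixel (u', v')."""
    label = id_to_name[semantic_map[v, u]]        # Equation 20: query the class label
    # Equation 21: dynamic observations get an infinite standard deviation and are
    # never fused; static observations keep the previously computed standard deviation.
    return np.inf if label in DYNAMIC_LABELS else e_z
```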
and step S604, merging the geometric confidence coefficient, the texture matching confidence coefficient and the semantic confidence coefficient to obtain a merged confidence coefficient.
Through steps S601 to S604: in the related art, AR large-scene venues often contain a large number of dynamic objects such as pedestrians and cars, but existing commercial software does not consider their influence. The embodiment of the present application models the problem from the three aspects of texture confidence, semantic confidence, and geometric confidence of three-dimensional reconstruction, and provides a fusion method based on multiple confidences. The method can effectively cope with different scenes, especially the complex scenes required by AR large-scene applications such as shopping malls and parks.
In some embodiments, Fig. 7 is a schematic diagram of the processing flow of the point cloud fusion three-dimensional reconstruction model according to the third embodiment of the present application. As shown in Fig. 7, the flow of the point cloud fusion three-dimensional reconstruction model based on multivariate confidence filtering takes the camera poses, RGB images and depth maps to be fused as input, determines the geometric confidence, matching confidence and semantic confidence, performs multi-confidence fusion, and finally outputs the scene point cloud reconstruction.
In some embodiments, fig. 8 is a schematic diagram of a process of determining a reconstruction point according to convergence of a post-fusion point cloud distribution according to a fourth embodiment of the present application, and as shown in fig. 8, the process includes the following steps:
step S801, determining whether the fused standard deviation is smaller than a preset threshold value or not for Gaussian distribution, and if so, determining convergence;
For example, after obtaining the observation values and their merged standard deviations, for a Gaussian distribution the fusion can be performed using Equation 22, the standard Bayesian update of a Gaussian estimate N(μ, σ²) with a new observation d_obs of standard deviation e_f:
μ' = (e_f² · μ + σ² · d_obs) / (σ² + e_f²),  σ'² = (σ² · e_f²) / (σ² + e_f²)    (Equation 22)
If the fused standard deviation is smaller than the threshold, convergence is declared and the iterative update stops;
step S802, if the convergence is achieved, determining a reconstruction point under a world coordinate system according to the mean value of Gaussian distribution, and marking a merging position corresponding to the common-view frame; if not, continuing to select one of the unoperated common-view frames of the reference frame, determining the observation value and fusing the observation value until all the common-view frames under the reference frame are operated;
For example, if convergence is reached, the mean μ' of the current Gaussian distribution is substituted into Equation 23 to obtain the reconstruction point X_new in the world coordinate system:
X_new = T_{w-ref} · π^{-1}(u, v, μ')    (Equation 23)
The final reconstructed point cloud is formed by all the reconstruction points obtained through the above steps. On the other hand, if the fused standard deviation is still larger than the threshold, other common-view frames are searched, new observation values and fused observation standard deviations are determined in the same way, and the iterative fusion continues.
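The iterative update of Equations 22 and 23 can be implemented as a small loop over the common-view observations of one reference pixel; the sketch below is one plausible realisation (the convergence threshold and the use of the reference depth as the prior are assumptions):

```python
import numpy as np

def gaussian_fuse(mu, var, d_obs, var_obs):
    """Equation 22: Bayesian fusion of the running Gaussian with one new observation."""
    mu_new = (var_obs * mu + var * d_obs) / (var + var_obs)
    var_new = (var * var_obs) / (var + var_obs)
    return mu_new, var_new

def fuse_pixel(observations, mu0, var0, sigma_threshold=0.05):
    """Fuse the (d_obs, e_f) observations of one reference pixel along its ray.

    Returns the fused depth mu if the distribution converges, otherwise None
    (the pixel is then treated as a discrete outlier and discarded).
    """
    mu, var = mu0, var0                            # prior, e.g. the reference depth value
    for d_obs, e_f in observations:
        if not np.isfinite(e_f):                   # skip dynamic-object observations
            continue
        mu, var = gaussian_fuse(mu, var, d_obs, e_f ** 2)
        if np.sqrt(var) < sigma_threshold:         # converged: accept this reconstruction point
            return mu                              # Equation 23 then lifts (u, v, mu) to X_new
    return None
```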
The method of the present application is better suited to depth maps that have a large depth range, more noise, and dynamic objects. The embodiment also provides a comparison between the colmap reconstruction effect in the related art and the reconstruction effect of the present application: Fig. 9 is a schematic diagram of the colmap AR large-scene three-dimensional reconstruction effect according to the related art, and Fig. 10 is a schematic diagram of the AR large-scene three-dimensional reconstruction effect according to the embodiment of the present application. As shown in Figs. 9 and 10, the reconstruction generated by colmap exhibits clearly visible reconstruction errors along the rays.
Compared with other schemes in the related art, the technical scheme of the application has the advantages of high quality, low memory consumption, high timeliness and the like:
Firstly, in terms of reconstruction accuracy: in a general depth fusion algorithm, the depth maps recovered visually contain many discrete points, the fused model has high uncertainty and often contains flying points, so it cannot be directly used for mesh reconstruction;
Secondly, compared with voxel fusion methods: the depth range recovered visually is large, so using the TSDF method would require too many voxels, which means extremely high memory consumption; moreover, different scenes differ hugely, making it difficult to represent the reconstructed scene with a single suitable voxel resolution, and the traditional method needs to load every depth image into memory for fusion. For AR large scenes there are often millions (even tens of millions) of depth images to be fused, and the memory consumption is correspondingly large. The algorithm of the embodiment of the present application consumes less memory than the TSDF scheme; it can be deployed on a cloud server, and thanks to its light computation and low memory consumption it can also be deployed on mobile terminals;
Thirdly, compared with existing probability fusion methods: such methods consider the difference in depth uncertainty caused by a 1-pixel matching error, but they are used to fuse multiple depth maps into a single high-confidence depth map and cannot directly reconstruct a dense point cloud. The algorithm of the embodiment of the present application targets point cloud fusion of depth maps recovered from images, and remains compatible if the depth maps are acquired by other sensing devices (such as a ToF camera). It should be noted that the embodiment of the present application is applicable to scenes that do not rely on an additional depth sensor, so the technical scheme can be used directly on existing smart devices without adding sensors; it is also compatible with data acquired by a depth sensor;
Fourthly, compared with methods based on deep-learning implicit spatial representation: such deep learning methods have good reconstruction completeness and accuracy, but current academic frontier work such as NeuralRecon is mainly developed for indoor scenes; for outdoor scenes with larger depth they perform poorly and depend heavily on GPU computing power and video memory. The algorithm of the embodiment of the present application does not depend on specific hardware such as a GPU. It constructs, for the first time, a unified confidence model from the three aspects of texture matching, geometry and semantics, and provides a confidence fusion scheme, namely fusion by filtering, which greatly improves computational efficiency. Meanwhile, the three-dimensional point cloud observations are reduced to one dimension, which greatly simplifies the computational complexity of the problem and further improves efficiency; selecting a reasonable error distribution hypothesis such as the Gaussian distribution improves it further.
In combination with the point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering in the above embodiments, the embodiments of the present application can be implemented by providing a storage medium. The storage medium having stored thereon a computer program; when being executed by a processor, the computer program realizes any one of the point cloud fusion three-dimensional reconstruction methods based on multivariate confidence filtering in the embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device comprises a processor, a memory, a network interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In an embodiment, fig. 11 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application, and as shown in fig. 11, there is provided an electronic device, which may be a server, and an internal structure diagram of which may be as shown in fig. 11. The electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor is used for providing calculation and control capabilities, the network interface is used for being connected and communicated with an external terminal through a network, the internal memory is used for providing an environment for the operation of an operating system and a computer program, the computer program is executed by the processor to realize a point cloud fusion three-dimensional reconstruction method based on multivariate confidence degree filtering, and the database is used for storing data.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is a block diagram of only a portion of the structure associated with the present application, and does not constitute a limitation on the electronic devices to which the present application may be applied, and a particular electronic device may include more or fewer components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by hardware instructions of a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, the computer program may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
It will be understood by those skilled in the art that for simplicity of description, not all possible combinations of the various features of the embodiments described above have been described, but such combinations should be considered within the scope of the present disclosure as long as there is no conflict between such features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.

Claims (10)

1. A point cloud fusion three-dimensional reconstruction method based on multivariate confidence degree filtering is characterized by comprising the following steps:
acquiring images, determining the multi-element confidence coefficient of depth observation to be fused according to the RGB information, the pose information and the corresponding depth map of each frame of image, and merging the multi-element confidence coefficient to obtain a merged confidence coefficient;
for each observation, executing fusion according to the corresponding merging confidence coefficient; determining a reconstruction point according to the fusion result; and performing three-dimensional reconstruction based on the reconstruction point.
2. The method of claim 1, wherein the process of performing fusion and determining reconstruction points according to the fusion result comprises: fusing the observations on each group of common-view frames onto the reference frame along the observation direction through Bayesian filtering, and determining reconstruction points according to the convergence of the fused point cloud distribution.
3. The method of claim 2, wherein determining the merging confidence comprises:
determining the geometric confidence of the depth observation to be fused according to the pose information of the image reference frame and the common-view frame; determining texture matching confidence of depth observation to be fused according to the RGB information of the image and the pose information of the reference frame and the common-view frame; determining semantic confidence of depth observation to be fused according to semantic information solved based on image RGB information;
and combining the geometric confidence coefficient, the texture matching confidence coefficient and the semantic confidence coefficient to obtain a combined confidence coefficient.
4. The method of claim 3, wherein, before determining the multivariate confidences of the depth observation to be fused, the method comprises:
traversing the pixels of the depth map of the reference frame to obtain a first pixel, and determining the coordinate of the target point of the first pixel in the reference-frame coordinate system by back-projecting the depth value of the first pixel, to obtain a first coordinate;
determining the coordinate of the target point in the world coordinate system through a pose transformation, to obtain a second coordinate; and selecting one co-visible frame of the reference frame that has not yet been processed, and determining an observed value according to the second coordinate.
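For illustration only, a minimal sketch of the back-projection and pose transformation recited in claim 4, assuming a pinhole camera model and a camera-to-world pose (R, t); the intrinsic values and function names are illustrative assumptions.

```python
import numpy as np

def backproject(u: float, v: float, depth: float, K: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) with its depth value into the reference-frame
    (camera) coordinate system, yielding the first coordinate."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def camera_to_world(p_cam: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray) -> np.ndarray:
    """Pose transformation of the target point into the world coordinate system,
    yielding the second coordinate."""
    return R_wc @ p_cam + t_wc

# Illustrative intrinsics and an identity camera-to-world pose.
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
p_first = backproject(400.0, 300.0, 2.0, K)                   # first coordinate
p_second = camera_to_world(p_first, np.eye(3), np.zeros(3))   # second coordinate
print(p_first, p_second)
```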
5. The method of claim 4, wherein determining the observed value comprises:
determining, according to the second coordinate, the pixel position of the target point on the co-visible frame through a pose transformation and a projection equation, to obtain a second pixel;
determining the coordinate of the target point of the second pixel in the world coordinate system by back-projecting the depth value of the second pixel, to obtain a third coordinate;
and determining, according to the first coordinate, the third coordinate and the optical center of the reference frame, the coordinate of the projection of the position corresponding to the third coordinate onto the observation vector of the reference frame, and taking the result as the observed value.
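For illustration only, a sketch of one reading of the observed-value computation in claim 5: the co-visible frame's back-projected point (third coordinate) is projected onto the reference observation vector, i.e. the ray from the reference optical center through the first coordinate, and the length of that projection is taken as the observed value. The helper names and the choice to work in reference-camera coordinates are assumptions.

```python
import numpy as np

def project(p_cam: np.ndarray, K: np.ndarray) -> tuple[float, float]:
    """Pinhole projection of a camera-frame point to pixel coordinates
    (used to obtain the second pixel on the co-visible frame)."""
    u = K[0, 0] * p_cam[0] / p_cam[2] + K[0, 2]
    v = K[1, 1] * p_cam[1] / p_cam[2] + K[1, 2]
    return u, v

def observation_value(p_first_ref: np.ndarray, p_third_ref: np.ndarray) -> float:
    """Project the third coordinate onto the reference observation vector.

    Both points are expressed in the reference-camera coordinate system, whose
    origin is the reference optical center; the observation vector passes through
    the first coordinate, and the signed length of the projection is the observed value.
    """
    ray = p_first_ref / np.linalg.norm(p_first_ref)
    return float(np.dot(p_third_ref, ray))

# Example: the co-visible frame sees the point slightly farther along the ray.
print(observation_value(np.array([0.3, 0.1, 2.0]), np.array([0.31, 0.1, 2.05])))
```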
6. The method of claim 5, wherein determining the multivariate confidences of the depth observation to be fused comprises:
determining the texture matching confidence of the depth observation to be fused, expressed as a matching standard deviation;
determining a transformation matrix of the relative pose according to the pose information of the reference frame and the co-visible frame, and determining the distance between the optical center of the reference frame and the optical center of the co-visible frame according to the transformation matrix, to obtain an optical-center distance;
determining the geometric confidence according to the square of the observed value, the focal length in the camera intrinsics of the reference frame, the optical-center distance and the matching standard deviation, the geometric confidence being expressed as a geometric standard deviation;
and querying the semantic category label of the pixel to which each observation belongs to determine the semantic confidence of the observation, wherein the standard deviation corresponding to an observation of a dynamic object is infinite.
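For illustration only, one common uncertainty-propagation form consistent with the quantities recited in claim 6 (square of the observed depth, focal length, optical-center distance, matching standard deviation) is the two-view triangulation relation sigma_geo ≈ z² · sigma_match / (f · b). The sketch below uses that relation, and the dynamic-object label set, purely as assumptions.

```python
import numpy as np

def optical_center_distance(T_ref_w: np.ndarray, T_cov_w: np.ndarray) -> float:
    """Baseline between the two optical centers, taken from 4x4 camera-to-world poses."""
    return float(np.linalg.norm(T_ref_w[:3, 3] - T_cov_w[:3, 3]))

def geometric_sigma(z_obs: float, focal_px: float, baseline: float, sigma_match_px: float) -> float:
    """Assumed form: depth uncertainty grows with the square of the observed depth
    and shrinks with focal length and baseline (standard two-view triangulation)."""
    return (z_obs ** 2) * sigma_match_px / (focal_px * baseline)

def semantic_sigma(label: str) -> float:
    """Dynamic-object observations receive an infinite standard deviation."""
    dynamic_labels = {"person", "car", "bicycle"}   # illustrative label set
    return float("inf") if label in dynamic_labels else 0.0

# Example: 2 m observed depth, 525 px focal length, 0.15 m baseline, 1 px matching std.
print(geometric_sigma(2.0, 525.0, 0.15, 1.0))  # ~0.051 m
print(semantic_sigma("car"))                   # inf
```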
7. The method of claim 6, wherein determining the reconstruction point according to the convergence of the fused point cloud distribution comprises:
for a Gaussian distribution, determining whether the fused standard deviation is smaller than a preset threshold, and if so, determining that the distribution has converged; if it has converged, determining the reconstruction point in the world coordinate system according to the mean of the Gaussian distribution, and marking the merged position corresponding to the co-visible frame;
and if it has not converged, continuing to select an unprocessed co-visible frame of the reference frame, determining the observed value and fusing it, until all co-visible frames of the reference frame have been processed.
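For illustration only, a minimal sketch of the Gaussian case described in claim 7: each observed value is folded into the current estimate by inverse-variance weighting, and the point is accepted once the fused standard deviation falls below a preset threshold. The fusion rule, data structure and threshold value are illustrative assumptions, not the application's exact filter.

```python
import math
from dataclasses import dataclass

@dataclass
class DepthState:
    mean: float    # fused depth along the reference observation vector
    sigma: float   # fused standard deviation

def fuse(state: DepthState, obs: float, sigma_obs: float) -> DepthState:
    """Bayesian update of a Gaussian depth estimate with a Gaussian observation."""
    if math.isinf(sigma_obs):            # e.g. dynamic-object observation: skip it
        return state
    w_s, w_o = 1.0 / state.sigma ** 2, 1.0 / sigma_obs ** 2
    var = 1.0 / (w_s + w_o)
    return DepthState(mean=var * (w_s * state.mean + w_o * obs), sigma=math.sqrt(var))

def converged(state: DepthState, threshold: float) -> bool:
    """Convergence test: fused standard deviation below the preset threshold."""
    return state.sigma < threshold

# Example: three co-visible-frame observations of the same reference pixel.
state = DepthState(mean=2.00, sigma=0.10)
for obs, sigma in [(2.03, 0.05), (1.99, 0.04), (2.01, 0.03)]:
    state = fuse(state, obs, sigma)
print(round(state.mean, 3), round(state.sigma, 3), converged(state, 0.03))
```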
8. A point cloud fusion three-dimensional reconstruction system based on multivariate confidence filtering, characterized in that the system comprises:
a determining module, configured to acquire images, determine multivariate confidences of a depth observation to be fused according to the RGB information, the pose information and the corresponding depth map of each frame of image, and merge the multivariate confidences to obtain a merged confidence;
and a fusion module, configured to perform, for each observation, fusion according to the corresponding merged confidence, determine a reconstruction point according to the fusion result, and perform three-dimensional reconstruction based on the reconstruction point.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering according to any one of claims 1 to 7.
10. A storage medium, in which a computer program is stored, wherein the computer program is configured, when run, to perform the point cloud fusion three-dimensional reconstruction method based on multivariate confidence filtering according to any one of claims 1 to 7.
CN202210910035.3A 2022-07-29 2022-07-29 Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering Pending CN115375836A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210910035.3A CN115375836A (en) 2022-07-29 2022-07-29 Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210910035.3A CN115375836A (en) 2022-07-29 2022-07-29 Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering

Publications (1)

Publication Number Publication Date
CN115375836A true CN115375836A (en) 2022-11-22

Family

ID=84063174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210910035.3A Pending CN115375836A (en) 2022-07-29 2022-07-29 Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering

Country Status (1)

Country Link
CN (1) CN115375836A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468767A (en) * 2023-03-28 2023-07-21 南京航空航天大学 Airplane surface reconstruction method based on local geometric features and implicit distance field
CN116468767B (en) * 2023-03-28 2023-10-13 南京航空航天大学 Airplane surface reconstruction method based on local geometric features and implicit distance field
CN117876608A (en) * 2024-03-11 2024-04-12 魔视智能科技(武汉)有限公司 Three-dimensional image reconstruction method, three-dimensional image reconstruction device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6745328B2 (en) Method and apparatus for recovering point cloud data
US10977818B2 (en) Machine learning based model localization system
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
US11704806B2 (en) Scalable three-dimensional object recognition in a cross reality system
US20120281873A1 (en) Incorporating video meta-data in 3d models
US11989900B2 (en) Object recognition neural network for amodal center prediction
Tao et al. Stereo priori RCNN based car detection on point level for autonomous driving
CN113129352A (en) Sparse light field reconstruction method and device
EP4107650A1 (en) Systems and methods for object detection including pose and size estimation
CN111161398A (en) Image generation method, device, equipment and storage medium
JP2023516656A (en) Efficient localization based on multiple feature types
CN112836698A (en) Positioning method, positioning device, storage medium and electronic equipment
CN114332125A (en) Point cloud reconstruction method and device, electronic equipment and storage medium
CN113223137B (en) Generation method and device of perspective projection human face point cloud image and electronic equipment
CN115375836A (en) Point cloud fusion three-dimensional reconstruction method and system based on multivariate confidence filtering
CN117315372A (en) Three-dimensional perception method based on feature enhancement
CN115578432A (en) Image processing method, image processing device, electronic equipment and storage medium
Bai et al. Visualization pipeline of autonomous driving scenes based on FCCR-3D reconstruction
JP7425169B2 (en) Image processing method, device, electronic device, storage medium and computer program
Hu et al. Bayesian perspective-plane (BPP) with maximum likelihood searching for visual localization
US10453247B1 (en) Vertex shift for rendering 360 stereoscopic content
Johnston Single View 3D Reconstruction using Deep Learning
Clifford Multiple View Texture Mapping: A Rendering Approach Designed for Driving Simulation
Peng et al. 3D Reconstruction Cost Function Algorithm based on Stereo Matching in the Background of Digital Museums
CN116958251A (en) Visual positioning method, visual positioning device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination