CN113436130A - Intelligent sensing system and device for unstructured light field - Google Patents

Intelligent sensing system and device for unstructured light field

Info

Publication number
CN113436130A
Authority
CN
China
Prior art keywords
module
light field
image
unstructured
parallax
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110978131.7A
Other languages
Chinese (zh)
Other versions
CN113436130B (en)
Inventor
方璐
戴琼海
张嘉凝
袁肖赟
毛适
赵严
温建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110978131.7A priority Critical patent/CN113436130B/en
Publication of CN113436130A publication Critical patent/CN113436130A/en
Application granted granted Critical
Publication of CN113436130B publication Critical patent/CN113436130B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 - 3D [Three Dimensional] image rendering
    • G06T15/005 - General purpose rendering architectures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 - Manipulating 3D models or images for computer graphics
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/80 - Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30244 - Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Studio Devices (AREA)

Abstract

The invention provides an intelligent sensing system and device for an unstructured light field. The system comprises: a non-structural heterogeneous high-resolution imaging unit, composed of at least one global camera and a plurality of long-focus cameras, for fusing the data captured by the different heterogeneous image sensors in the unit to obtain a first fused image; a variable baseline light field module, composed of a plurality of non-structural heterogeneous high-resolution imaging units, for fusing the first fused images according to the variable baselines between different cameras to obtain a second fused image; and a camera array composed of four groups of ultra-wide-angle image sensors for fusing the second fused image obtained by the variable baseline light field module to obtain an annular panoramic image. The system has very good robustness and extensibility, can extract accurate depth information over a wide range in large scenes, and can seamlessly embed the high-resolution pictures of the flexibly placed unstructured local cameras into the panoramic image.

Description

Intelligent sensing system and device for unstructured light field
Technical Field
The invention relates to the technical field of virtual reality video acquisition, in particular to an intelligent sensing system and device for an unstructured light field.
Background
Three-dimensional vision is an important perception pathway for human beings: among the five human senses, vision accounts for 70%-80% of the information we receive, and roughly 50% of the brain's capacity is devoted to visual information and perception. Existing image acquisition schemes are two-dimensional and cannot recover the whole three-dimensional scene, whereas the light field completely represents the light rays in the three-dimensional world, and its complete three-dimensional perception capability can even exceed the human visual system. Researchers have proposed a series of systems and theories for light field acquisition, with classical design approaches including: 1) light field acquisition based on a Microlens Array; 2) light field acquisition based on a Camera Array; 3) light field acquisition based on Coded Masks.
The main drawbacks of the microlens-array-based light field acquisition scheme are a severe loss of image resolution at each light field viewpoint and small parallax between viewpoints; the main drawbacks of the camera-array-based scheme are a bulky and complex system, high hardware cost, strict requirements on camera synchronization accuracy, heavy data transmission pressure, and complex calibration between cameras in the array; the coded-mask-based method loses light signal intensity, has a very low imaging signal-to-noise ratio, and requires the captured light field to be restored by algorithms such as compressed sensing, which greatly reduces fidelity.
Therefore, it is difficult for existing light field systems to achieve ultra-wide-angle, ultra-high-definition, wide-range light field acquisition and perception that exceeds human visual perception capability.
Owing to the inherent contradiction between wide field of view and high resolution when recording video of anything from daily behaviors to complex working states in large scenes, the prior art lacks a light field acquisition method that generates virtual reality content with high robustness and high quality.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the invention is to provide an intelligent sensing system for an unstructured light field, which obtains ultra-wide-field, ultra-high-resolution images and videos through fusion imaging with multiple heterogeneous image sensors, and realizes wide-range three-dimensional depth sensing with a variable baseline accurate reconstruction technique. The system also breaks through the limitation that the image sensors of existing light field systems must follow a uniform distribution, and does not depend on a structured system architecture that requires fine calibration.
Furthermore, the invention breaks through the bottleneck that has long constrained the space-time bandwidth product of optical image sensor imaging, raises the data throughput of light field sensing from the internationally leading megapixel level to the hundred-million-pixel level, and also realizes real-time three-dimensional reconstruction of the wide-field, high-resolution dynamic light field.
The second object of the invention is to provide an unstructured light field intelligent sensing device.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides an unstructured light field intelligent sensing system, configured to acquire light field video data, where the system includes:
the non-structural heterogeneous high-resolution imaging unit consists of at least one global camera and a plurality of long-focus cameras and is used for fusing data captured by different heterogeneous image sensors in the non-structural heterogeneous high-resolution imaging unit to obtain a first fused image;
the variable baseline light field module consists of a plurality of non-structural heterogeneous high-resolution imaging units and is used for fusing the first fused image according to variable baselines among different cameras to obtain a second fused image;
the camera array of the four-girdle three-dimensional panoramic acquisition module consists of four groups of super-wide angle image sensors and is used for fusing the second fused image obtained by the variable baseline light field module to obtain an girdle panoramic image.
In the unstructured light field intelligent sensing system of the embodiment of the invention, the non-structural heterogeneous high-resolution imaging unit consists of at least one global camera and a plurality of long-focus cameras and is used for fusing data captured by the different heterogeneous image sensors in the unit to obtain a first fused image; the variable baseline light field module consists of a plurality of non-structural heterogeneous high-resolution imaging units and is used for fusing the first fused image according to the variable baselines between different cameras to obtain a second fused image; the camera array of the four-ring-band stereo panoramic acquisition module consists of four groups of ultra-wide-angle image sensors and is used for fusing the second fused image obtained by the variable baseline light field module to obtain a ring-band panoramic image. Ultra-wide-field, ultra-high-resolution images and videos are obtained through fusion imaging with multiple heterogeneous image sensors, and wide-range three-dimensional depth perception is achieved through a variable baseline accurate reconstruction technique. The system also breaks through the limitation that the image sensors of existing light field systems must follow a uniform distribution, and does not depend on a structured system architecture that requires fine calibration. At the same time, the invention breaks through the bottleneck that has long constrained the space-time bandwidth product of optical image sensor imaging, raises the data throughput of light field sensing from the internationally leading megapixel level to the hundred-million-pixel level, and realizes real-time three-dimensional reconstruction of the wide-field, high-resolution dynamic light field.
In addition, the unstructured light field intelligent sensing system according to the above embodiment of the present invention may also have the following additional technical features:
further, in one embodiment of the present invention, the system further comprises:
a rendering module for implementing annulus panoramic free viewpoint rendering according to a layered rendering strategy, the rendering module comprising:
an original layer module to render the 3D video above a threshold resolution using an original layer;
the fuzzy layer module is used for processing the dragging problem in the picture by adopting a fuzzy layer;
and the dynamic layer module is used for adopting the dynamic layer to perform dynamic foreground rendering.
Further, in an embodiment of the present invention, the system further includes:
the device comprises a support piece, a camera cloud platform and a frame;
the support is used for supporting the four-ring-band stereoscopic panoramic acquisition module and the variable baseline light field module;
the camera gimbal is used for supporting the different heterogeneous image sensors in the non-structural heterogeneous high-resolution imaging unit;
and the frame is used for supporting the unstructured light field intelligent sensing system through a connecting piece.
Further, in an embodiment of the invention, for the images in the video data of the global camera in the non-structural heterogeneous high-resolution imaging unit, a feature-based stitching algorithm estimates the internal and external parameters of the global camera to obtain a preset position of the global camera, and the image of each local camera in the non-structural heterogeneous high-resolution imaging unit is embedded into the preset position of the global camera using unstructured embedding.
Further, in an embodiment of the present invention, the variable baseline light field module is further configured to:
carrying out girdle stereo matching on the girdle images, and extracting a feature map from the girdle images through a neural network;
constructing a matching cost value according to the feature map to obtain a 4D parallax cost value;
obtaining the matching cost under each candidate parallax according to the 4D parallax cost amount;
performing cost aggregation on the cost matching result to obtain an optimized cost matching result;
and determining the parallax of each position from the optimized cost matching result to obtain an annular zone parallax map.
In order to achieve the above object, a second embodiment of the present invention provides an unstructured light field intelligent sensing apparatus, which includes:
the first fusion module is used for fusing data captured by different heterogeneous image sensors to obtain a first fusion image;
the second fusion module is used for fusing the first fusion image according to variable baselines among different cameras to obtain a second fusion image;
and the panoramic image fusion module is used for fusing the second fusion image to obtain an annular panoramic image.
In the unstructured light field intelligent sensing device provided by the embodiment of the invention, the first fusion module fuses the data captured by different heterogeneous image sensors to obtain a first fused image; the second fusion module fuses the first fused image according to the variable baselines between different cameras to obtain a second fused image; and the annular panoramic image fusion module fuses the second fused image to obtain an annular panoramic image. The invention obtains ultra-wide-field, ultra-high-resolution images and videos through fusion imaging with multiple heterogeneous image sensors, and realizes wide-range three-dimensional depth perception with a variable baseline accurate reconstruction technique. The invention also breaks through the limitation that the image sensors of existing light field systems must follow a uniform distribution, and does not depend on a structured system architecture that requires fine calibration. At the same time, the invention breaks through the bottleneck that has long constrained the space-time bandwidth product of optical image sensor imaging, raises the data throughput of light field sensing from the internationally leading megapixel level to the hundred-million-pixel level, and realizes real-time three-dimensional reconstruction of the wide-field, high-resolution dynamic light field.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic structural distribution diagram of an unstructured light field intelligent sensing system according to one embodiment of the invention;
FIG. 2 is a block diagram of an unstructured light field intelligent sensing system according to one embodiment of the invention;
FIG. 3 is a schematic structural diagram of an unstructured light field intelligent sensing system according to one embodiment of the invention;
FIG. 4 is a schematic structural diagram of a non-structural heterogeneous high resolution imaging unit according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of a variable baseline light field module, according to one embodiment of the present invention;
fig. 6 is a schematic structural diagram of a four-girdle stereo panoramic acquisition module according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of intelligent sensing of unstructured light fields, according to one embodiment of the invention;
fig. 8 is a schematic structural diagram of an unstructured light field intelligent sensing device according to an embodiment of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes an unstructured light field intelligent sensing system and device according to an embodiment of the invention with reference to the drawings.
The system of the invention relies on a novel, mixed spherically distributed unstructured heterogeneous camera array for billion-pixel-level, wide-range, long-distance 3D panoramic VR photography, achieves wide-range depth perception, and has imaging capability with both high resolution and a wide field of view. In addition, a dense camera arrangement is not needed: under a sparse camera arrangement, flexible placement of high-resolution regions is realized through unstructured local camera compensation, and the cost is greatly reduced compared with that of a traditional camera array. See fig. 1 and 2.
Fig. 3 is a schematic structural diagram of an unstructured light field intelligent sensing system according to an embodiment of the present invention.
As shown in fig. 3, the system 10 includes: the system comprises a non-structural heterogeneous high-resolution imaging unit 100, a variable baseline light field module 200 and a four-ring band stereo panoramic acquisition module 300.
The non-structural heterogeneous high-resolution imaging unit 100 is composed of at least one global camera and a plurality of telephoto cameras, and is configured to fuse data captured by different heterogeneous image sensors in the non-structural heterogeneous high-resolution imaging unit to obtain a first fused image.
Specifically, as shown in fig. 4, the non-structural heterogeneous high-resolution imaging unit 100 is composed of at least one wide-field global camera and a plurality of high-resolution telephoto cameras, so that the system has imaging capability with both high resolution and wide field.
The variable baseline light field module 200 is composed of a plurality of non-structural heterogeneous high-resolution imaging units 100, and is configured to fuse the first fused image according to the variable baseline between different cameras to obtain a second fused image.
Specifically, as shown in fig. 5, the variable baseline light field module 200 is composed of a plurality of non-structural heterogeneous high-resolution imaging units 100, and can be freely combined into a light field imaging system with different baselines, so as to realize wide-range depth perception.
It can be understood that, for images in the video data of the global camera in the non-structural heterogeneous high-resolution imaging unit 100, the feature-based stitching algorithm estimates the internal and external parameters of the global camera to obtain the corresponding position of the global camera, and embeds the images of the local cameras in the non-structural heterogeneous high-resolution imaging unit 100 into the corresponding position of the global camera using the unstructured embedding.
Further, the invention relies on a novel hybrid spherically distributed unstructured heterogeneous camera array for billion-pixel-level, wide-range, long-distance 3D panoramic VR photography. To enable multi-scale, unstructured and extensible VR content capture, the parameters of all cameras in the system need to be designed accordingly. To capture a large-scene VR scene, each ring-band camera employs a custom lens with a FoV of 360 degrees horizontally and 60 degrees vertically. For each non-structural heterogeneous imaging unit 100, the wide-field global camera in the unit employs a 5 mm lens with a 1/1.7" CMOS sensor to provide a sufficient FoV, while the high-definition local cameras in the unit employ 40 mm lenses with 1/1.8" CMOS sensors to capture high-resolution local details. Notably, the focal length of the local camera lens is adjustable to accommodate various VR scenes.
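As a rough illustration of these optics choices, the horizontal field of view implied by a focal length and sensor width can be estimated with the pinhole model. The sketch below is not part of the invention, and the sensor widths (about 7.6 mm for a 1/1.7" sensor and 7.2 mm for a 1/1.8" sensor) are approximate assumptions:

import math

def horizontal_fov_deg(sensor_width_mm: float, focal_length_mm: float) -> float:
    """Pinhole-model horizontal field of view in degrees."""
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_length_mm)))

print(f'global camera  (5 mm, 1/1.7"): {horizontal_fov_deg(7.6, 5.0):.1f} deg')   # wide FoV
print(f'local  camera (40 mm, 1/1.8"): {horizontal_fov_deg(7.2, 40.0):.1f} deg')  # narrow, high detail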
In terms of mechanical layout, a lightweight aluminum alloy stereo camera frame is adopted, and the frame members are joined by connectors made of thermally stable polylactic acid (PLA). A spherical support for the variable baseline light field module 200, also made of thermally stable PLA, supports each non-structural heterogeneous imaging unit 100. The support of each non-structural heterogeneous imaging unit 100 is a hexagonal lightweight aluminum alloy frame that provides several additional mounting anchor points for camera gimbals, and every camera other than those of the four-ring-band camera system is connected to the system 10 through a gimbal.
Compared with the traditional camera array for light field acquisition, the invention does not need a dense camera arrangement. Under a sparse camera arrangement, flexible placement of high-resolution regions is realized through unstructured local camera compensation, and the cost is greatly reduced compared with that of a traditional camera array.
The camera array of the four-zonal stereo panoramic acquisition module 300 is composed of four groups of super wide-angle sensors, and is used for fusing the second fused image obtained by the variable baseline light field module 200 to obtain an zonal panoramic image.
Further, in order to improve the resolution and detail of the panorama fused from sensors at different levels, an unstructured embedding scheme is used to warp the pictures of the local cameras in all the non-structural heterogeneous imaging units 100 to the positions of the global cameras of their corresponding imaging units. The warping field is obtained by first finding matching points between the global and local pictures with a cross-resolution matching algorithm and then estimating a grid-based multi-homography model. In addition, a linear Monge-Kantorovitch (MKL) solution is applied to map the color patterns of the local cameras to the global panorama to achieve local-global color consistency. By applying the same technical scheme, the fused images of all the non-structural heterogeneous imaging units 100 can be embedded into the panoramic image of the four-ring-band imaging module 300.
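The linear Monge-Kantorovitch mapping mentioned above has a closed-form solution between the colour statistics of two images. The following is a minimal illustrative sketch (not the exact implementation of the invention), assuming local_rgb and global_rgb are float RGB arrays covering roughly the same scene content:

import numpy as np

def _sqrtm_psd(mat):
    """Matrix square root of a symmetric positive semi-definite 3x3 matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def mkl_color_transfer(local_rgb, global_rgb):
    src = local_rgb.reshape(-1, 3)
    ref = global_rgb.reshape(-1, 3)
    mu_s, mu_r = src.mean(axis=0), ref.mean(axis=0)
    cov_s = np.cov(src, rowvar=False) + 1e-8 * np.eye(3)
    cov_r = np.cov(ref, rowvar=False) + 1e-8 * np.eye(3)

    cov_s_half = _sqrtm_psd(cov_s)
    cov_s_half_inv = np.linalg.inv(cov_s_half)
    # Closed-form linear MKL transport matrix between the two Gaussian colour models.
    T = cov_s_half_inv @ _sqrtm_psd(cov_s_half @ cov_r @ cov_s_half) @ cov_s_half_inv

    out = (src - mu_s) @ T.T + mu_r
    return out.reshape(local_rgb.shape)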
The invention provides a variable baseline accurate reconstruction algorithm and its application in the system: the algorithm can adaptively adjust the light field baseline length according to the scene and realize accurate wide-range three-dimensional depth reconstruction; in addition, the details of the three-dimensional depth reconstruction can be further optimized using the images formed by the high-resolution local cameras in each non-structural heterogeneous imaging unit 100.
As shown in fig. 6, the four-ring-band stereo panorama acquisition module 300 finally generates upper and lower panoramic stereo images with a large parallax. From the two stereo ring-band images, feature maps are extracted with a shared-weight feature pyramid through a deformable neural network. The calculation of the ring-band parallax is characterized by using a deformable neural network for feature extraction: unlike a conventional convolutional neural network, which extracts features with ordinary convolutions whose 3x3 kernels sample points in a fixed, regular square pattern, the deformable convolution kernel is not a regular 3x3 square; instead, each sampling point is shifted by an offset learned by an additional convolutional layer, and features are extracted from the image with these irregularly arranged kernels. Because the ring-band image is more strongly distorted than a conventional image, the deformable convolutional neural network can extract its features well.
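For illustration only, a deformable-convolution feature block of the kind described above can be sketched with torchvision's DeformConv2d, where a plain convolution predicts the learned sampling offsets; the channel counts are assumptions, not those of the network actually used:

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformFeatureBlock(nn.Module):
    def __init__(self, in_ch=3, out_ch=32, k=3):
        super().__init__()
        # A plain conv predicts a per-pixel 2D offset for every kernel sample point.
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        offsets = self.offset_pred(x)          # learned, irregular sampling offsets
        return self.relu(self.deform_conv(x, offsets))

features = DeformFeatureBlock()(torch.randn(1, 3, 64, 256))  # e.g. a ring-band image crop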
Specifically, a matching cost volume is constructed from the feature maps to obtain a 4D parallax cost volume; the matching cost under each candidate parallax is obtained from the 4D parallax cost volume; cost aggregation is performed on the cost matching result to obtain an optimized cost matching result; and the parallax of each position is determined from the optimized cost matching result using a differentiable soft-argmin operation, thereby obtaining a panoramic ring-band parallax map.
In one embodiment of the invention, to reduce the complexity of processing large feature maps, a coarse-to-fine strategy is used to extract four feature maps of decreasing spatial resolution. A skip-connected encoder-decoder structure is then adopted to fuse the feature maps of different levels, and an SPP (spatial pyramid pooling) structure is adopted to enlarge the receptive field and the search range.
After feature extraction is completed, the extracted feature maps are used to construct the matching cost volume. The selected candidate disparity range is 0-384 pixels, so a matching cost map must be constructed for each candidate disparity. Specifically, to construct the cost volume under candidate disparity x, all pixels of the feature maps extracted from the right image are shifted by x pixels in the disparity matching direction, and the matching cost under that candidate disparity is then built from a distance metric between the left and right feature maps at that disparity level, forming a 4D (channel C, height H, width W, disparity D) disparity cost volume. There are cost volumes at four different scales, 1/8, 1/16, 1/32 and 1/64, corresponding to the four levels of the coarse-to-fine feature pyramid. The cost volume reflects the matching cost under each candidate disparity.
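A minimal sketch of this cost-volume construction is given below, assuming left/right feature maps from the shared-weight extractor and using a simple feature-difference metric (the actual distance metric of the invention is not specified here):

import torch

def build_cost_volume(feat_left, feat_right, max_disp=384, scale=8):
    B, C, H, W = feat_left.shape
    D = max_disp // scale                      # candidate disparities at this pyramid level
    cost = feat_left.new_zeros(B, C, D, H, W)  # 4D volume: channels x disparity x H x W
    for d in range(D):
        if d == 0:
            cost[:, :, d] = feat_left - feat_right
        else:
            # Shift the right features by d pixels along the disparity direction and
            # compare only where both views overlap.
            cost[:, :, d, :, d:] = feat_left[:, :, :, d:] - feat_right[:, :, :, :-d]
    return cost

vol = build_cost_volume(torch.randn(1, 32, 40, 80), torch.randn(1, 32, 40, 80))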
After the initial cost matching result is obtained, it is very sensitive to noise because only local correlation has been considered, and it cannot be used directly to compute the optimal parallax; it therefore needs further optimization, namely cost aggregation. Traditionally this problem is solved by optimization; in the neural network, the system performs cost aggregation on the preliminarily computed cost matching result with 3D convolutional layers, which can extract semantic information and aggregate matching costs to improve parallax quality. A stacked hourglass structure is used here to learn more semantic information so that the final result has the correct semantic structure.
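As an illustrative sketch only, a single 3D-convolution hourglass pass over the cost volume might look as follows; the real network stacks several such hourglasses, and the channel sizes here are assumptions:

import torch
import torch.nn as nn

class Hourglass3D(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv3d(ch, ch * 2, 3, stride=2, padding=1),
                                   nn.BatchNorm3d(ch * 2), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.Conv3d(ch * 2, ch * 2, 3, stride=2, padding=1),
                                   nn.BatchNorm3d(ch * 2), nn.ReLU(inplace=True))
        self.up1 = nn.ConvTranspose3d(ch * 2, ch * 2, 3, stride=2, padding=1, output_padding=1)
        self.up2 = nn.ConvTranspose3d(ch * 2, ch, 3, stride=2, padding=1, output_padding=1)

    def forward(self, cost):
        d1 = self.down1(cost)
        d2 = self.down2(d1)
        u1 = self.up1(d2) + d1          # skip connection keeps fine matching detail
        return self.up2(u1) + cost      # residual over the raw cost volume

aggregated = Hourglass3D()(torch.randn(1, 32, 48, 40, 80))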
Further, as shown in fig. 5 and 6, the system 10 further includes: a support, a camera gimbal and a frame. The support is used for supporting the four-ring-band stereoscopic panorama acquisition module 300 and the variable baseline light field module 200; the camera gimbal is used in the non-structural heterogeneous high-resolution imaging unit 100 to support the different heterogeneous image sensors; and the frame supports the entire system 10 through connectors.
Further, the system 10 further includes: a rendering module for implementing annulus panoramic free viewpoint rendering according to a layered rendering strategy, the rendering module comprising:
an original layer module to render the 3D video above a threshold resolution using an original layer;
the fuzzy layer module is used for processing the dragging problem in the picture by adopting a fuzzy layer;
and the dynamic layer module is used for adopting the dynamic layer to perform dynamic foreground rendering.
Meanwhile, the process of intelligently fusing the data is a process of fusing a sharp image into a blurrier one, so that the fused image becomes sharper, and at the same time the system has wide-range depth perception.
According to the intelligent sensing system for the unstructured light field, which is provided by the embodiment of the invention, the unstructured heterogeneous high-resolution imaging unit is composed of at least one global camera and a plurality of long-focus cameras and is used for fusing data captured by different heterogeneous image sensors in the unstructured heterogeneous high-resolution imaging unit to obtain a first fused image; the variable baseline light field module consists of a plurality of non-structural heterogeneous high-resolution imaging units and is used for fusing the first fused image according to variable baselines among different cameras to obtain a second fused image; the camera array of the four-girdle stereo panoramic acquisition module consists of four groups of super wide-angle image sensors and is used for fusing the second fused image obtained by the variable baseline light field module to obtain an girdle panoramic image. Super wide view field ultrahigh resolution images and videos are obtained through multiple heterogeneous image sensor fusion imaging, and large-range three-dimensional depth perception is achieved through a variable baseline accurate reconstruction technology. Meanwhile, the system breaks through the limitation that the image sensor of the existing light field system follows uniform distribution, and does not depend on a structured system architecture needing fine calibration. Meanwhile, the invention breaks through the bottleneck of restricting the space-time bandwidth product of the imaging of the optical image sensor for a long time, improves the data flux of the light field sensing from the international highest million-level pixel to a hundred million-level pixel, and also realizes the real-time three-dimension of the wide-field high-resolution dynamic light field.
Next, an unstructured light field intelligent sensing apparatus according to one embodiment of the present invention is described with reference to the accompanying drawings.
Fig. 8 is a schematic structural diagram of an unstructured light field intelligent sensing device according to an embodiment of the invention.
As shown in fig. 8, the apparatus 20 includes: a first fusion module 400, a second fusion module 500, and an annular panoramic image fusion module 600.
The first fusion module 400 is configured to fuse data captured by different heterogeneous image sensors to obtain a first fusion image;
a second fusion module 500, configured to fuse the first fusion image according to variable baselines between different cameras to obtain a second fusion image;
and an annulus panoramic image fusion module 600, configured to fuse the second fusion image to obtain an annulus panoramic image.
Further, the apparatus 20 further comprises: and the layered rendering strategy module is used for realizing annular panoramic free viewpoint rendering according to the layered rendering strategy.
As shown in fig. 7, the device 20 collects, acquires and receives video data by controlling an unstructured camera array. Specifically, multiple heterogeneous image sensors perform fusion imaging: the RGB data captured by image sensors at different levels are intelligently fused to generate ultra-wide-field, ultra-high-definition images and videos; a variable baseline accurate reconstruction technique uses the variable baselines between different cameras to realize wide-range depth perception; and a layered rendering strategy realizes panoramic ring-band free viewpoint rendering. To present a high-resolution panoramic VR scene, a feature-based stitching algorithm is used to estimate the intrinsic and extrinsic parameters of each group of global cameras. In addition, to reduce the obvious artifacts caused by camera positioning errors and color inconsistency near the stitching seam boundary, a graph-cut is applied when computing the camera poses to estimate a seamless mask and eliminate the non-mask regions in the images. Finally, a linear Monge-Kantorovitch solution is used to achieve color consistency between cameras.
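A minimal sketch of the feature-based step, using standard OpenCV primitives rather than the invention's cross-resolution matcher and mesh-based multi-homography model, could match keypoints between two overlapping views and estimate a single homography for embedding:

import cv2
import numpy as np

def estimate_warp(img_global, img_local):
    sift = cv2.SIFT_create()
    kp_g, des_g = sift.detectAndCompute(cv2.cvtColor(img_global, cv2.COLOR_BGR2GRAY), None)
    kp_l, des_l = sift.detectAndCompute(cv2.cvtColor(img_local, cv2.COLOR_BGR2GRAY), None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_l, des_g, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]  # Lowe ratio test

    src = np.float32([kp_l[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_g[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    h, w = img_global.shape[:2]
    return cv2.warpPerspective(img_local, H, (w, h))  # local view warped into the global frame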
It can be understood that, in conventional processing, one only needs to find the candidate disparity with the minimum matching cost at each position and take it as the disparity value at that position; however, such a non-differentiable operation cannot be realized in a neural network, so the invention obtains the disparity map from the cost volume with a differentiable soft-argmin operation. The probability of the disparity at each candidate value is computed from the prediction cost with softmax, and the predicted disparity is the sum of the candidate disparity values weighted by their probabilities, giving the disparity value at every point.
Using softmax on the predicted cost c_d, the probability of the disparity at each position taking each candidate disparity value is calculated, and the predicted disparity is the sum of the candidate disparity values weighted by their probabilities:

\hat{d} = \sum_{d=0}^{D_{max}} d \times \sigma(-c_d)

where \hat{d} is the predicted disparity, d is the true disparity, D_{max} is the maximum candidate disparity value, \sigma represents the softmax operation, and c_d is the cost of the disparity candidate value d. The loss function L_{s1} of the neural network is:

L_{s1} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L_1}\left(d_i - \hat{d}_i\right)

where N is the number of ground-truth disparity values, \hat{d}_i is the predicted disparity, and d_i is the true disparity.
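A minimal sketch of this differentiable disparity regression, assuming a cost volume of shape (B, D, H, W) and a smooth-L1 supervision as in common stereo networks, is:

import torch
import torch.nn.functional as F

def soft_argmin(cost, max_disp=384):
    # cost: (B, D, H, W) matching cost per candidate disparity
    prob = F.softmax(-cost, dim=1)                            # lower cost -> higher probability
    disparities = torch.arange(max_disp, dtype=cost.dtype,
                               device=cost.device).view(1, -1, 1, 1)
    return (prob * disparities).sum(dim=1)                    # expected disparity per pixel

pred = soft_argmin(torch.randn(1, 384, 60, 120))
loss = F.smooth_l1_loss(pred, torch.rand(1, 60, 120) * 384)   # supervised against ground truth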
After the preliminary disparity is obtained, since the baseline length of the ring-band system and the focal length of the camera are known, disparity is converted to depth by the relation:

Z = \frac{b \cdot f}{d}

where b is the camera system baseline, f is the focal length of the camera system, d is the pixel disparity, and Z is the actual depth of the pixel. In this way a depth prior for each region in the scene is obtained, and this prior can better guide further depth estimation. Because the variable baseline light field consists of multiple groups of non-structural heterogeneous imaging units, light field imaging combinations with different baselines can be formed between different units, and wide-range depth perception is then achieved on the basis of the ring-band depth prior. As the disparity-depth relation above shows, for a fixed focal length and baseline the disparity is inversely proportional to the depth, so it is difficult to recover good three-dimensional depth information where the disparity is very small or very large and the matching is hard; for different disparity regions, a suitable combination of baseline and focal length must therefore be selected to recover depth accurately. In general, a disparity range of 10-60 pixels is most suitable for two-view disparity computation, so the ring-band depth prior can be used to determine the optimal baseline combination at each location and acquire the disparity there more accurately.
After the optimal baseline is determined, the exact disparity is obtained by the same steps as the zone depth calculation.
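For illustration, the disparity-depth relation and the baseline selection driven by the 10-60 pixel range can be sketched as follows; the candidate baselines and focal length are made-up example values:

def disparity_to_depth(d_pixels, baseline_m, focal_px):
    return baseline_m * focal_px / d_pixels          # Z = b * f / d

def pick_baseline(depth_prior_m, focal_px, candidate_baselines=(0.1, 0.2, 0.4, 0.8)):
    best, best_d = None, None
    for b in candidate_baselines:
        d = b * focal_px / depth_prior_m             # disparity this baseline would produce
        if 10.0 <= d <= 60.0 and (best is None or abs(d - 35.0) < abs(best_d - 35.0)):
            best, best_d = b, d                      # prefer disparities near mid-range
    return best

print(pick_baseline(depth_prior_m=8.0, focal_px=1200.0))     # baseline keeping disparity in 10-60 px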
Further, cross-scale mapping fusion optimization can be performed on the disparity using the non-structural heterogeneous high-resolution imaging unit. The depth acquired in variable baseline wide-range depth perception is a depth map at the viewpoint of the wide-field global camera of each non-structural heterogeneous high-resolution imaging unit, and this depth map can be further refined using the embedded high-definition local camera.
For the embedded local RGB image, a bilateral solver is adopted and the local disparity map is refined based on the structure of the high-resolution local RGB image. Assuming that the target disparity map is t and the per-pixel confidence map is c, an improved disparity map x is obtained by minimizing the following function:
\min_x \; \frac{\lambda}{2} \sum_{i,j} \hat{W}_{i,j} (x_i - x_j)^2 + \sum_i c_i (x_i - t_i)^2

where \hat{W} is the affinity (correlation) matrix obtained from the reference image in the YUV color space.
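A heavily simplified sketch of this refinement objective is shown below: it keeps the confidence-weighted data term and a colour-guided smoothness term, but replaces the bilateral-space affinity with a simple 4-neighbour affinity so that the problem reduces to one sparse linear solve. It illustrates the objective, not the solver actually used:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def refine_disparity(target, confidence, guide, lam=4.0, sigma_c=8.0):
    H, W = target.shape
    n = H * W
    idx = np.arange(n).reshape(H, W)
    rows, cols, vals = [], [], []
    for (di, dj) in ((0, 1), (1, 0)):                        # right and down neighbours
        a = idx[:H - di, :W - dj].ravel()
        b = idx[di:, dj:].ravel()
        diff = guide[:H - di, :W - dj] - guide[di:, dj:]     # colour difference in the guide image
        w = np.exp(-(diff.reshape(len(a), -1) ** 2).sum(1) / (2 * sigma_c ** 2))
        rows += [a, b]; cols += [b, a]; vals += [w, w]
    Wm = sp.coo_matrix((np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
                       shape=(n, n))
    L = sp.diags(np.asarray(Wm.sum(1)).ravel()) - Wm         # graph Laplacian from affinities
    C = sp.diags(confidence.ravel())
    x = spsolve((C + lam * L).tocsr(), C @ target.ravel())   # normal equations (C + lambda*L) x = C t
    return x.reshape(H, W)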
After the RGB pictures and the corresponding disparity maps are obtained, an efficient three-layer rendering scheme is proposed for rendering the billion-pixel 3D video in real time.
A layered rendering policy module for implementing annulus panoramic free viewpoint rendering according to a layered rendering policy, wherein the layered rendering policy module comprises:
an original layer module to render the 3D video above a threshold resolution using an original layer;
the fuzzy layer module is used for processing the dragging problem in the picture by adopting a fuzzy layer;
and the dynamic layer module is used for adopting the dynamic layer to perform dynamic foreground rendering.
Wherein rendering the high-resolution 3D video using the original layer comprises: projecting the stitched disparity map onto three-dimensional coordinates to generate a background grid, and drawing the stitched panorama on the background grid:
[X_p, Y_p, Z_p]^T = R^{-1} K^{-1} z_p [u_p, v_p, 1]^T

where K and R represent the internal and external parameters of the camera, (u_p, v_p) is the pixel position of point p in the image plane, z_p is the pixel depth value, and (X_p, Y_p, Z_p) is the rendering position of pixel p.
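A minimal sketch of this back-projection, assuming a pinhole model with intrinsics K and rotation R (any translation folded into the pose) and a per-pixel depth map, is:

import numpy as np

def backproject(depth, K, R):
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T    # homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                        # camera-space rays
    pts_cam = rays * depth.reshape(1, -1)                                # scale rays by depth
    pts_world = np.linalg.inv(R) @ pts_cam                               # rotate into the world frame
    return pts_world.T.reshape(H, W, 3)                                  # (X_p, Y_p, Z_p) per pixel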
For the area covered by the local camera, the mesh vertex density is increased at magnification to obtain better depth quality.
When rendering using a single layer mesh, stretched triangle artifacts are easily created at the depth edges when moving the viewpoint. To optimize these artifacts, we first tear the grid apart by removing the grid whose normal direction makes a large angle with the view direction.
The use of the blur layer to handle the dragging caused by abrupt depth changes at occlusions includes:
removing the dragging area: the dragging area that degrades the visual effect is removed by discarding mesh faces whose normal direction makes a large angle with the viewing direction:

\theta = \arccos\left( \frac{\vec{n} \cdot \vec{v}}{\|\vec{n}\| \, \|\vec{v}\|} \right)

where \vec{n} is the normal vector of the mesh face, \vec{v} is the viewing direction from the face center to the optical center, and \theta is the included angle between them; a face is removed when \theta exceeds a threshold (an illustrative sketch of this test is given after the blur-layer description below);
and adding a fuzzy layer behind the original layer to repair holes generated when the viewpoint is moved.
After the dragging area is removed, a hole appears in the rendering effect, and the hole appearing when the viewpoint is moved is repaired by adding a fuzzy layer behind the original layer, so that the dragging area with sudden change of the shielding area becomes smooth, and the whole visual effect is not influenced.
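For illustration, the face-culling test referenced above can be sketched as follows, where the angle threshold is an assumed value:

import numpy as np

def cull_stretched_faces(vertices, faces, optical_center, max_angle_deg=75.0):
    tri = vertices[faces]                                   # (F, 3, 3) triangle vertices
    normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    centers = tri.mean(axis=1)
    view_dirs = optical_center[None, :] - centers           # face centre -> optical centre
    cosang = np.einsum('ij,ij->i', normals, view_dirs) / (
        np.linalg.norm(normals, axis=1) * np.linalg.norm(view_dirs, axis=1) + 1e-12)
    keep = np.degrees(np.arccos(np.clip(np.abs(cosang), -1.0, 1.0))) <= max_angle_deg
    return faces[keep]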
To realize efficient rendering, dynamic foreground rendering with the dynamic layer includes updating the mesh of the dynamic foreground.
The updating of the grid of the dynamic foreground specifically includes:
the extraction module is used for initially extracting the dynamic foreground grid through Gaussian mixture model background subtraction;
the optimization module is used for optimizing a dynamic mask with clear dynamic foreground grids by adopting a fully connected conditional random field model;
a rendering module to recalculate 3D vertices belonging to the dynamic mask based on the dynamic mask to render the dynamic foreground mesh.
The foreground may be initially extracted by a Gaussian Mixture Model (GMM) background subtraction method. Since the dynamic mask generated by GMM is coarser in the object boundary, an efficient dense conditional random field (denseCRF) inference model is employed to obtain a clear boundary mask. For each new frame, the 3D vertices belonging to the dynamic mask are recalculated based on the high quality dynamic mask to render the dynamic object.
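A minimal sketch of the GMM foreground extraction with OpenCV's MOG2 subtractor is shown below; the denseCRF boundary refinement described above is replaced by simple morphological cleanup for brevity:

import cv2
import numpy as np

mog2 = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def dynamic_mask(frame):
    fg = mog2.apply(frame)
    fg = np.where(fg == 255, 255, 0).astype(np.uint8)       # drop shadow pixels (value 127)
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)       # remove speckle noise
    return cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)    # fill small holes in the mask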
A rendering procedure based on all the layers can generate high-quality panoramic rendering results, especially in local regions, which improves the visual effect and provides a zoom-in function. Furthermore, the blur layer eliminates artifacts caused by occlusion, and the dynamic layer allows the dynamic regions to be updated efficiently.
In multiple on-site shooting experiments, the invention has shown excellent robustness and extensibility. It can extract accurate depth information over a wide range in large scenes, and the high-resolution pictures of the flexibly placed unstructured local cameras can be seamlessly embedded into the panoramic image.
It should be noted that the foregoing explanation of the embodiment of the unstructured light field intelligent sensing system is also applicable to the unstructured light field intelligent sensing apparatus of this embodiment, and details are not repeated here.
According to the intelligent sensing device for the unstructured light field, which is provided by the embodiment of the invention, the first fusion module is used for fusing data captured by different heterogeneous image sensors to obtain a first fusion image; the second fusion module is used for fusing the first fusion image according to the variable baseline among different cameras to obtain a second fusion image; and the annular panoramic image fusion module is used for fusing the second fusion image to obtain an annular panoramic image. According to the invention, ultra-wide view field ultrahigh resolution images and videos are obtained through multiple heterogeneous image sensor fusion imaging, and large-range three-dimensional depth perception is realized by a variable baseline accurate reconstruction technology. Meanwhile, the invention also breaks through the limitation that the image sensor of the existing light field system follows uniform distribution, and does not depend on a structured system architecture which needs fine calibration. Meanwhile, the invention breaks through the bottleneck of restricting the space-time bandwidth product of the imaging of the optical image sensor for a long time, improves the data flux of the light field sensing from the international highest million-level pixel to a hundred million-level pixel, and also realizes the real-time three-dimension of the wide-field high-resolution dynamic light field.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. An unstructured light field intelligent sensing system for acquiring light field video data, the system comprising:
the non-structural heterogeneous high-resolution imaging unit consists of at least one global camera and a plurality of long-focus cameras and is used for fusing data captured by different heterogeneous image sensors in the non-structural heterogeneous high-resolution imaging unit to obtain a first fused image;
the variable baseline light field module consists of a plurality of non-structural heterogeneous high-resolution imaging units and is used for fusing the first fused image according to variable baselines among different cameras to obtain a second fused image;
the camera array of the four-girdle three-dimensional panoramic acquisition module consists of four groups of super-wide angle image sensors and is used for fusing the second fused image obtained by the variable baseline light field module to obtain an girdle panoramic image.
2. The unstructured light field intelligent perception system of claim 1, further comprising:
a rendering module for implementing annulus panoramic free viewpoint rendering according to a layered rendering strategy, the rendering module comprising:
an original layer module to render the 3D video above a threshold resolution using an original layer;
the fuzzy layer module is used for processing the dragging problem in the picture by adopting a fuzzy layer;
and the dynamic layer module is used for adopting the dynamic layer to perform dynamic foreground rendering.
3. The unstructured light field intelligent perception system of claim 1, further comprising:
a support, a camera gimbal and a frame;
the support is used for supporting the four-ring-band stereoscopic panoramic acquisition module and the variable baseline light field module;
the camera gimbal is used for supporting the different heterogeneous image sensors in the non-structural heterogeneous high-resolution imaging unit;
and the frame is used for supporting the unstructured light field intelligent sensing system through a connecting piece.
4. The unstructured light field intelligent perception system according to claim 1, wherein images in video data of a global camera in the unstructured heterogeneous high resolution imaging unit are processed by a feature-based stitching algorithm to estimate internal and external parameters of the global camera to obtain a preset position of the global camera, and images of local cameras in the unstructured heterogeneous high resolution imaging unit are embedded into the preset position of the global camera by using unstructured embedding.
5. The unstructured light field intelligent perception system of claim 1, wherein the variable baseline light field module is further configured to:
carrying out girdle stereo matching on the girdle images, and extracting a feature map from the girdle images through a neural network;
constructing a matching cost value according to the feature map to obtain a 4D parallax cost value;
obtaining the matching cost under each candidate parallax according to the 4D parallax cost amount;
performing cost aggregation on the cost matching result to obtain an optimized cost matching result;
and determining the parallax of each position from the optimized cost matching result to obtain an annular zone parallax map.
6. An intelligent sensing device for an unstructured light field, comprising:
the first fusion module is used for fusing data captured by different heterogeneous image sensors to obtain a first fusion image;
the second fusion module is used for fusing the first fusion image according to variable baselines among different cameras to obtain a second fusion image;
and the panoramic image fusion module is used for fusing the second fusion image to obtain an annular panoramic image.
7. The unstructured light field intelligent perception device of claim 6, further comprising:
a layered rendering policy module for implementing annulus panoramic free viewpoint rendering according to a layered rendering policy, wherein the layered rendering policy module comprises:
an original layer module to render the 3D video above a threshold resolution using an original layer;
the fuzzy layer module is used for processing the dragging problem in the picture by adopting a fuzzy layer;
and the dynamic layer module is used for adopting the dynamic layer to perform dynamic foreground rendering.
8. The unstructured light field intelligent sensing apparatus of claim 7, further comprising:
the extraction module is used for initially extracting the dynamic foreground grid through Gaussian mixture model background subtraction;
the optimization module is used for optimizing the dynamic mask with clear dynamic foreground grids by adopting a fully connected conditional random field model;
a rendering module to recalculate 3D vertices belonging to the dynamic mask based on the dynamic mask to render the dynamic foreground mesh.
9. The unstructured light field intelligent sensing apparatus according to claim 6, wherein features are extracted from the ring-band panoramic image using irregularly arranged convolution kernels of a neural network, whose sampling offsets are learned by an additional convolutional layer; softmax is applied to the predicted cost c_d to calculate the probability of the disparity at each position taking each candidate disparity value, and the predicted disparity is the sum of the candidate disparity values weighted by their probabilities:

\hat{d} = \sum_{d=0}^{D_{max}} d \times \sigma(-c_d)

the loss function L_{s1} of the neural network is:

L_{s1} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{smooth}_{L_1}\left(d_i - \hat{d}_i\right)

where \hat{d} is the predicted disparity, d is the true disparity, D_{max} is the maximum value of the candidate disparity values, \sigma represents the softmax operation, c_d is the cost of the disparity candidate value d, and N is the number of true disparity values.
10. The intelligent sensing apparatus for unstructured light fields according to claim 6, further comprising a disparity map module for:
adopting a bilateral solver for the embedded local image, and obtaining an improved disparity map x based on the structure of the local image by minimizing the following function:

\min_x \; \frac{\lambda}{2} \sum_{i,j} \hat{W}_{i,j} (x_i - x_j)^2 + \sum_i c_i (x_i - t_i)^2

where t is the target disparity map, c is the per-pixel confidence map, and \hat{W} is the affinity (correlation) matrix obtained from the reference image.
CN202110978131.7A 2021-08-25 2021-08-25 Intelligent sensing system and device for unstructured light field Active CN113436130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110978131.7A CN113436130B (en) 2021-08-25 2021-08-25 Intelligent sensing system and device for unstructured light field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110978131.7A CN113436130B (en) 2021-08-25 2021-08-25 Intelligent sensing system and device for unstructured light field

Publications (2)

Publication Number Publication Date
CN113436130A true CN113436130A (en) 2021-09-24
CN113436130B CN113436130B (en) 2021-12-21

Family

ID=77797796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110978131.7A Active CN113436130B (en) 2021-08-25 2021-08-25 Intelligent sensing system and device for unstructured light field

Country Status (1)

Country Link
CN (1) CN113436130B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114187724A (en) * 2021-12-01 2022-03-15 北京拙河科技有限公司 Target area security and monitoring system based on hundred million-level pixel camera
CN114842359A (en) * 2022-04-29 2022-08-02 西北工业大学 Vision-based method for detecting autonomous landing runway of fixed-wing unmanned aerial vehicle
CN116449642A (en) * 2023-06-19 2023-07-18 清华大学 Immersion type light field intelligent sensing calculation system, method and device
CN118301489A (en) * 2024-04-15 2024-07-05 四川新视创伟超高清科技有限公司 Parallax elimination method and system for multi-viewpoint image

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130141524A1 (en) * 2012-06-08 2013-06-06 Apple Inc. Methods and apparatus for capturing a panoramic image
US20160286137A1 (en) * 2013-09-16 2016-09-29 Duke University Method for combining multiple image fields
CN107734268A (en) * 2017-09-18 2018-02-23 北京航空航天大学 A kind of structure-preserved wide baseline video joining method
CN110581959A (en) * 2018-06-07 2019-12-17 株式会社理光 Multiple imaging apparatus and multiple imaging method
CN111343367A (en) * 2020-02-17 2020-06-26 清华大学深圳国际研究生院 Billion-pixel virtual reality video acquisition device, system and method
CN113016173A (en) * 2018-12-07 2021-06-22 三星电子株式会社 Apparatus and method for operating a plurality of cameras for digital photographing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130141524A1 (en) * 2012-06-08 2013-06-06 Apple Inc. Methods and apparatus for capturing a panoramic image
US20160286137A1 (en) * 2013-09-16 2016-09-29 Duke University Method for combining multiple image fields
CN107734268A (en) * 2017-09-18 2018-02-23 北京航空航天大学 A kind of structure-preserved wide baseline video joining method
CN110581959A (en) * 2018-06-07 2019-12-17 株式会社理光 Multiple imaging apparatus and multiple imaging method
CN113016173A (en) * 2018-12-07 2021-06-22 三星电子株式会社 Apparatus and method for operating a plurality of cameras for digital photographing
CN111343367A (en) * 2020-02-17 2020-06-26 清华大学深圳国际研究生院 Billion-pixel virtual reality video acquisition device, system and method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114187724A (en) * 2021-12-01 2022-03-15 北京拙河科技有限公司 Target area security and monitoring system based on hundred million-level pixel camera
CN114842359A (en) * 2022-04-29 2022-08-02 西北工业大学 Vision-based method for detecting autonomous landing runway of fixed-wing unmanned aerial vehicle
CN116449642A (en) * 2023-06-19 2023-07-18 清华大学 Immersion type light field intelligent sensing calculation system, method and device
CN116449642B (en) * 2023-06-19 2023-08-29 清华大学 Immersion type light field intelligent sensing calculation system, method and device
CN118301489A (en) * 2024-04-15 2024-07-05 四川新视创伟超高清科技有限公司 Parallax elimination method and system for multi-viewpoint image
CN118301489B (en) * 2024-04-15 2024-09-27 四川国创新视超高清视频科技有限公司 Parallax elimination method and system for multi-viewpoint image

Also Published As

Publication number Publication date
CN113436130B (en) 2021-12-21

Similar Documents

Publication Publication Date Title
CN113436130B (en) Intelligent sensing system and device for unstructured light field
CN107659774B (en) Video imaging system and video processing method based on multi-scale camera array
CN108986136B (en) Binocular scene flow determination method and system based on semantic segmentation
KR102003015B1 (en) Creating an intermediate view using an optical flow
CN111343367B (en) Billion-pixel virtual reality video acquisition device, system and method
CN109360235B (en) Hybrid depth estimation method based on light field data
US20220222776A1 (en) Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
WO2019214568A1 (en) Depth-based light field splicing method
US8928737B2 (en) System and method for three dimensional imaging
CN108074218A (en) Image super-resolution method and device based on optical field acquisition device
CN106570938A (en) OPENGL based panoramic monitoring method and system
CN112396562A (en) Disparity map enhancement method based on RGB and DVS image fusion in high-dynamic-range scene
JP2008016918A (en) Image processor, image processing system, and image processing method
CN113221665A (en) Video fusion algorithm based on dynamic optimal suture line and improved gradual-in and gradual-out method
CN109949354B (en) Light field depth information estimation method based on full convolution neural network
CN114419568A (en) Multi-view pedestrian detection method based on feature fusion
CN108156383B (en) High-dynamic billion pixel video acquisition method and device based on camera array
CN115086550A (en) Meta-imaging method and system
CN108564654B (en) Picture entering mode of three-dimensional large scene
CN110290373B (en) Integrated imaging calculation reconstruction method for increasing visual angle
JP2013200840A (en) Video processing device, video processing method, video processing program, and video display device
KR100321904B1 (en) An apparatus and method for extracting of camera motion in virtual studio
JP2005260753A (en) Device and method for selecting camera
CN114862934B (en) Scene depth estimation method and device for billion pixel imaging
CN115988338B (en) Far-field signal inversion reconstruction method based on compound-eye camera array

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant