CN118354053A - Stereoscopic vision video communication method suitable for computing power resource constraint environment - Google Patents

Stereoscopic vision video communication method suitable for computing power resource constraint environment

Info

Publication number
CN118354053A
Authority
CN
China
Prior art keywords
image
resolution
depth
computing power
low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410527594.5A
Other languages
Chinese (zh)
Inventor
程咏阳
孙上
张涛
惠钊
秦伯钦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd filed Critical Tianyi Cloud Technology Co Ltd
Priority to CN202410527594.5A
Publication of CN118354053A
Legal status: Pending

Landscapes

  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention discloses a stereoscopic vision video communication method suitable for a computing power resource constraint environment. The method uses low-precision, low-resolution data acquisition equipment for data processing and transmission, thereby greatly reducing bandwidth consumption. A parallax image rendering method renders, based on eye tracking, only two low-resolution parallax images and enhances their image quality after rendering, which improves parallax image quality while reducing hardware cost and processing delay. A nonlinear-mapping low-resolution image smoothing method effectively suppresses isolated noise points and artifacts. Without reducing the fidelity or resolution of the final presented image, the method reduces the consumption of network bandwidth, GPU computing power and storage space and further compresses hardware cost, making it suitable for an execution environment with computing power resource constraints.

Description

Stereoscopic vision video communication method suitable for computing power resource constraint environment
Technical Field
The invention relates to the technical field of IT and software, in particular to a stereoscopic vision video communication method suitable for a computing power resource constraint environment.
Background
In modern society, video communication has become one of the important channels for people's daily business, work and study. With the rapid development of cloud computing, 6G and AI technology, the traditional processing method based on simple push-pull streaming and 2D picture presentation suffers from low communication efficiency and insufficient immersion. A stereoscopic vision video communication method presents a three-dimensional scene or object with high quality, so that users obtain a more realistic and immersive experience, their sense of participation is improved, and a foundation is laid for the formation of the immersive video communication industry. However, presenting high-image-quality three-dimensional stereoscopic vision involves many steps, in particular high-precision data acquisition, transmission, three-dimensional model reconstruction and rendering, which place extremely high requirements on computing power resources such as acquisition equipment precision, network bandwidth, GPU and storage space; the resulting cost and computing power threshold is difficult to cross in execution environments with computing power resource constraints, such as those of small and medium enterprises or individual users. The prior art lacks a stereoscopic vision video communication method suitable for a computing power resource constraint environment.
Disclosure of Invention
This section is intended to outline some aspects of embodiments of the application and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section as well as in the description of the application and in the title of the application, which may not be used to limit the scope of the application.
Therefore, in order to solve the above technical problems, the invention provides the following technical scheme: a stereoscopic vision video communication method suitable for a computing power resource constraint environment, in which low-precision, low-resolution data acquisition equipment is used for data processing and transmission, greatly reducing bandwidth consumption;
A parallax image rendering method renders, based on eye tracking, only two low-resolution parallax images, and image quality enhancement is performed after rendering, so that parallax image quality is improved while hardware cost and processing delay are reduced;
A nonlinear-mapping low-resolution image smoothing method effectively suppresses isolated noise points, artifacts and the like; without reducing the fidelity or resolution of the final image, the method reduces the consumption of network bandwidth, GPU computing power and storage space and further compresses hardware cost, so that it is suitable for an execution environment with computing power resource constraints.
As a preferable scheme of the stereoscopic vision video communication method applicable to the computing power resource constraint environment, the invention comprises the following steps: after the high-definition video stream is transmitted to the cloud, the GPU performs three-dimensional model reconstruction and rendering, mainly comprising depth fusion, color fusion and parallax image rendering.
As a preferable scheme of the stereoscopic vision video communication method applicable to the computing power resource constraint environment, the invention comprises the following steps: the three-dimensional model reconstruction and rendering method comprises the following specific steps:
S1: performing depth image alignment; because Depth images acquired under different view angles may have distortion and errors, the Depth images need to be aligned by matching adjacent frame Depth images and using a characteristic point matching mode, so that Depth information from different view angles corresponds to the same position in a three-dimensional space;
S2: combining depth information from different visual angles by a weighted average-based method to obtain a more accurate and complete three-dimensional model;
S3: combining RGB color information from different visual angles to obtain more real and natural colors, generating a texture map by the combined color information, and mapping the texture map onto a three-dimensional model, so that the mapping of the colors is realized;
S4: the three-dimensional model and texture map are rendered into a two-dimensional image for presentation on a screen.
As a preferable scheme of the stereoscopic vision video communication method applicable to the computing power resource constraint environment, the invention comprises the following steps: in step S3, in order to ensure that the finally generated three-dimensional model has high quality and high precision, consistency processing of color and brightness is performed at the same time, so that discontinuities in color and brightness are avoided.
As a preferable scheme of the stereoscopic vision video communication method applicable to the computing power resource constraint environment, the invention comprises the following steps: in the process of combining depth information from different viewing angles by the weighted-average-based method, the internal and external parameters of the cameras, distortion correction, the precision of the depth images, and possible noise and outliers need to be considered; data cleaning and filtering are performed to ensure that the generated three-dimensional point cloud has high quality and high precision.
As a preferable scheme of the stereoscopic vision video communication method applicable to the computing power resource constraint environment, the invention comprises the following steps: a low-precision RGB camera with a resolution of 1280 x 720 is adopted, with a frame rate of 30 fps, a color depth of 24 bits and YUV420 chroma sampling, four such cameras in total; meanwhile, a low-precision Depth camera with a resolution of 640 x 480 is adopted, with a frame rate of 30 fps, a color depth of 16 bits and YUV420 chroma sampling, three such cameras in total; both kinds of camera use H.265 coding with a data compression ratio of 100; the required network bandwidth is calculated using the following formula: network bandwidth required for video transmission = number of video streams × video resolution × frame rate × color depth × chroma sampling factor × coding efficiency.
As a preferable scheme of the stereoscopic vision video communication method applicable to the computing power resource constraint environment, the invention comprises the following steps: based on the above formula, for the prior-art configuration the network bandwidth required for RGB video transmission is 4 × 1600 × 1200 × 60 × 24 × 0.5 × 0.01 = 52.734375 Mbps;
the network bandwidth required for Depth video transmission is 3 × 1280 × 1024 × 60 × 30 × 0.5 × 0.01 = 33.75 Mbps;
thus, the network bandwidth required for the entire prior-art video transmission is 52.734375 Mbps + 33.75 Mbps = 86.484375 Mbps.
For the present invention, according to the same formula, the network bandwidth required for RGB video transmission is 4 × 1280 × 720 × 30 × 24 × 0.5 × 0.01 = 12.65625 Mbps;
the network bandwidth required for Depth video transmission is 3 × 640 × 480 × 30 × 16 × 0.5 × 0.01 = 2.109375 Mbps;
therefore, the network bandwidth required for the entire video transmission is 12.65625 Mbps + 2.109375 Mbps = 14.765625 Mbps, which is greatly reduced compared with the 86.484375 Mbps required in the prior art.
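The figures above can be reproduced with the short sketch below; treating Mbps as 2^20 bits per second is an inference from the stated results rather than something the text specifies.

```python
# Minimal sketch (not from the patent text) reproducing the bandwidth arithmetic above.
# Mbps is taken as 2**20 bits per second, which is what makes the figures match.
def video_bandwidth_mbps(streams, width, height, fps, color_depth_bits,
                         chroma_factor=0.5, coding_efficiency=0.01):
    """bandwidth = streams * resolution * frame rate * color depth
                   * chroma sampling factor * coding efficiency"""
    bits_per_second = (streams * width * height * fps * color_depth_bits
                       * chroma_factor * coding_efficiency)
    return bits_per_second / 2**20

prior_rgb   = video_bandwidth_mbps(4, 1600, 1200, 60, 24)   # 52.734375
prior_depth = video_bandwidth_mbps(3, 1280, 1024, 60, 30)   # 33.75
low_rgb     = video_bandwidth_mbps(4, 1280,  720, 30, 24)   # 12.65625
low_depth   = video_bandwidth_mbps(3,  640,  480, 30, 16)   # 2.109375
print(prior_rgb + prior_depth, low_rgb + low_depth)          # 86.484375 14.765625
```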
As a preferable scheme of the stereoscopic vision video communication method applicable to the computing power resource constraint environment, the invention comprises the following steps: the image is divided into a number of small areas and the best match is found in each area; depth information is obtained by calculating the offset of each pixel between the two viewing angles; the original image is then shifted and rendered according to the depth values, creating 2D images for the left eye and the right eye respectively, while the two parallax images remain low resolution.
As a preferable scheme of the stereoscopic vision video communication method applicable to the computing power resource constraint environment, the invention comprises the following steps: the depth value of each pixel can be calculated from the parallax and the camera parameters; specifically, the internal and external parameters and the distortion parameters of the cameras are first obtained using a calibration plate, so as to determine the position and orientation of the cameras; then, three reference points with known positions in the scene are selected and their corresponding positions are found in the two parallax images, where these points should form a triangle; finally, based on the principle of similar triangles, the distance and angle between any two points and the depth value of each pixel are calculated from the known camera parameters, the positions of the reference points and the angles in the image.
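As a simplified stand-in for the calibration-plate and reference-point triangulation described above, the sketch below uses the standard rectified-stereo relation depth = focal length × baseline / disparity; the focal length and baseline values are purely illustrative assumptions and are not taken from the patent.

```python
# Minimal sketch: the standard rectified-stereo relation depth = f * B / d,
# used here as a simplified stand-in for the reference-point triangulation
# described above. Focal length and baseline are illustrative assumptions.
def disparity_to_depth(disparity_px, focal_length_px=800.0, baseline_m=0.06, eps=1e-6):
    """Convert a per-pixel disparity (in pixels) to metric depth (in metres)."""
    return focal_length_px * baseline_m / max(disparity_px, eps)

# A pixel that shifts 12 px between the two views, with an ~800 px focal length
# and a 6 cm baseline, lies roughly 4 metres from the cameras.
print(disparity_to_depth(12.0))   # 4.0
```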
As a preferable scheme of the stereoscopic vision video communication method applicable to the computing power resource constraint environment, the invention comprises the following steps: the two low-resolution parallax images obtained in the previous step are further processed to enhance their image quality; firstly, edge information of the low-resolution image is detected using an edge detection operator: a transverse convolution kernel and a longitudinal convolution kernel are designed to detect transverse and longitudinal edges respectively, and the main edge shapes, directions, connectivity and other characteristics in the low-resolution image are marked by setting a threshold; then, a multi-layer neural network is used to extract a multi-layer feature map of the image, containing feature information at different levels of abstraction, where the first layers extract low-level edge and texture features, the middle layers obtain intermediate semantic features, and the high layers obtain global content features, while a difference mechanism is used to obtain a residual map between the image and a high-resolution image; next, cross-level nonlinear mapping is performed on the low-resolution image to further expand its resolution, and the up-sampled image is input into the multi-layer neural network model to generate a high-resolution image rich in detail; finally, the central pixel value is replaced by the average value of the pixels in its neighborhood, further effectively suppressing isolated noise points, artifacts and the like.
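A minimal sketch of the edge-detection step is given below; the Sobel-style kernels and the relative threshold are assumptions, since the text only specifies a transverse and a longitudinal convolution kernel with a threshold.

```python
# Minimal sketch of the edge-detection step: Sobel-style kernels and a fixed
# relative threshold. Kernel choice and threshold value are assumptions.
import numpy as np
from scipy.signal import convolve2d

KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=np.float64)   # responds to longitudinal (vertical) edges
KY = KX.T                                        # responds to transverse (horizontal) edges

def edge_mask(gray, threshold=0.25):
    """Return a boolean mask of the dominant edges in a grayscale image in [0, 1]."""
    gx = convolve2d(gray, KX, mode="same", boundary="symm")
    gy = convolve2d(gray, KY, mode="same", boundary="symm")
    magnitude = np.hypot(gx, gy)                 # combined edge strength
    return magnitude > threshold * magnitude.max()
```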
The invention has the beneficial effects that:
The invention provides a stereoscopic vision video communication method suitable for a computing power resource constraint environment, which solves the problem that traditional three-dimensional stereoscopic vision video communication methods depend heavily on high-precision sensor arrays to achieve high image quality. Without reducing the fidelity or resolution of the final presented image, the method reduces the consumption of network bandwidth, GPU computing power and storage space, further compresses hardware cost, reduces processing delay, and is suitable for execution environments with computing power resource constraints. The nonlinear-mapping low-resolution image smoothing method provided by this scheme can effectively suppress isolated noise points, artifacts and the like, reducing resource consumption while preserving the fidelity and resolution of the final presented image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
Fig. 1 is a flowchart of three-dimensional model reconstruction and rendering performed by the GPU after the high-definition video stream is transmitted to the cloud in the prior art.
Fig. 2 is a flowchart of three-dimensional model reconstruction and rendering performed by the GPU after the video stream is transmitted to the cloud in the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Referring to fig. 1-2, an embodiment of the present invention provides a stereoscopic video communication method suitable for a computing power resource constraint environment, in which low-precision, low-resolution data acquisition equipment is used for data processing and transmission, so that bandwidth consumption is greatly reduced;
A parallax image rendering method renders, based on eye tracking, only two low-resolution parallax images, and image quality enhancement is performed after rendering, so that parallax image quality is improved while hardware cost and processing delay are reduced;
A nonlinear-mapping low-resolution image smoothing method effectively suppresses isolated noise points, artifacts and the like; without reducing the fidelity or resolution of the final image, the method reduces the consumption of network bandwidth, GPU computing power and storage space and further compresses hardware cost, so that it is suitable for an execution environment with computing power resource constraints.
Three-dimensional stereoscopic vision presentation with high image quality is an important direction of technical innovation in the field of video communication; it can enhance user immersion, enable higher artistic expression, improve interactive experience and promote industrial development. To achieve this goal, it is generally necessary to perform depth data acquisition with a high-precision RGBD camera: the target object or scene is scanned by emitting laser, the reflected laser signal is received, and the shape and structure of the object surface are determined from the intensity and time of the reflected signal.
In practical applications, the factors affecting the final scan quality include: video resolution, i.e. the size and sharpness of the video image, usually expressed as a number of pixels, where a higher resolution gives better image quality; frame rate, i.e. the number of frames contained in the video per second, where a higher frame rate gives a smoother picture; color depth, i.e. the number of bits of color information per pixel, where a higher color depth gives finer color expression; chroma sampling rate, i.e. the proportional relationship between the luminance and chroma information sampled for each pixel in digital video coding, including 4:4:4, 4:2:2 and 4:2:0; and video coding, i.e. the way a video signal is converted into a digital signal and compressed, including H.264, H.265 and VP9.
Taking the data processing in Google Starline as an example, the resolution of each RGB camera is 1600 x 1200, the frame rate is 60 fps, the color depth is 24 bits and the chroma sampling mode is YUV420, with four such cameras in total; the resolution of each Depth camera is 1280 x 1024, the frame rate is 60 fps, the color depth is 30 bits and the chroma sampling mode is YUV420, with three such cameras in total; both kinds of camera use H.265 coding with a data compression ratio of 100. The required network bandwidth can be calculated using the following formula: network bandwidth required for video transmission = number of video streams × video resolution × frame rate × color depth × chroma sampling factor × coding efficiency. Based on this formula, the network bandwidth required for RGB video transmission is 4 × 1600 × 1200 × 60 × 24 × 0.5 × 0.01 = 52.734375 Mbps, and the network bandwidth required for Depth video transmission is 3 × 1280 × 1024 × 60 × 30 × 0.5 × 0.01 = 33.75 Mbps; thus, the network bandwidth required for the entire video transmission is 52.734375 Mbps + 33.75 Mbps = 86.484375 Mbps.
As described above, after the high-definition video stream is transmitted to the cloud, the GPU performs three-dimensional model reconstruction and rendering, which mainly includes depth fusion, color fusion and parallax image rendering. Specifically, the depth images are first aligned: since depth images acquired from different viewing angles may contain distortion and errors, adjacent-frame depth images are matched by feature point matching so that depth information from different viewing angles corresponds to the same position in three-dimensional space. Then, depth information from different viewing angles is combined by a weighted-average-based method to obtain a more accurate and complete three-dimensional model; in this process, the internal and external parameters of the cameras, distortion correction, the precision of the depth images, and possible noise and outliers are taken into account, and data cleaning and filtering are performed to ensure that the generated three-dimensional point cloud has high quality and high precision. Next, RGB color information from different viewing angles is combined to obtain more real and natural colors, a texture map is generated from the combined color information and mapped onto the three-dimensional model, thereby realizing color mapping; consistency processing of color and brightness is also required to ensure that the finally generated three-dimensional model has high quality and high precision and to avoid discontinuities in color and brightness. Finally, the three-dimensional model and texture map are rendered into a two-dimensional image for presentation on a screen.
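By way of a hedged illustration of how fused depth and color could feed the later texture-mapping and rendering stages, the sketch below back-projects a depth map through assumed pinhole intrinsics and attaches the fused RGB colors to obtain a colored point cloud; the intrinsics and the zero-as-invalid depth convention are assumptions, not the patent's specification.

```python
# Minimal sketch (assumptions, not the patent's algorithm): back-project a fused
# depth map through pinhole intrinsics and attach fused RGB colors, yielding a
# colored point cloud for the subsequent texture-mapping and rendering stages.
import numpy as np

def depth_rgb_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """depth: HxW metric depth map; rgb: HxWx3 image; returns (N, 3) points and (N, 3) colors."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                            # pinhole back-projection
    y = (v - cy) * z / fy
    valid = z > 0                                    # drop pixels with no depth
    points = np.stack([x[valid], y[valid], z[valid]], axis=1)
    colors = rgb[valid]                              # boolean mask keeps matching (N, 3) colors
    return points, colors
```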
As shown in fig. 1, the high-precision data acquisition and image rendering process is the main consumer of computational resources. Analyzing from first principles shows that the invention only needs to ensure that the quality of the two parallax images seen by the end user is high enough, and does not need to care whether the initially acquired or stored model is complete and high-definition.
Based on this, as shown in fig. 2, the invention adopts a low-resolution data acquisition and parallax image rendering method and adds an image quality enhancement step after rendering, so as to improve parallax image quality, reduce the consumption of network bandwidth, GPU computing power and storage space without reducing the fidelity or resolution of the final presented image, and further compress hardware cost, thereby being suitable for execution environments with computing power resource constraints.
Specifically, the invention adopts low-precision RGB cameras with a resolution of 1280 x 720, a frame rate of 30 fps, a color depth of 24 bits and YUV420 chroma sampling, four cameras in total; meanwhile, low-precision Depth cameras with a resolution of 640 x 480, a frame rate of 30 fps, a color depth of 16 bits and YUV420 chroma sampling are adopted, three cameras in total; both kinds of camera use H.265 coding with a data compression ratio of 100. According to the above formula, the network bandwidth required for RGB video transmission is 4 × 1280 × 720 × 30 × 24 × 0.5 × 0.01 = 12.65625 Mbps, and the network bandwidth required for Depth video transmission is 3 × 640 × 480 × 30 × 16 × 0.5 × 0.01 = 2.109375 Mbps;
therefore, the network bandwidth required for the entire video transmission is 12.65625 Mbps + 2.109375 Mbps = 14.765625 Mbps, which is greatly reduced compared with the 86.484375 Mbps required in the prior art.
In terms of image rendering, the invention does not need to render a high-quality full-volume image; it only needs to render two parallax images at ordinary resolution. The parallax images represent two different views of the same scene: the invention divides the image into a number of small areas and searches for the best match in each area, and depth information is obtained by calculating the offset of each pixel between the two views. The depth value of each pixel can be calculated from the parallax and the camera parameters. Specifically, the invention first obtains the internal and external parameters and the distortion parameters of the cameras using a calibration plate, so as to determine the position and orientation of the cameras; then, three reference points with known positions in the scene are selected and their corresponding positions are found in the two parallax images, where these points should form a triangle; finally, based on the principle of similar triangles, the distance and angle between any two points and the depth value of each pixel are calculated from the known camera parameters, the positions of the reference points and the angles in the image. After obtaining these pixel depth values, the invention shifts and renders the original image according to the depth values, creating 2D images for the left eye and the right eye respectively, where the two parallax images are still low resolution.
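A minimal sketch of the block-matching idea follows, assuming a rectified pair, a fixed block size and a fixed horizontal search range, none of which the text specifies.

```python
# Minimal sketch of block matching: for each small block in the left view, search a
# horizontal window in the right view and keep the offset with the smallest sum of
# absolute differences. Block size and search range are assumptions.
import numpy as np

def block_matching_disparity(left, right, block=8, max_disp=32):
    """left, right: HxW grayscale arrays of a rectified pair; returns a per-block disparity map."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    disp = np.zeros((h // block, w // block))
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            ref = left[y0:y0 + block, x0:x0 + block]
            best_cost, best_d = np.inf, 0
            for d in range(0, min(max_disp, x0) + 1):            # candidate horizontal offsets
                cand = right[y0:y0 + block, x0 - d:x0 - d + block]
                cost = np.abs(ref - cand).sum()                  # sum of absolute differences
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[by, bx] = best_d
    return disp
```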
As described above, although the user does not care whether the initially acquired or stored model is complete and high-definition, the quality of the two parallax images finally seen must be sufficiently high; therefore, the two low-resolution parallax images obtained in the previous step need to be further processed to enhance their image quality. Firstly, edge information of the low-resolution image is detected using an edge detection operator: two convolution kernels, a transverse one and a longitudinal one, are designed to detect transverse and longitudinal edges respectively, and the main edge shapes, directions, connectivity and other characteristics in the low-resolution image are marked by setting a threshold. Then, a multi-layer neural network is used to extract a multi-layer feature map of the image, containing feature information at different levels of abstraction: the first layers extract low-level edge and texture features, the middle layers obtain intermediate semantic features, and the high layers obtain global content features, while a difference mechanism is used to obtain a residual map between the image and a high-resolution image. Next, cross-level nonlinear mapping is performed on the low-resolution image to further expand its resolution, and the up-sampled image is input into the multi-layer neural network model to generate a high-resolution image rich in detail. Finally, the central pixel value is replaced by the average value of the pixels in its neighborhood, further effectively suppressing isolated noise points, artifacts and the like.
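A minimal sketch of the final neighborhood-averaging step; the 3 x 3 window size is an assumption, as the text only refers to the average value of pixels in the neighborhood.

```python
# Minimal sketch of the final smoothing step: replace each pixel with the mean of
# its 3x3 neighbourhood to suppress isolated noise points and artifacts.
# The window size is an assumption of this sketch.
import numpy as np
from scipy.ndimage import uniform_filter

def smooth_parallax_image(image, size=3):
    """image: HxW or HxWxC float array; returns the neighbourhood-mean-filtered image."""
    if image.ndim == 3:                        # filter each color channel independently
        return np.stack([uniform_filter(image[..., c], size=size)
                         for c in range(image.shape[-1])], axis=-1)
    return uniform_filter(image, size=size)
```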
When in practical application:
The method can be applied to instant video communication, creating a sense of being close at hand for people thousands of miles apart, improving the user experience and laying a foundation for the formation of the immersive video communication industry;
The method can be applied to medical image reconstruction, improving the resolution and quality of medical images and thereby helping doctors diagnose and treat diseases more accurately; for example, the method provided by the invention can be used to reconstruct medical images such as CT and MRI, helping doctors better observe lesion locations;
The method can be applied to intelligent monitoring, improves the resolution and quality of a monitoring video, ensures that a monitoring picture is clearer, and can help security personnel to better observe the monitoring picture and find abnormal conditions;
The method can be applied to intelligent traffic to realize intelligent traffic management, for example, the method provided by the invention can be used for simulating traffic scenes, predicting traffic flow, optimizing traffic routes and the like;
the method can be applied to aerospace, improves the resolution and quality of satellite images, and helps scientists to better observe the changes of the earth surface and the climate change.
Meanwhile, the method provided by the invention can be used for carrying out three-dimensional modeling on the aircraft, and helps a designer to better design, test and optimize the aircraft.
The scheme has good practicability and can create considerable value for society and enterprises.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims (10)

1. A stereoscopic vision video communication method suitable for a computing power resource constraint environment is characterized in that:
The method uses low-precision and low-resolution data acquisition equipment to process and transmit data, thereby greatly reducing bandwidth consumption;
Only two low-resolution parallax images are rendered based on eye tracking by a parallax image rendering method, and image quality enhancement is performed after rendering, so that the parallax image quality is improved, the hardware cost is reduced, and the processing time delay is reduced;
The nonlinear mapping low-resolution image smoothing processing method effectively suppresses isolated noise points and artifacts, reduces the consumption of network bandwidth, GPU computing power and storage space while not reducing the final image fidelity and resolution, and further compresses hardware cost, so that the method is suitable for an execution environment with computing power resource constraint.
2. The stereoscopic video communication method applicable to a computing power resource constrained environment according to claim 1, wherein: and after the high-definition video stream is transmitted to the cloud, the GPU performs three-dimensional model reconstruction and rendering, and mainly comprises depth fusion, color fusion and parallax image rendering.
3. The stereoscopic video communication method applicable to a computing power resource constrained environment according to claim 1, wherein: the three-dimensional model reconstruction and rendering method comprises the following specific steps:
S1: performing depth image alignment; because Depth images acquired under different view angles may have distortion and errors, the Depth images need to be aligned by matching adjacent frame Depth images and using a characteristic point matching mode, so that Depth information from different view angles corresponds to the same position in a three-dimensional space;
S2: combining depth information from different visual angles by a weighted average-based method to obtain a more accurate and complete three-dimensional model;
S3: combining RGB color information from different visual angles to obtain more real and natural colors, generating a texture map by the combined color information, and mapping the texture map onto a three-dimensional model, so that the mapping of the colors is realized;
S4: the three-dimensional model and texture map are rendered into a two-dimensional image for presentation on a screen.
4. The stereoscopic video communication method applicable to a computing power resource constrained environment according to claim 1, wherein: in step S3, in order to ensure that the finally generated three-dimensional model has high quality and high precision, consistency processing of color and brightness is performed at the same time, so that the situation of discontinuous color and brightness is avoided.
5. The stereoscopic video communication method applicable to a computing power resource constrained environment according to claim 1, wherein: in the process of combining depth information from different viewing angles by the weighted-average-based method, data cleaning and filtering are performed taking into account the internal and external parameters of the cameras, distortion correction, the precision of the depth images, and possible noise and outliers, so as to ensure that the generated three-dimensional point cloud has high quality and high precision.
6. The stereoscopic video communication method applicable to a computing power resource constrained environment according to claim 1, wherein: a low-precision RGB camera with a resolution of 1280 x 720 is adopted, with a frame rate of 30 fps, a color depth of 24 bits and YUV420 chroma sampling, four such cameras in total; meanwhile, a low-precision Depth camera with a resolution of 640 x 480 is adopted, with a frame rate of 30 fps, a color depth of 16 bits and YUV420 chroma sampling, three such cameras in total; both kinds of camera use H.265 coding with a data compression ratio of 100; the required network bandwidth is calculated using the following formula: network bandwidth required for video transmission = number of video streams × video resolution × frame rate × color depth × chroma sampling factor × coding efficiency.
7. The stereoscopic video communication method applicable to a computing power resource constrained environment according to claim 1, wherein: based on the above formula, for the prior-art configuration the network bandwidth required for RGB video transmission is 4 × 1600 × 1200 × 60 × 24 × 0.5 × 0.01 = 52.734375 Mbps;
the network bandwidth required for Depth video transmission is 3 × 1280 × 1024 × 60 × 30 × 0.5 × 0.01 = 33.75 Mbps;
thus, the network bandwidth required for the entire prior-art video transmission is 52.734375 Mbps + 33.75 Mbps = 86.484375 Mbps;
for the method of the invention, according to the same formula, the network bandwidth required for RGB video transmission is 4 × 1280 × 720 × 30 × 24 × 0.5 × 0.01 = 12.65625 Mbps;
the network bandwidth required for Depth video transmission is 3 × 640 × 480 × 30 × 16 × 0.5 × 0.01 = 2.109375 Mbps;
therefore, the network bandwidth required for the entire video transmission is 12.65625 Mbps + 2.109375 Mbps = 14.765625 Mbps, which is greatly reduced compared with the 86.484375 Mbps required in the prior art.
8. The stereoscopic video communication method applicable to the computing power resource constrained environment according to claim 7, wherein: the image is divided into a number of small areas and the best match is found in each area, depth information is obtained by calculating the offset of each pixel between the two viewing angles, the original image is shifted and rendered according to the depth values, and 2D images for the left and right eye are created respectively, while the two parallax images are still low resolution.
9. The stereoscopic video communication method applicable to the computing power resource constraint environment according to claim 8, wherein: the depth value of each pixel can be calculated according to the parallax and the camera parameters, specifically, firstly, the internal and external parameters and the distortion parameters of the camera are obtained by using a calibration plate, so that the position and the orientation of the camera are determined; then, three reference points with known positions in the scene are selected, and the corresponding positions of the points are found in the two parallax images, wherein the points should form a triangle; finally, based on the principle of similar triangle, the distance and angle between any two points and the depth value of the pixel point are calculated according to the known camera parameters, the position of the reference point and the angle in the image.
10. The stereoscopic video communication method applicable to the computing power resource constrained environment according to claim 9, wherein: the two low-resolution parallax images obtained in the previous step are further processed to enhance their image quality; firstly, edge information of the low-resolution image is detected using an edge detection operator: a transverse convolution kernel and a longitudinal convolution kernel are designed to detect transverse and longitudinal edges respectively, and the main edge shapes, directions and connectivity characteristics in the low-resolution image are marked by setting a threshold; then, a multi-layer neural network is used to extract a multi-layer feature map of the image, containing feature information at different levels of abstraction, where the first layers extract low-level edge and texture features, the middle layers obtain intermediate semantic features, and the high layers obtain global content features, while a difference mechanism is used to obtain a residual map between the image and a high-resolution image; next, cross-level nonlinear mapping is performed on the low-resolution image to further expand its resolution, and the up-sampled image is input into the multi-layer neural network model to generate a high-resolution image rich in detail; finally, the central pixel value is replaced by the average value of the pixels in its neighborhood, further effectively suppressing isolated noise points and artifacts.
CN202410527594.5A 2024-04-29 2024-04-29 Stereoscopic vision video communication method suitable for computing power resource constraint environment Pending CN118354053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410527594.5A CN118354053A (en) 2024-04-29 2024-04-29 Stereoscopic vision video communication method suitable for computing power resource constraint environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410527594.5A CN118354053A (en) 2024-04-29 2024-04-29 Stereoscopic vision video communication method suitable for computing power resource constraint environment

Publications (1)

Publication Number Publication Date
CN118354053A true CN118354053A (en) 2024-07-16

Family

ID=91813600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410527594.5A Pending CN118354053A (en) 2024-04-29 2024-04-29 Stereoscopic vision video communication method suitable for computing power resource constraint environment

Country Status (1)

Country Link
CN (1) CN118354053A (en)

Similar Documents

Publication Publication Date Title
Tian et al. NIQSV+: A no-reference synthesized view quality assessment metric
Battisti et al. Objective image quality assessment of 3D synthesized views
Tian et al. A benchmark of DIBR synthesized view quality assessment metrics on a new database for immersive media applications
CN102592275B (en) Virtual viewpoint rendering method
Ding et al. Efficient dark channel based image dehazing using quadtrees
CN110866882B (en) Layered joint bilateral filtering depth map repairing method based on depth confidence
CN110381268A (en) method, device, storage medium and electronic equipment for generating video
Tian et al. Quality assessment of DIBR-synthesized views: An overview
CN107545570B (en) A kind of half with reference to figure reconstructed image quality evaluation method
JP2018528733A (en) Video frame conversion from 2D to 3D
CN109218706B (en) Method for generating stereoscopic vision image from single image
CN108600730A (en) A kind of remote plotting method based on composograph quality metric
Huang et al. Light field image quality assessment: An overview
CN116091314A (en) Infrared image stitching method based on multi-scale depth homography
JP2022533754A (en) Method, apparatus, and computer program product for volumetric video encoding and decoding
US20230084376A1 (en) Information processing device, 3d model generation method, and program
CN111369435B (en) Color image depth up-sampling method and system based on self-adaptive stable model
Yang et al. Exploring the influence of view and camera path selection for dynamic mesh quality assessment
CN116363094A (en) Super-resolution reconstruction image quality evaluation method
Rossi et al. A nonsmooth graph-based approach to light field super-resolution
CN118354053A (en) Stereoscopic vision video communication method suitable for computing power resource constraint environment
Seitner et al. Trifocal system for high-quality inter-camera mapping and virtual view synthesis
Peng et al. Depth video spatial and temporal correlation enhancement algorithm based on just noticeable rendering distortion model
CN114018214A (en) Marker binocular sub-pixel distance measurement method based on hardware acceleration system
Fang et al. Detail maintained low-light video image enhancement algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination