CN114897681A - Multi-user free-view video method and system based on real-time virtual view interpolation - Google Patents

Multi-user free-view video method and system based on real-time virtual view interpolation

Info

Publication number
CN114897681A
CN114897681A (application number CN202210419565.8A)
Authority
CN
China
Prior art keywords
view
color texture
texture image
virtual
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210419565.8A
Other languages
Chinese (zh)
Inventor
宋利
胡经川
解蓉
张文军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202210419565.8A priority Critical patent/CN114897681A/en
Publication of CN114897681A publication Critical patent/CN114897681A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4023Decimation- or insertion-based scaling, e.g. pixel or line decimation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4046Scaling the whole image or part thereof using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234309Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4 or from Quicktime to Realvideo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440218Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643Communication protocols
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g 3D video
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00Details of colour television systems
    • H04N9/64Circuits for processing colour signals
    • H04N9/67Circuits for processing colour signals for matrixing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing

Abstract

The invention provides a real-time virtual view interpolation method comprising the following steps: acquiring bidirectional optical flows from a first color texture image and a second color texture image to a virtual third color texture image at the view position requiring interpolation; acquiring a visibility mask matrix describing how visible the first and second color texture images are in the virtual third color texture image; warping the first and second color texture images to the virtual third color texture image position based on the bidirectional optical flows; obtaining a preliminary virtual third color texture image at that position based on the visibility mask and optimizing it to obtain the final interpolated virtual third color texture image; and iterating these steps to interpolate any number of virtual views at an exponential rate. The invention is lightweight and efficient, can interpolate high-quality virtual intermediate views in real time with few computing resources, can be conveniently deployed on an edge server or a client, and is very well suited to free-view video systems.

Description

Multi-user free-view video method and system based on real-time virtual view interpolation
Technical Field
The invention relates to the field of immersive media, specifically free-view video systems, and in particular to a multi-user free-view video method and system based on real-time virtual view interpolation.
Background
Free-view technology, as a representative form of interactive immersive media, allows viewers to watch a visual scene from an arbitrarily selected direction and viewpoint according to their own needs, without being limited by the positions of the cameras used for shooting. Compared with inside-out VR technology, the outside-in interaction of free-view technology gives users a stronger sense of three-dimensionality and more direct interaction. Free-view video is a new immersive media form with strong interactivity; its appeal goes beyond typical application scenarios such as cloud gaming and remote virtual reality, and it is expected to thoroughly change the way visual content is consumed.
Generally, a free-view system includes four parts: a multi-view acquisition system, free-view content production, encoding and transmission, and a client. The multi-view acquisition system provides multi-angle, multi-directional video source information for the free-view system; however, due to limits on hardware cost and data volume, the acquisition system can only use a sparse, limited number of cameras, and virtual view synthesis techniques aim to recover the viewpoint information that was not captured from the limited captured views. The DIBR (Depth Image-Based Rendering) method is the view synthesis method most commonly used in free-view systems. However, because of the occlusions and holes introduced by three-dimensional image warping, the synthesized result is often unsatisfactory, and obtaining accurate depth maps is itself a great challenge.
Free-view systems can be generally classified into two models, central and distributed.
In the central model, the viewpoints required by different users are synthesized on the server side. Existing real-time view synthesis methods require considerable computing resources, so one server can only serve a limited number of user terminals. As the number of connected users grows, the number of servers must grow accordingly. This model has difficulty handling high-concurrency scenarios and introduces additional response delay during interaction.
The distributed model can serve multiple users simultaneously because it performs view synthesis on the client. However, the Multiview Video Plus Depth (MVD) representation required for view synthesis must be transmitted to the user, which can demand high transmission bandwidth. Furthermore, the view synthesis method requires substantial processing power, which is unfriendly to low-end user terminals.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a multi-user free view video method and system based on real-time virtual view interpolation.
According to an aspect of the present invention, there is provided a real-time virtual perspective interpolation method, comprising:
S1, acquiring bidirectional optical flows F_1, F_2 between the first color texture image and the second color texture image and a virtual third color texture image at the view position requiring interpolation in space, and a visibility mask matrix describing how visible the first and second color texture images are in the virtual third color texture image;
s2 warping the first color texture image and the second color texture image to the virtual third color texture image position, respectively, based on the bi-directional optical flow, and obtaining a primary virtual third color texture image at the virtual third color texture image position based on the visibility mask;
s3, optimizing the primary virtual third color texture image to obtain a final interpolated virtual third color texture image;
S4, repeating iterations of S1-S3 to interpolate arbitrarily dense virtual views at an exponential rate.
Preferably, the acquiring of the bidirectional optical flows between the first color texture image and the second color texture image and the virtual third color texture image at the view position requiring interpolation in space includes:
taking the first color texture image and the second color texture image as input to the VINet network, down-sampling them to one quarter of the original image size, and having a first-level VIBlock compute a low-resolution initial optical flow and mask, where the optical flows and the mask form tensors on different channels of the VIBlock output;
up-sampling the initial optical flow and mask to one half of the original image size, and having a second-level VIBlock estimate the residuals of the optical flow and mask at this resolution;
up-sampling the optical flow and mask refined by the second-level residual to the original image size, and having a third-level VIBlock compute the residuals of the bidirectional optical flows F_1, F_2 and of the mask.
Preferably, the warping of the first color texture image and the second color texture image to the virtual third color texture image position, respectively, based on the bidirectional optical flows includes:
warping the pixels of the first color texture image to the virtual third color texture image pixel positions based on the optical flow F_1 to obtain a warped color texture image I_{l→m}, with the warping formula:
I_{l→m}(u, v) = I_1(x, y), where (u, v) = (x, y) + F_1(x, y);
warping the pixels of the second color texture image to the virtual third color texture image pixel positions based on the optical flow F_2 to obtain a warped color texture image I_{r→m}, with the warping formula:
I_{r→m}(u, v) = I_2(x, y), where (u, v) = (x, y) + F_2(x, y).
preferably, the obtaining a virtual third color texture image at a virtual third color texture image position based on the visibility mask includes:
two distorted color texture images I l→m 、I r→m Carrying out weighted summation by taking the visibility mask M as weight to calculate the pixel value of each pixel point to obtain the virtual third color texture image I mid The formula is as follows:
I mid =M⊙I l→m +(1-M)⊙I r→m
preferably, the optimizing the virtual third color texture image to obtain a final interpolated intermediate virtual view includes:
extracting high-order context information of the first color texture image and the second color texture image through a convolutional neural network;
and inputting the high-order context information serving as reference information into a sub-convolution neural network to obtain a residual error of a final virtual third color texture image, reducing the sensitivity of the whole algorithm to the estimated optical flow, adding the primarily obtained virtual third color texture image and the residual error, and refining the quality of a final interpolated intermediate virtual view.
Preferably, the repeating of iterations S1-S3 to interpolate arbitrarily dense virtual views at an exponential rate includes:
interpolating the first color texture image and the second color texture image to obtain a virtual third color texture image;
interpolating the first color texture image and the virtual third color texture image to obtain a virtual fourth color texture image;
interpolating the second color texture image and the virtual third color texture image to obtain a virtual fifth color texture image;
repeating the iteration, with n stages of exponential interpolation yielding 2^n − 1 virtual views.
According to a second aspect of the present invention, there is provided a multi-user-oriented view interpolation-based free-view video method, comprising:
acquiring multi-frame color images collected by a plurality of image collectors facing the same calibration object;
combining the multi-frame color images pairwise;
for each pair, performing real-time interpolation using the real-time virtual view interpolation method described above to obtain virtual views;
performing self-adaptive splicing on the multi-frame color image and the virtual view to form a multi-view cluster on a space domain;
coding the multi-view cluster, dividing the multi-view cluster into video segments in a time domain, and transmitting the video segments through an HLS protocol;
and different clients download and switch viewpoints by interactively selecting the view angle clusters needing to be watched.
Preferably, the adaptively splicing the multi-frame color image and the virtual view to form a multi-view cluster in a spatial domain includes:
a plurality of virtual views which are interpolated exist around each image collector;
splicing the plurality of virtual views and the color texture image collected by the image collector into one large color texture image, where the color texture image collected by the image collector is at high resolution and the color texture images at the virtual views are at low resolution;
each view exists as a tile within a multi-view cluster; the multi-view clusters are uniformly distributed over the spatial domain, and an overlapping area exists between multi-view clusters to ensure continuity of switching.
For each image collector, dense virtual views are interpolated at its adjacent positions, and the high-resolution color texture image collected by the image collector is stitched together with several adjacent low-resolution virtual-view color texture images into one large color texture image, so that each view exists as a tile and forms a multi-view cluster. Since several interpolated dense virtual views surround each image collector, each image collector has a corresponding multi-view cluster; the multi-view clusters are uniformly distributed over the spatial domain, and an overlapping area exists between multi-view clusters to ensure continuity of switching.
Preferably, encoding the multi-view cluster for temporal segmentation into video segments for transmission via the HLS protocol includes:
for each multi-view cluster, performing time domain slicing in time sequence after coding;
each slice is divided into a fixed time size to form a plurality of time domain-space domain joint distributed video clips;
the video segments are transmitted using a segment-based transmission protocol; both DASH and HLS are applicable.
Preferably, the downloading and viewpoint switching of the view clusters to be watched by the different clients through interactive selection of the view clusters to be watched includes:
each user interactively selects the view to be watched and downloads the corresponding view-cluster segment according to a global view index (the view index contains the position information of the different views within the multi-view clusters, and the corresponding view tile can be found through the index); within one video slice duration, the user can switch among the views of that view cluster;
after one video slicing time, the user selects other view clusters to download.
According to a third aspect of the present invention, there is provided a multi-user-oriented view interpolation-based free-view video system, comprising:
the system comprises an acquisition module, a calibration module and a control module, wherein the acquisition module is used for acquiring multi-frame color texture images acquired by a plurality of image acquisition devices facing the same calibration object, and the frame images acquired by the image acquisition devices are quasi-synchronous;
the cloud processing and content distribution network module is used for performing cloud processing on the multi-frame color texture images collected by the plurality of image collectors and distributing the multi-frame color texture images to the edge server through the content distribution network;
the edge server module interpolates a dense virtual view in real time by utilizing a neural network based on the acquired frame image, performs self-adaptive splicing on the acquired frame image and the interpolated frame image to form multi-view clusters which are uniformly distributed on a space domain, then encodes the multi-view clusters, segments the multi-view clusters into video segments in a time domain, and transmits the video segments through an HLS protocol;
and the client module is used for enabling a user to interactively select the view angle cluster to be watched to download and switch the view point.
Preferably, the edge server module comprises:
the visual angle interpolation unit interpolates a dense virtual view in real time by utilizing a neural network based on the collected frame image;
the adaptive visual angle splicing unit carries out adaptive splicing on the acquired frame image and the interpolated frame image to form a multi-visual angle cluster which is uniformly distributed on a spatial domain;
the encoding unit is used for encoding and compressing the multi-view cluster frame to reduce the data volume;
an HLS transmission unit, which divides the coded multi-view cluster into video segments in time domain and transmits the video segments through an HLS protocol.
Preferably, the client module comprises a user interaction and display unit that provides a viewing and interaction interface for the user; the user can generate interaction signaling through a key press or by sliding the screen, and the view image selected by the signaling is rendered on the display;
the video slice downloading unit can select the multi-view cluster video slices containing the corresponding view angles provided by the edge server to download according to the interactive signaling;
the video slice decoding unit decodes the downloaded video slices into original YUV format files;
a view extraction unit that extracts a view image tile to be viewed from the multi-view cluster.
Compared with the prior art, the invention has the following beneficial effects:
the embodiment of the invention provides a real-time virtual visual angle interpolation method, which is light and efficient, can interpolate a high-quality virtual intermediate visual angle view in real time by utilizing few computing resources, can be conveniently deployed at an edge server end or a client end, and is very friendly to a free visual angle video system.
The embodiment of the invention provides a free visual angle video method facing multiple users, which applies the real-time virtual visual angle interpolation method to free visual angle video application, and a system built based on the method decouples the number of accessed users and the load of an edge server, so that a single edge server can provide personalized free visual angle video service for multiple users.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic flow chart of a real-time virtual view interpolation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a real-time virtual view interpolation method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of the algorithm structure of RefineNet according to the embodiment of the present invention;
FIG. 4 is a schematic flow chart of a multi-user-oriented view interpolation-based free-view video method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an arrangement of cameras according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a view-angle cluster organization scheme according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a time-space video slicing method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a client interaction logic of the free-view video system according to an embodiment of the present invention;
FIG. 9 is a block diagram illustrating the components and architecture of a free-view video system according to an embodiment of the present invention;
fig. 10 is a schematic diagram illustrating an output effect of the real-time virtual perspective interpolation method according to the embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the invention.
Given the deficiencies of the prior art, a free-view system needs a lightweight, easily deployed viewpoint synthesis method that can synthesize virtual viewpoint views in real time; in addition, building a real-time free-view system that can serve many users simultaneously at a low deployment cost remains a significant open challenge.
Based on the above concept, the present invention provides, in one embodiment, a real-time virtual view interpolation method; fig. 1 is a schematic flow diagram of the method and fig. 2 is an architecture diagram of the method. The real-time virtual view interpolation method comprises the following steps:
S100: for the spatially adjacent left and right views, simultaneously calculate the pixel offsets to the intermediate virtual view, namely the optical flows, and a visibility mask matrix, using VINet;
S200: warp the left and right views to the virtual intermediate view position, respectively, based on the bidirectional optical flows;
S300: perform a weighted summation with the visibility mask as the weight to compute each pixel value and obtain a coarse virtual intermediate view;
S400: further refine it with the context-based sub-convolutional neural network RefineNet to obtain the final interpolated intermediate virtual view;
S500: iterate S100-S400 to interpolate arbitrarily dense virtual views at an exponential rate.
The present invention provides a preferred embodiment for performing S100. Two spatially adjacent image collectors, such as still cameras or video cameras, are used; the baseline between them should not be too large. They are referred to as the first image collector and the second image collector. The first color texture image collected by the first image collector is the left view, the second color texture image collected by the second image collector is the right view, and the two are collected synchronously. In this embodiment, the two image collectors are industrial cameras, the baseline distance is 40 cm, and the left and right views are both in RGB format.
For the two spatially adjacent image collectors, a designed convolutional-neural-network-based algorithm, VINet, simultaneously calculates the bidirectional per-pixel offsets between the first color texture image collected by the first image collector and the second color texture image collected by the second image collector and the virtual third color texture image at the view position requiring interpolation in space; the offsets are represented as two-dimensional vector matrices, namely the optical flows F_1, F_2. The computation includes:
S101, down-sample the left and right views to one quarter of the original image size, and have the first-level VIBlock calculate a coarse, low-resolution initial optical flow and mask;
S102, up-sample the first-level result (the coarse initial optical flow and mask) to one half of the original image size, and have the second-level VIBlock estimate the residuals of the optical flow and mask at this resolution;
S103, further up-sample the optical flow and mask refined by the second-level residual to the original image size, and have the third-level VIBlock compute the residuals of the optical flow and mask, obtaining F_1, F_2 and a visibility mask matrix M describing how visible the first and second color texture images are in the virtual third color texture image. This completes an efficient coarse-to-fine process.
The visibility mask matrix is output together with the optical flows: each optical flow is a two-channel tensor and the visibility mask matrix is a single-channel tensor, so the two optical flows and the mask matrix form a 5-channel tensor (2+2+1) output jointly by the network.
The whole process of S101-S103 in this embodiment is data-driven, and the virtual third color texture image position defaults to the middle position between the two image collectors, i.e. the final interpolation result is the intermediate view between the two cameras. On the one hand, the training data set is easy to obtain; on the other hand, the efficiency of the method is ensured (the real-time performance of the network is guaranteed because no parameters related to the view position need to be computed).
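As an illustration of the coarse-to-fine scheme of S101-S103, the following PyTorch sketch shows a three-level pyramid that outputs a 5-channel tensor (two 2-channel flows plus a 1-channel visibility mask) and refines it by residuals at 1/4, 1/2 and full resolution. The VIBlock internals, channel widths and the sigmoid on the mask channel are illustrative assumptions; the patented VINet architecture is not reproduced here.

```python
# Minimal coarse-to-fine sketch of the S101-S103 flow/mask estimation (assumed layout).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VIBlock(nn.Module):
    """Hypothetical block: predicts or refines flows + mask at one scale."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 5, 3, padding=1),   # 2 + 2 + 1 output channels
        )

    def forward(self, x):
        return self.net(x)


def upsample_estimate(est, scale=2):
    # flow magnitudes scale with resolution; the mask channel does not
    est = F.interpolate(est, scale_factor=scale, mode="bilinear", align_corners=False)
    return torch.cat([est[:, :4] * scale, est[:, 4:]], dim=1)


class VINetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = VIBlock(6)        # left + right RGB
        self.block2 = VIBlock(6 + 5)    # images + upsampled previous estimate
        self.block3 = VIBlock(6 + 5)

    def forward(self, left, right):
        imgs = torch.cat([left, right], dim=1)
        # S101: coarse estimate at 1/4 resolution
        x4 = F.interpolate(imgs, scale_factor=0.25, mode="bilinear", align_corners=False)
        est = self.block1(x4)
        # S102: up-sample to 1/2 resolution and estimate a residual
        x2 = F.interpolate(imgs, scale_factor=0.5, mode="bilinear", align_corners=False)
        est = upsample_estimate(est)
        est = est + self.block2(torch.cat([x2, est], dim=1))
        # S103: up-sample to full resolution and estimate the final residual
        est = upsample_estimate(est)
        est = est + self.block3(torch.cat([imgs, est], dim=1))
        flow1, flow2 = est[:, 0:2], est[:, 2:4]
        mask = torch.sigmoid(est[:, 4:5])         # visibility mask in [0, 1]
        return flow1, flow2, mask


if __name__ == "__main__":
    net = VINetSketch()
    left, right = torch.rand(1, 3, 180, 320), torch.rand(1, 3, 180, 320)
    f1, f2, m = net(left, right)
    print(f1.shape, f2.shape, m.shape)
```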
The present invention provides an embodiment for performing S200: warping the first color texture image and the second color texture image to the virtual third color texture image position, i.e., warping the left view and the right view to the intermediate view position based on the bidirectional optical flows, obtaining the picture I_{l→m} warped from the left view to the intermediate view and the picture I_{r→m} warped from the right view to the intermediate view.
Optical flow describes the pixel offsets from the left and right view images to the intermediate virtual view image as a two-dimensional vector field, giving the offsets of the image matrix in the horizontal and vertical directions, respectively. The warping process adds the pixel offset to the left or right view: specifically, from the left view to the intermediate virtual view there is a rightward pixel offset, and warping is equivalent to shifting the left view to the right by that offset, thereby describing the picture seen from the intermediate virtual view. This matrix describing the offsets, i.e. the optical flow, is estimated by the convolutional neural network method of the present invention.
The formula is as follows:
I_{l→m}(u, v) = I_1(x, y), where (u, v) = (x, y) + F_1(x, y)
wherein x, y represent the two-dimensional coordinates of the first color texture image pixel, and u, v represent the corresponding two-dimensional coordinates of the third color texture image pixel;
Similarly, from the right view to the intermediate virtual view there is an overall leftward pixel offset, and warping is equivalent to shifting the right view to the left by that offset, thereby describing the picture seen from the intermediate virtual view.
The formula is as follows:
I_{r→m}(u, v) = I_2(x, y), where (u, v) = (x, y) + F_2(x, y)
the present invention provides one embodiment to perform S300. Two distorted pictures I l→m 、I r→m The mapping of the left and right view pictures at the middle view position can represent the middle view to a certain extent. Since there may be some occluded or invisible areas, such as the leftmost area of the left view and the rightmost area of the right view, which may not be visible in the middle view, the information of the two warped views needs to be combined.
The visibility mask matrix M describes how the information of the two warped views should be fused, weighted and summed with the visibility mask as a weight. For example, the leftmost region of the left view should not be visible in the middle view, and after the warping, the leftmost region of the left warped view should be invalid information, where M is 0, and the whole visibility mask matrix M is learned through the convolutional neural network (the visibility mask matrix M is obtained in the embodiment S100). Specifically, the formula for describing the fusion process is as follows:
I_mid = M ⊙ I_{l→m} + (1 − M) ⊙ I_{r→m}
where ⊙ denotes element-wise multiplication and I_mid is the virtual intermediate view image.
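A minimal numerical sketch of the warping and fusion in S200-S300 is given below. It assumes the flows are provided as backward flows (offsets sampled at intermediate-view pixel positions), so that torch.nn.functional.grid_sample can perform the warping; the fusion follows the element-wise formula above.

```python
# Sketch of S200-S300: warp the two source views toward the virtual middle view and
# fuse them with the visibility mask M. Assumption: flow1/flow2 are backward flows
# (middle-view pixel -> source-view offset), which is one common way to implement
# the per-pixel offset described in the text.
import torch
import torch.nn.functional as F


def backward_warp(src, flow):
    """Sample src at (x, y) + flow(x, y) for every middle-view pixel (x, y)."""
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(src)   # (1, 2, H, W)
    coords = grid + flow                                               # absolute sample positions
    # normalise to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(src, norm_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


def fuse(left, right, flow1, flow2, mask):
    i_lm = backward_warp(left, flow1)    # I_{l->m}
    i_rm = backward_warp(right, flow2)   # I_{r->m}
    # I_mid = M * I_{l->m} + (1 - M) * I_{r->m}   (element-wise)
    return mask * i_lm + (1.0 - mask) * i_rm


if __name__ == "__main__":
    left, right = torch.rand(1, 3, 180, 320), torch.rand(1, 3, 180, 320)
    flow1, flow2 = torch.zeros(1, 2, 180, 320), torch.zeros(1, 2, 180, 320)
    mask = torch.full((1, 1, 180, 320), 0.5)
    print(fuse(left, right, flow1, flow2, mask).shape)
```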
The intermediate view I_mid obtained by fusion may contain artifacts caused by inaccuracies in the optical flow estimation and the visibility mask matrix estimation. To address this, the present invention provides an embodiment for performing S400, further refining the final interpolated intermediate virtual view with a context-based sub-convolutional neural network, RefineNet.
In this embodiment, the rich context information, i.e. the high-order feature information in the original left and right views, proves useful for repairing such artifacts, so the residual of the interpolated intermediate virtual view is calculated by the context-based sub-convolutional neural network RefineNet, further refining the quality of the interpolation result.
Fig. 3 shows the structure of RefineNet in this embodiment. Specifically, the procedure includes:
S401, a context extraction unit first extracts high-order context feature information at different scales from the left and right views;
S402, high-order features at different scales are extracted from the coarse virtual intermediate view and fused with the context information at the same scale. The whole RefineNet has a pyramid structure with skip connections between intermediate layers; the refinement is obtained in the form of residual estimation, and the coarse virtual intermediate view is finally added to the residual.
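The following toy sketch illustrates the residual-refinement idea of S401-S402: context features extracted from the original views are fused with the coarse intermediate view and a residual image is predicted and added. The actual RefineNet is a multi-scale pyramid with skip connections; the single-scale layer layout below is an assumption for illustration only.

```python
# Toy sketch of context-based residual refinement (not the patented RefineNet).
import torch
import torch.nn as nn


class RefineSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.context = nn.Sequential(              # context extraction from original views
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(                 # fuse context with the coarse middle view
            nn.Conv2d(32 + 3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1))        # residual image

    def forward(self, left, right, coarse_mid):
        ctx = self.context(torch.cat([left, right], dim=1))
        residual = self.fuse(torch.cat([ctx, coarse_mid], dim=1))
        return coarse_mid + residual               # refined intermediate view


if __name__ == "__main__":
    net = RefineSketch()
    l, r, m = (torch.rand(1, 3, 180, 320) for _ in range(3))
    print(net(l, r, m).shape)
```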
In one embodiment of the present invention, S500 is performed: by iterating S100-S400, arbitrarily dense virtual views can be interpolated at an exponential rate.
In this embodiment, an intermediate view can be interpolated from the left and right views; iterating the interpolation, a virtual view at the 1/4 position can be interpolated from the left view and the interpolated intermediate view, and a virtual view at the 3/4 position can be interpolated from the interpolated intermediate view and the right view.
Iterating in this way, n stages of exponential interpolation yield 2^n − 1 virtual views; in an embodiment of the invention, three levels are iterated in total, interpolating 7 virtual views between the two views.
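The iterative scheme can be written as a short recursion: n levels of midpoint interpolation between every neighbouring pair produce 2^n − 1 virtual views between the two real cameras. The interpolate_midpoint stub below stands in for one pass of the interpolation network described above.

```python
# Recursive sketch of S500: each level interpolates the midpoint view of every
# neighbouring pair, so n levels yield 2**n - 1 virtual views between two cameras.
def interpolate_midpoint(view_a, view_b):
    # placeholder: in the real system this runs VINet + warping + RefineNet
    return (view_a + view_b) / 2.0


def interpolate_sequence(left, right, levels):
    views = [left, right]
    for _ in range(levels):
        dense = [views[0]]
        for a, b in zip(views, views[1:]):
            dense += [interpolate_midpoint(a, b), b]
        views = dense
    return views[1:-1]                      # drop the two real camera views


if __name__ == "__main__":
    virtual = interpolate_sequence(0.0, 1.0, levels=3)
    print(len(virtual), virtual)            # 7 virtual positions for 3 levels
```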
Fig. 2 depicts the complete architecture of the real-time virtual view interpolation method of an embodiment of the present invention; the three VIBlocks have the same structure but different numbers of input channels. The whole algorithm is lightweight and efficient, runs in real time, and needs only 12 ms to interpolate an intermediate virtual view at 720p resolution.
Based on the same inventive concept, the present invention further provides a free view angle video method based on view angle interpolation, the flow of which is shown in fig. 4, and the method comprises:
s10: collecting multi-frame color images collected by a plurality of image collectors facing the same calibration object, wherein the frame images collected by the image collectors are quasi-synchronous;
s20: interpolating a dense virtual view in real time by utilizing a neural network based on the acquired frame image;
s30: the acquired frame images and the interpolated frame images are spliced in a self-adaptive manner to form multi-view clusters which are uniformly distributed on a space domain;
s40: coding the multi-view cluster, dividing the multi-view cluster into video segments in a time domain, and transmitting the video segments through an HLS protocol;
s50: and different clients download and switch viewpoints by interactively selecting the view angle clusters to be watched.
S10 is performed in one embodiment of the present invention, in which the image collectors are industrial cameras supporting capture of a scene at 4K/120 FPS. Specifically, there are 12 image collectors arranged along a fixed circular arc with a field-of-view angle of about 70 degrees, and the 12 cameras shown in fig. 5 capture the scene inside it (each square in fig. 5 represents one camera). The image information acquired quasi-synchronously by the 12 cameras is transmitted to the cloud for processing.
In another embodiment of the present invention, S20 is performed: a dense set of virtual views is interpolated in real time with a neural network from the acquired frame images. The videos collected by the 12 cameras are processed in the cloud, including encoding and compression, and then distributed to the edge media server by a Content Distribution Network (CDN). The edge media server runs the real-time virtual view interpolation neural network; adjacent pairs of the 12 received videos are combined to form 11 left-right view pairs, and dense virtual views are then interpolated with the real-time virtual view interpolation method. Specifically, in this embodiment three levels of the virtual view interpolation network are iterated, so a total of 89 views are finally obtained from the 12 camera views; this view density is dense enough relative to the size of the user's field of view, and view switching is smooth for the user.
How to organize this many views in a reasonable way that can handle multi-user scenarios is an important consideration. For this, the present invention provides a preferred embodiment for performing S30: adaptively stitching the acquired frame images and the interpolated frame images into multi-view clusters that are uniformly distributed over the spatial domain. The multiple views are adaptively stitched into one large image at different resolutions: specifically, the interpolated virtual views are at low resolution and the original real views shot by the cameras are at high resolution, and the views are packed into the large image as tiles to form a multi-view cluster. Since there are many interpolated virtual views around each camera, adaptive stitching yields 12 multi-view clusters that are uniformly distributed in space and together cover all 89 views, as shown in fig. 6. It should be noted that the number of views and the resolution within one view cluster are adaptive, the number of interpolated virtual views is adaptively adjusted, and each pair of multi-view clusters has an overlapping area, i.e. some identical virtual views, to ensure continuity when switching between different multi-view clusters.
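A sketch of the adaptive stitching is shown below, assuming a simple packing layout: the full-resolution camera view is placed first, the down-scaled virtual views are appended as tiles, and each tile's rectangle is recorded in a view index. The concrete tile arrangement, resolutions and index format are illustrative assumptions, not the layout of fig. 6.

```python
# Sketch of S30: pack one high-resolution real view and several low-resolution
# virtual views into a single cluster frame and record every tile's position.
import numpy as np


def stitch_cluster(real_view, virtual_views, scale=2):
    h, w, _ = real_view.shape
    vh, vw = h // scale, w // scale
    per_row = w // vw                                    # virtual tiles per extra row
    rows = (len(virtual_views) + per_row - 1) // per_row
    canvas = np.zeros((h + rows * vh, w, 3), dtype=real_view.dtype)
    canvas[:h, :w] = real_view
    index = {"real": (0, 0, h, w)}                       # view id -> (top, left, height, width)
    for i, view in enumerate(virtual_views):
        small = view[::scale, ::scale]                   # naive down-scaling for the sketch
        top = h + (i // per_row) * vh
        left = (i % per_row) * vw
        canvas[top:top + vh, left:left + vw] = small
        index[f"virtual_{i}"] = (top, left, vh, vw)
    return canvas, index


if __name__ == "__main__":
    real = np.zeros((720, 1280, 3), dtype=np.uint8)
    virtuals = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(7)]
    cluster, view_index = stitch_cluster(real, virtuals)
    print(cluster.shape, list(view_index)[:3])
```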
To cope with the multi-user scenario, the present invention provides a preferred embodiment for performing S40: encoding the multi-view clusters and dividing them temporally into video segments for transmission via the HLS protocol. In this embodiment a space-time segmentation method is used. The streams of the multiple multi-view clusters are divided into a series of video slices in both the temporal and spatial domains, as shown in fig. 7. Specifically, the 12 adaptively organized multi-view clusters form segments in the spatial domain, and each segment is then sliced in time order. This embodiment uses the edge media server as the media resource repository, whose load is independent of the number of clients; the operation of the clients is described further below. For content transmission, HTTP adaptive streaming protocols divide the video content into video slices of equal duration, where the number of frames in each slice is an integer multiple of the encoded GOP size and the first frame is a full intra reference frame (I-frame), so that each video slice can be decoded independently. The HLS protocol is one such protocol and is well suited to the transmission of spatio-temporal video slices.
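The space-time organization can be summarized by a small addressing sketch: each media item is identified by a (cluster id, temporal segment index) pair. The segment duration and URI pattern below are illustrative assumptions rather than values fixed by this embodiment.

```python
# Sketch of the temporal-spatial slicing in S40: a media item is addressed by the
# multi-view cluster (spatial dimension) and the time slice index (temporal
# dimension). Naming and durations here are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentAddress:
    cluster_id: int      # which multi-view cluster (spatial dimension)
    segment_idx: int     # which time slice (temporal dimension)

    def uri(self) -> str:
        return f"/clusters/{self.cluster_id}/seg_{self.segment_idx:05d}.ts"


def segment_for_time(t_seconds: float, segment_duration: float = 1.0) -> int:
    """Map a playback time to a temporal segment index."""
    return int(t_seconds // segment_duration)


if __name__ == "__main__":
    addr = SegmentAddress(cluster_id=4, segment_idx=segment_for_time(12.3))
    print(addr.uri())    # /clusters/4/seg_00012.ts
```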
In another embodiment of the present invention, S50 is performed: different clients download and switch viewpoints by interactively selecting the view cluster to be watched. In this embodiment, as shown in fig. 8, when a user interactively requests a viewpoint, the corresponding view index is looked up in a global lookup table (the aforementioned global view index, which contains the position information of the different views within a multi-view cluster; the corresponding view tile can be found through this table). If the required view is in the current multi-view cluster, the view extraction unit extracts the corresponding view tile bitstream and decodes it with the video slice decoding unit; if it is not in the current multi-view cluster, switching is temporarily limited at the boundary, but after a short period of time the multi-view cluster video segment containing the required view can be downloaded. The user interaction directly determines the next multi-view cluster video segment to download, namely the multi-view cluster closest to the desired view. The video segment decoding unit decodes the extracted bitstream into YUV format, which is then played by an OpenGL-based video player.
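A client-side sketch of this lookup-and-switch logic is given below: the global view index maps each view id to its multi-view cluster and tile rectangle, and the client either extracts the tile from the cluster it is already playing or schedules the other cluster for the next segment download. All identifiers and the index format are illustrative assumptions.

```python
# Client-side sketch of S50: look up the requested view in the global view index,
# then either extract its tile from the current cluster or switch clusters at the
# next slice boundary. The entries below are illustrative placeholders.
GLOBAL_VIEW_INDEX = {
    # view_id: (cluster_id, (top, left, height, width))
    "cam_03":     (3, (0, 0, 720, 1280)),
    "virtual_25": (3, (720, 0, 360, 640)),
    "virtual_26": (4, (720, 640, 360, 640)),
}


def select_view(requested_view, current_cluster):
    cluster_id, tile_rect = GLOBAL_VIEW_INDEX[requested_view]
    if cluster_id == current_cluster:
        # view is inside the cluster already being played: extract the tile directly
        return {"action": "extract_tile", "tile": tile_rect, "next_cluster": current_cluster}
    # otherwise keep playing the current cluster until the slice ends, then switch
    return {"action": "download_next", "tile": tile_rect, "next_cluster": cluster_id}


if __name__ == "__main__":
    print(select_view("virtual_25", current_cluster=3))
    print(select_view("virtual_26", current_cluster=3))
```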
Based on the same inventive concept, the present invention also provides a multi-user-oriented free-view video system based on view interpolation, comprising:
the system comprises an acquisition module, a calibration module and a control module, wherein the acquisition module is used for acquiring multi-frame color texture images acquired by a plurality of image acquisition devices facing the same calibration object, and the frame images acquired by the image acquisition devices are quasi-synchronous;
the cloud processing and content distribution network module is used for performing cloud processing on the multi-frame color texture images collected by the plurality of image collectors and distributing the multi-frame color texture images to the edge server through the content distribution network;
the edge server module interpolates a dense virtual view in real time by utilizing a neural network based on the acquired frame image, performs self-adaptive splicing on the acquired frame image and the interpolated frame image to form multi-view clusters which are uniformly distributed on a space domain, then encodes the multi-view clusters, segments the multi-view clusters into video segments in a time domain, and transmits the video segments through an HLS protocol;
and the client module is used for enabling a user to download and perform viewpoint switching by interactively selecting the view angle cluster to be watched.
Fig. 9 is a block diagram and architecture diagram of the whole system, which mainly comprises the acquisition module, the cloud processing and content distribution network module, the edge server module and the client module. The function of each module follows the implementation of the multi-user-oriented view-interpolation-based free-view video method described above and is not repeated here. In practice, the multi-user-oriented free-view video system based on view interpolation can be built with a pipeline design: first-in-first-out queue data structures are used for data transfer and thread separation, so all units in the system execute highly in parallel, and the overall latency bottleneck of the system is limited only by the most time-consuming module, giving better real-time performance. In addition, the performance of the whole system can be greatly improved through heterogeneous computing, including CPUs, GPUs and the like.
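A minimal sketch of the queue-based pipeline design is shown below, with each stage running in its own thread and passing frames downstream through FIFO queues; the stage functions are placeholders standing in for the interpolation and stitching units.

```python
# Sketch of the pipelined design: stages run in parallel threads connected by FIFO
# queues, so overall latency is bounded by the slowest stage.
import queue
import threading


def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:                 # shutdown signal, forwarded downstream
            if q_out is not None:
                q_out.put(None)
            break
        result = fn(item)
        if q_out is not None:
            q_out.put(result)


if __name__ == "__main__":
    q1, q2, q3 = queue.Queue(maxsize=4), queue.Queue(maxsize=4), queue.Queue(maxsize=4)
    interpolate = lambda f: f"{f}+virtual"      # stands for the view-interpolation unit
    stitch = lambda f: f"{f}+stitched"          # stands for the adaptive stitching unit
    threads = [
        threading.Thread(target=stage, args=(interpolate, q1, q2)),
        threading.Thread(target=stage, args=(stitch, q2, q3)),
    ]
    for t in threads:
        t.start()
    for i in range(3):
        q1.put(f"frame_{i}")
    q1.put(None)
    for t in threads:
        t.join()
    while not q3.empty():
        item = q3.get()
        if item is not None:
            print(item)                          # frame_0+virtual+stitched, ...
```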
Table 1 reports the performance test results of the real-time virtual view interpolation method of the embodiment of the present invention, including the virtual view synthesis quality, the running speed of the algorithm, and a comparison with some existing state-of-the-art methods. As shown in Table 1, the method was tested in three different scenarios, scenario 1, scenario 2 and scenario 3, which differ in the distance between the cameras, i.e. the camera baselines. The larger the spacing between the left and right cameras, the larger the pixel offsets from the left and right views to the virtual intermediate view that the network has to estimate; this is harder for the network to handle, so the results are slightly worse for larger baselines.
The quality metrics are the commonly used peak signal-to-noise ratio (PSNR), where a larger value means less image distortion; the structural similarity index (SSIM), which quantifies structural similarity between two images, where a larger value means higher similarity; and learned perceptual image patch similarity (LPIPS), which matches human perception better than the previous two, where a lower value means the two images are more similar. The method is evaluated in a version with RefineNet and a version without RefineNet, and compared with the current state-of-the-art methods COLMAP+VSS, LLFF and Deep3DMask; the comparison results are shown in Table 1. It is worth mentioning that the method of the present invention also performs well in running speed: only 6.27 ms and 12.85 ms are needed to synthesize a 720p image, which is fully sufficient for real-time free-view video application scenarios.
Table 1. Performance test results of the virtual view interpolation method
Fig. 10 is an effect diagram of the real-time virtual view interpolation method according to the embodiment of the present invention, showing the final output of the method of the present invention and of the comparison methods. As can be seen from fig. 10, the real-time virtual view interpolation method of this embodiment achieves good virtual view synthesis quality.
Table 2. Performance test results of the free-view video system
Table 2 reports the performance test results of the pipeline of the multi-user-oriented view-interpolation-based free-view video system of the embodiment of the present invention, giving the delay of each unit of the free-view video system and the average delay over 1000 frames. As shown in Table 2, the system has good real-time performance. It should be noted that the clients used for testing in this embodiment were a PC client and a mobile phone client, but in theory the system of the present invention can support the access of any number of clients.
It should be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The above-described preferred features may be used in any combination without conflict with each other.

Claims (12)

1. A method for interpolating a real-time virtual view, comprising:
S1, acquiring bidirectional optical flows F_1, F_2 between the first color texture image and the second color texture image and a virtual third color texture image at the view position requiring interpolation in space, and a visibility mask matrix describing how visible the first and second color texture images are in the virtual third color texture image;
s2 warping the first color texture image and the second color texture image to the virtual third color texture image position, respectively, based on the bi-directional optical flow, and obtaining a primary virtual third color texture image at the virtual third color texture image position based on the visibility mask;
s3, optimizing the primary virtual third color texture image to obtain a final interpolated virtual third color texture image;
S4, repeating iterations of S1-S3 to interpolate arbitrarily dense virtual views at an exponential rate.
2. The method of claim 1, wherein the obtaining of the bidirectional optical flows and the visibility mask matrix between the first color texture image and the second color texture image and the virtual third color texture image at the view position requiring interpolation in space comprises:
taking the first color texture image and the second color texture image as input to the VINet network, down-sampling them to one quarter of the original image size, and having a first-level VIBlock calculate a low-resolution initial optical flow and mask, where the optical flows and the mask form tensors on different channels of the VIBlock output;
up-sampling the initial optical flow and mask to one half of the original image size, and having a second-level VIBlock estimate the residuals of the optical flow and mask at this resolution;
up-sampling the optical flow and mask refined by the second-level residual to the original image size, and having a third-level VIBlock calculate the residuals of the bidirectional optical flows F_1, F_2 and of the mask.
3. The method of claim 1, wherein the warping of the first color texture image and the second color texture image to the virtual third color texture image position based on the bidirectional optical flows, and the obtaining of the primary virtual third color texture image at that position based on the visibility mask, comprise:
warping the pixels of the first color texture image to the virtual third color texture image pixel positions based on the optical flow F_1 to obtain a warped color texture image I_{l→m}, with the warping formula:
I_{l→m}(u, v) = I_1(x, y), where (u, v) = (x, y) + F_1(x, y),
wherein x, y represent the two-dimensional coordinates of a first color texture image pixel, and u, v represent the corresponding two-dimensional coordinates of the third color texture image pixel;
warping the pixels of the second color texture image to the virtual third color texture image pixel positions based on the optical flow F_2 to obtain a warped color texture image I_{r→m}, with the warping formula:
I_{r→m}(u, v) = I_2(x, y), where (u, v) = (x, y) + F_2(x, y);
weighting and summing the two warped color texture images I_{l→m}, I_{r→m} with the visibility mask M as the weight to compute the value of each pixel, obtaining the virtual third color texture image I_mid, with the formula:
I_mid = M ⊙ I_{l→m} + (1 − M) ⊙ I_{r→m}.
4. the method according to claim 1, wherein the optimizing the virtual third color texture image to obtain a final interpolated intermediate virtual view includes:
extracting high-order context information of the first color texture image and the second color texture image through a convolutional neural network;
and inputting the high-order context information as reference information into a sub-convolutional neural network to obtain the residual of the final virtual third color texture image, which reduces the sensitivity of the whole algorithm to the estimated optical flows, and adding the preliminarily obtained virtual third color texture image and the residual to refine the quality of the final interpolated intermediate virtual view.
5. The method of claim 1, wherein the repeating of iterations S1-S3 to interpolate arbitrarily dense virtual views at an exponential rate comprises:
interpolating the first color texture image and the second color texture image to obtain a virtual third color texture image;
interpolating the first color texture image and the virtual third color texture image to obtain a virtual fourth color texture image;
interpolating the second color texture image and the virtual third color texture image to obtain a virtual fifth color texture image;
repeating the iteration, with n stages of exponential interpolation yielding 2^n − 1 virtual views.
6. A free visual angle video method facing multiple users and based on visual angle interpolation is characterized by comprising the following steps:
acquiring multi-frame color images collected by a plurality of image collectors facing the same calibration object;
combining the multi-frame color images pairwise;
combining each two of the above, and interpolating in real time by using the method of any one of claims 1 to 5 to obtain a virtual view;
adaptively splicing the multi-frame color images and the virtual views to form multi-view clusters in the spatial domain;
encoding the multi-view clusters, dividing them into video segments in the time domain, and transmitting the video segments;
and different clients downloading and switching viewpoints by interactively selecting the view clusters to be watched.
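A minimal server-side sketch of the pipeline in this claim, with all stage callables assumed rather than specified by the claim:

```python
from itertools import pairwise  # Python 3.10+

def build_and_segment_clusters(camera_frames, interpolate, stitch, encode_and_segment):
    """Server-side sketch of the claimed pipeline; the callables are illustrative."""
    # combine the captured frames pairwise and interpolate virtual views in real time
    virtual_views = [interpolate(a, b) for a, b in pairwise(camera_frames)]
    # adaptively splice captured and virtual views into per-camera multi-view clusters
    clusters = [stitch(frame, virtual_views) for frame in camera_frames]
    # encode each cluster and cut it into time-domain segments for delivery to clients
    return [encode_and_segment(cluster) for cluster in clusters]
```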
7. The multi-user-oriented view interpolation-based free-view video method according to claim 6, wherein the adaptively splicing the multi-frame color image and the virtual view to form a multi-view cluster in a spatial domain comprises:
a plurality of interpolated virtual views exist around each image collector;
splicing the plurality of virtual views and the color texture image collected by the image collector into one color texture image of overall size, wherein the color texture image collected by the image collector is kept at high resolution and the color texture images at the virtual views are kept at low resolution;
each view exists in the multi-view cluster in tile form, and overlapping areas between multi-view clusters guarantee continuity of switching.
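A minimal NumPy/OpenCV sketch of such adaptive splicing, assuming one full-resolution camera view stacked above a strip of down-scaled virtual views; the tile layout and the 0.5 scale factor are assumptions:

```python
import numpy as np
import cv2  # used only for resizing; the toolchain is an assumption

def stitch_cluster(camera_view, virtual_views, virt_scale=0.5):
    """Splice one full-resolution camera view with its down-scaled virtual neighbours
    into a single cluster frame laid out as tiles (layout is illustrative)."""
    h, w = camera_view.shape[:2]
    small = [cv2.resize(v, (int(w * virt_scale), int(h * virt_scale)))
             for v in virtual_views]
    strip = np.concatenate(small, axis=1)            # low-res virtual views side by side
    pad = np.zeros((strip.shape[0], max(0, w - strip.shape[1]), 3),
                   dtype=camera_view.dtype)
    strip = np.concatenate([strip, pad], axis=1)[:, :w]   # match the cluster width
    return np.concatenate([camera_view, strip], axis=0)   # high-res tile on top
```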
8. The multi-user-oriented view interpolation-based free-view video method according to claim 6, wherein the encoding the multi-view cluster, and then dividing the multi-view cluster into video segments in time domain for transmission comprises:
for each multi-view cluster, performing time-domain slicing in temporal order after encoding;
each slice having a fixed duration, so as to form a plurality of video segments jointly distributed over the time domain and the spatial domain;
the video segments are transmitted using a streaming protocol.
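A minimal sketch of the encoding and fixed-duration segmentation, assuming an ffmpeg-based toolchain and HLS packaging as named in claims 10 and 11; the 2-second segment length is an assumption:

```python
import subprocess

def encode_and_segment(cluster_yuv_path, width, height, fps, out_playlist,
                       segment_seconds=2):
    """Encode one raw multi-view cluster stream and cut it into fixed-duration
    HLS segments plus a playlist."""
    subprocess.run([
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "yuv420p",
        "-s", f"{width}x{height}", "-r", str(fps), "-i", cluster_yuv_path,
        "-c:v", "libx264", "-preset", "veryfast",
        "-f", "hls", "-hls_time", str(segment_seconds), "-hls_list_size", "0",
        out_playlist,
    ], check=True)
```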
9. The multi-user-oriented view interpolation-based free-view video method according to claim 6, wherein the downloading and viewpoint switching by interactively selecting view clusters to be watched at different clients comprises:
each user interactively selects a view to be watched, downloads the corresponding view cluster segment according to the global view index, and can switch among the plurality of views of that view cluster within one video slice duration;
after one video slice duration, the user may select other view clusters to download.
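A minimal sketch of mapping a selected view to the cluster segment to download; the URL scheme and index tables are illustrative assumptions:

```python
def segment_url(base_url, cluster_of, selected_view, segment_index):
    """Map the user's selected view to the cluster segment that contains it.
    `cluster_of` maps a global view index to a cluster id (an assumed table)."""
    cluster_id = cluster_of[selected_view]
    return f"{base_url}/cluster_{cluster_id}/seg_{segment_index}.ts"
```

Within one slice duration the client can switch locally among the views tiled in the downloaded cluster; only when the requested view lies outside that cluster does it fetch another cluster's segment at the next slice boundary.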
10. A multi-user oriented view interpolation based free-view video system, comprising:
an acquisition module, a calibration module and a control module, the acquisition module being configured to acquire multi-frame color texture images collected by a plurality of image acquisition devices facing the same calibration object, wherein the frame images collected by the image acquisition devices are quasi-synchronous;
a cloud processing and content distribution network module, configured to perform cloud processing on the multi-frame color texture images collected by the plurality of image acquisition devices and to distribute them to edge servers through the content distribution network;
an edge server module, which interpolates dense virtual views in real time using a neural network based on the acquired frame images, adaptively splices the acquired frame images and the interpolated frame images to form multi-view clusters uniformly distributed over the spatial domain, encodes the multi-view clusters, divides them into video segments in the time domain, and transmits the video segments through the HLS protocol;
and a client module, configured to allow a user to interactively select the view cluster to be watched, to download it, and to switch viewpoints.
11. The multi-user oriented view interpolation based free-view video system of claim 10, wherein the edge server module comprises:
a view interpolation unit, which interpolates dense virtual views in real time using a neural network based on the collected frame images;
an adaptive view splicing unit, which adaptively splices the acquired frame images and the interpolated frame images to form multi-view clusters uniformly distributed over the spatial domain;
an encoding unit, which encodes and compresses the multi-view cluster frames to reduce the data volume;
and an HLS transmission unit, which divides the encoded multi-view clusters into video segments in the time domain and transmits them through the HLS protocol.
12. The multi-user oriented view interpolation based free-view video system of claim 10, wherein the client module comprises:
a user interaction and display unit, which provides a viewing and interaction interface for the user; the user can generate interaction signaling through a key press or a screen swipe, and the view image selected through the signaling is rendered on a display;
a video slice downloading unit, which, according to the interaction signaling, selects and downloads the multi-view cluster video slices provided by the edge server that contain the corresponding views;
a video slice decoding unit, which decodes the downloaded video slices into raw YUV format files;
and a view extraction unit, which extracts the view image tile to be viewed from the multi-view cluster.
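A minimal sketch of the view extraction unit's tile cropping, assuming a known mapping from view identifiers to tile rectangles in the decoded cluster frame:

```python
def extract_view_tile(cluster_frame, tile_rects, view_id):
    """Crop the tile of the requested view from a decoded multi-view cluster frame.
    `tile_rects` maps a view id to its (x, y, w, h) layout, which is an assumption."""
    x, y, w, h = tile_rects[view_id]
    return cluster_frame[y:y + h, x:x + w]
```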
CN202210419565.8A 2022-04-20 2022-04-20 Multi-user free visual angle video method and system based on real-time virtual visual angle interpolation Pending CN114897681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210419565.8A CN114897681A (en) 2022-04-20 2022-04-20 Multi-user free visual angle video method and system based on real-time virtual visual angle interpolation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210419565.8A CN114897681A (en) 2022-04-20 2022-04-20 Multi-user free visual angle video method and system based on real-time virtual visual angle interpolation

Publications (1)

Publication Number Publication Date
CN114897681A true CN114897681A (en) 2022-08-12

Family

ID=82718477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210419565.8A Pending CN114897681A (en) 2022-04-20 2022-04-20 Multi-user free visual angle video method and system based on real-time virtual visual angle interpolation

Country Status (1)

Country Link
CN (1) CN114897681A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117596373A (en) * 2024-01-17 2024-02-23 淘宝(中国)软件有限公司 Method for information display based on dynamic digital human image and electronic equipment
CN117596373B (en) * 2024-01-17 2024-04-12 淘宝(中国)软件有限公司 Method for information display based on dynamic digital human image and electronic equipment

Similar Documents

Publication Publication Date Title
US10460509B2 (en) Parameterizing 3D scenes for volumetric viewing
Azevedo et al. Visual distortions in 360° videos
EP3793205B1 (en) Content based stream splitting of video data
US20180189980A1 (en) Method and System for Providing Virtual Reality (VR) Video Transcoding and Broadcasting
KR101340911B1 (en) Efficient encoding of multiple views
KR101385514B1 (en) Method And Apparatus for Transforming Stereoscopic Image by Using Depth Map Information
US20090129667A1 (en) Device and method for estimatiming depth map, and method for generating intermediate image and method for encoding multi-view video using the same
EP2850592B1 (en) Processing panoramic pictures
KR20090071624A (en) Image enhancement
CN111602403B (en) Apparatus and method for generating image data bit stream
KR102641527B1 (en) image composition
JP2022548853A (en) Apparatus and method for evaluating quality of image capture of a scene
CN114897681A (en) Multi-user free visual angle video method and system based on real-time virtual visual angle interpolation
Simone et al. Omnidirectional video communications: new challenges for the quality assessment community
CN113963094A (en) Depth map and video processing and reconstruction method, device, equipment and storage medium
KR100914636B1 (en) A method of transmitting a visual communication signal, a transmitter for transmitting a visual communication signal and a receiver for receiving a visual communication signal
RU2732989C2 (en) Method, device and system for generating a video signal
Cao et al. A flexible client-driven 3DTV system for real-time acquisition, transmission, and display of dynamic scenes
JP7326457B2 (en) Apparatus and method for generating image signals
Hu et al. A multi-user oriented live free-viewpoint video streaming system based on view interpolation
JP3532823B2 (en) Image composition method and medium recording image composition program
EP4254958A1 (en) Compression of depth maps
RU2778456C2 (en) Device and method for formation of binary image data flow
EP4246988A1 (en) Image synthesis
CN117596373B (en) Method for information display based on dynamic digital human image and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination