HK1237168B - Video transmission based on independently encoded background updates - Google Patents

Video transmission based on independently encoded background updates

Info

Publication number
HK1237168B
Authority
HK
Hong Kong
Prior art keywords
video
background
encoder
decoder
scene
Application number
HK17110917.0A
Other languages
Chinese (zh)
Other versions
HK1237168A1 (en)
Inventor
J‧T‧科内柳森
A‧埃克内斯
H‧P‧阿尔斯塔德
S‧O‧埃瑞克森
E‧肖
Original Assignee
哈德利公司
Application filed by 哈德利公司
Publication of HK1237168A1
Publication of HK1237168B

Description

Video transmission based on independently encoded background updates

Technical Field

The present application relates generally to video transmission. Specifically, it relates to apparatus and methods for alleviating the bandwidth limitations of video transmission and improving the quality of video at a receiver. More specifically, improved video transmission systems and methods are provided for generating high-resolution video at a receiver based on an independently encoded background and background updates.

Background Art

Real-time video communication systems and the emerging field of telepresence face inherent challenges as they attempt to simulate, for remote users, the experience of being present in another physical space. This is because the human eye, with its ability to fixate its high-resolution fovea on an object of interest, retains an enormous advantage in field of view over commercially available single-lens cameras with state-of-the-art resolution. See http://www.clarkvision.com/imagedetail/eye-resolution.html (estimating the resolution of the human eye at 576 megapixels over a 120-degree field). Furthermore, telepresence systems are limited in practice by the network bandwidth available to most users. It is therefore not surprising that, beyond simple person-to-person video chat using the narrow-field-of-view cameras built into most tablets, phones, and laptops, telepresence has seen only limited uptake.

Automatic and manual pan-tilt-zoom (PTZ) cameras in commercial telepresence systems attempt to overcome the resolution limitation of single-lens cameras by optically and mechanically fixing the field of view on a selected portion of interest in the scene. This partially mitigates the resolution limitation but still has drawbacks. For example, only one mechanical fixation is possible at a given time; multiple remote users with different points of interest therefore may not be served satisfactorily. Furthermore, the zoom lens and the mechanical pan-tilt mechanism increase the cost of the camera system and pose new reliability challenges for the overall system. That is, automatic PTZ systems place higher demands on the mechanics than manual systems, which typically sustain fewer move cycles over their service life. The bandwidth required for high-quality video encoding also increases significantly compared to a fixed camera. Similarly, the digital PTZ found in some existing systems suffers from many of the drawbacks described above, including, for example, the inability to be controlled by multiple users at the far end and high bit-rate requirements for video encoding.

Panoramic and ultra-wide-angle video cameras could meet the resolution requirements of telepresence systems to provide an ideal user experience. These cameras have the potential to grow in sensor resolution and pixel rate far beyond existing standards. This can be achieved, for example, through curved sensor surfaces and monocentric lens designs. See http://www.jacobsschool.ucsd.edu/news/news_releases/release.sfe?id=1418 (discussing a 120-degree FOV imager with a resolution of at least 85 megapixels); http://image-sensors-world.blogspot.co.il/2014/04/vlsi-symposia-sony-presents-curved.html (a sensor manufacturer unveils a prototype curved image sensor). However, such designs would place enormous strain on current network capacity and video coding efficiency, making them impractical for widespread real-world deployment. For example, an 85-megapixel camera running at 30 frames per second would require compression as low as 0.0002 bits per pixel to fit a 10 megabit per second (Mbit/s) link. This is generally unattainable today, considering that current video compression standards such as H.264 run at about 0.05 bits per pixel under good conditions.

Therefore, there is a need for improved methods and systems that alleviate the bandwidth limitations of video transmission and generate high-resolution video based on conventional camera hardware. There is a further need to leverage these improvements to enable modern real-time communication systems and the ideal telepresence experience.

Summary of the Invention

Accordingly, it is an object of the present application to provide methods and systems for alleviating the bandwidth limitations on video transmission, thereby generating wide-angle, high-resolution video using conventional hardware.

In particular, according to the present application, in one embodiment, a method for transmitting a video of a scene is provided, the method comprising: 1) initializing a background model by determining the static background of the scene from the video; and 2) transmitting the background of the scene as the background model by encoding the background model independently of the video. The background model is incrementally updated, and the updates are further encoded and transmitted independently of the video.
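As a rough illustration of the two steps above, the sketch below maintains a per-pixel background model that is initialized from early frames and then updated incrementally; the small delta produced by each update is what would be encoded and sent independently of the video. The running-average update rule and flat-list frame format are illustrative assumptions, not the patent's exact model.

```python
# Minimal per-pixel background model (illustrative running average,
# not the patent's exact algorithm). Frames are flat lists of
# grayscale pixel values.

def init_background(frames):
    """Initialize the model as the per-pixel mean of the first frames."""
    n = len(frames)
    return [sum(px) / n for px in zip(*frames)]

def incremental_update(model, frame, alpha=0.05):
    """Blend a new frame into the model and return the delta that
    would be encoded and transmitted independently of the video."""
    delta = [alpha * (f - m) for m, f in zip(model, frame)]
    new_model = [m + d for m, d in zip(model, delta)]
    return new_model, delta

model = init_background([[10, 20, 30], [12, 22, 32]])
print(model)  # [11.0, 21.0, 31.0]
model, delta = incremental_update(model, [11, 21, 51])
print([round(m, 2) for m in model])  # [11.0, 21.0, 32.0]
```

Because the delta is small and mostly zero for a static scene, it compresses well and can be sent at a much lower rate than the video itself.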

In another embodiment, the method further comprises generating an enhanced video at a receiver by merging the background with the video. In yet another embodiment, the background model is updated and transmitted at a bit rate lower than that of the video. In a further embodiment, the method further comprises transmitting, for each frame, the geometric mapping between the background and the video.
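One way to picture the merge and the per-frame geometric mapping is sketched below: the mapping (here simply a row/column offset, an illustrative assumption) tells the receiver where the transmitted video sits inside the larger background, and the enhanced frame is the background with the video pasted over it.

```python
# Paste a "salient" video frame into the high-resolution background
# at the position given by the per-frame geometric mapping.
# Frames are 2-D lists of rows; mapping = (row_offset, col_offset).

def merge_frame(background, video, mapping):
    row_off, col_off = mapping
    merged = [row[:] for row in background]        # copy the background
    for r, row in enumerate(video):
        for c, px in enumerate(row):
            merged[row_off + r][col_off + c] = px  # overlay the video
    return merged

bg = [[0] * 4 for _ in range(4)]
video = [[9, 9], [9, 9]]
print(merge_frame(bg, video, (1, 2)))
# [[0, 0, 0, 0], [0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0]]
```

Sending the mapping per frame lets the transmitted window move across the scene while the receiver keeps the merge consistent.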

In another embodiment, the method further comprises determining the field of view of the video by scene analysis. In yet another embodiment, the background model is used to suppress noise variations in the background of the video.

According to one embodiment, the method of the present application further comprises compressing the video with a standard video codec. In another embodiment, the video codec is one of H.264, H.265, VP8, and VP9. In yet another embodiment, the background is sent in an auxiliary data channel defined by one of H.264, H.265, VP8, and VP9.
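The auxiliary-channel idea can be sketched as simple type-length-value framing (the tag values below are invented for illustration; real codecs use their own SEI/NAL or container syntax): a decoder that does not recognize the background tag can skip its payload by length, which is what makes the scheme backward compatible.

```python
import struct

BACKGROUND_TAG = 0x42   # illustrative tag, not a real H.264 SEI payload type

def wrap_aux(tag, payload):
    """Frame an auxiliary payload as tag (1 byte) + length (4 bytes) + data."""
    return struct.pack(">BI", tag, len(payload)) + payload

def read_stream(data, known_tags):
    """Return payloads for known tags; skip unknown ones by their length."""
    out, i = [], 0
    while i < len(data):
        tag, length = struct.unpack_from(">BI", data, i)
        i += 5
        if tag in known_tags:
            out.append((tag, data[i:i + length]))
        i += length   # unknown tags are bypassed untouched
    return out

stream = wrap_aux(BACKGROUND_TAG, b"bgdata") + wrap_aux(0x99, b"??")
print(read_stream(stream, {BACKGROUND_TAG}))  # [(66, b'bgdata')]
```

A legacy decoder corresponds to calling `read_stream` with an empty set of known tags: it simply bypasses everything.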

According to another embodiment, the background model is a parametric model. In a further embodiment, the parametric model is a mixture of Gaussians (MOG).

According to yet another embodiment, the background model is a non-parametric model. In a further embodiment, the non-parametric model is the visual background extractor (ViBe).

According to another embodiment of the present application, a method for simulating pan-tilt-zoom operations on a video of a scene is provided, the method comprising: 1) initializing a background model by determining the static background of the scene from the video; 2) transmitting the background of the scene as the background model by encoding the background model independently of the video, wherein the background model is incrementally updated, wherein the updates are further encoded and transmitted independently of the video, and wherein the geometric mapping between the background and the video is transmitted for each frame; 3) selecting one or more fields of view of the video by scene analysis; and 4) generating an enhanced video at a receiver by merging the background with the video.
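The pan-tilt-zoom simulation itself can be pictured as a pure crop-and-scale on the (merged) panoramic frame, with no mechanics involved. The sketch below uses nearest-neighbour sampling purely for illustration; a real implementation would interpolate and account for lens projection.

```python
# Simulated PTZ: select a window of the panoramic frame (pan/tilt)
# and rescale it to the output size (zoom), nearest-neighbour style.

def simulated_ptz(frame, top, left, height, width, out_h, out_w):
    window = [row[left:left + width] for row in frame[top:top + height]]
    return [
        [window[r * height // out_h][c * width // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

pano = [[10 * r + c for c in range(6)] for r in range(4)]
# Pan/tilt to the 2x2 region at (row 1, col 2), then zoom it to 4x4.
print(simulated_ptz(pano, 1, 2, 2, 2, 4, 4))
```

Because the crop parameters are just numbers, each receiving user can hold a different view, and switching between views is instantaneous.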

In another embodiment, the method further comprises controlling the simulated pan-tilt-zoom operations at the receiver. In yet another embodiment, the method further comprises controlling the simulated pan-tilt-zoom operations at the transmitter of the video.

According to yet another embodiment of the present application, a system for transmitting a video of a scene is provided, the system comprising: 1) a transmitter comprising an outer encoder and a core encoder, wherein the outer encoder is adapted to receive the video and to output, separately, a salient video and the background and geometry bitstreams to the core encoder, and wherein the core encoder is adapted to output an encoded bitstream; and 2) a receiver comprising a core decoder, wherein the core decoder is adapted to receive the encoded bitstream and to output the salient video.

According to a further embodiment of the present application, a system for transmitting a video of a scene is provided, the system comprising: 1) a transmitter comprising an outer encoder and a core encoder, wherein the outer encoder is adapted to receive the video and to output, separately, a salient video and the background and geometry bitstreams to the core encoder, and wherein the core encoder is adapted to output an encoded bitstream; and 2) a receiver comprising a core decoder and an outer decoder, wherein the core decoder is adapted to receive the encoded bitstream and to output, separately, the salient video and the background and geometry bitstreams to the outer decoder, and wherein the outer decoder is adapted to merge the salient video with the background and geometry bitstreams, thereby outputting an enhanced video of the scene.

In another embodiment, the outer encoder further comprises a background estimation unit adapted to initialize a background model by determining the static background of the scene from the video and to update the background model incrementally at a bit rate lower than that of the video. In yet another embodiment, the outer encoder further comprises a background encoder connected to the background estimation unit. The background encoder is adapted to encode the background model and the updates independently of the video. In a further embodiment, the background encoder comprises an entropy encoder, an entropy decoder, an update prediction unit, and an update storage unit.

According to another embodiment, the background encoder is connected in the downstream direction to a bitstream multiplexer. In yet another embodiment, the outer encoder further comprises a saliency framing unit adapted to output a geometry bitstream to the bitstream multiplexer. The bitstream multiplexer is adapted to merge the geometry bitstream and the background bitstream, thereby outputting the background and geometry bitstreams.
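A minimal picture of the bitstream multiplexer and its receiver-side counterpart (the framing format is invented for illustration): chunks of the geometry and background bitstreams are interleaved into one "background and geometry" stream, which the demultiplexer splits back into the two elementary streams.

```python
import struct

GEOMETRY, BACKGROUND = 0, 1   # illustrative stream identifiers

def mux(chunks):
    """chunks: list of (stream_id, bytes) -> one multiplexed byte string."""
    out = b""
    for sid, data in chunks:
        out += struct.pack(">BI", sid, len(data)) + data
    return out

def demux(stream):
    """Split the multiplexed stream back into per-stream byte strings."""
    parts, i = {GEOMETRY: b"", BACKGROUND: b""}, 0
    while i < len(stream):
        sid, length = struct.unpack_from(">BI", stream, i)
        i += 5
        parts[sid] += stream[i:i + length]
        i += length
    return parts

muxed = mux([(GEOMETRY, b"map0"), (BACKGROUND, b"tile0"), (GEOMETRY, b"map1")])
print(demux(muxed))  # {0: b'map0map1', 1: b'tile0'}
```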

In a further embodiment, the outer encoder further comprises a downscaling unit capable of scaling and cropping the video. The downscaling unit is connected in the downstream direction to a noise suppression unit. The noise suppression unit is adapted to suppress noise in the salient video based on the background model.

According to another embodiment, the outer decoder further comprises: i) a bitstream demultiplexer adapted to receive the background and geometry bitstreams from the core decoder and to output, separately, the geometry bitstream and the background bitstream; ii) a background decoder connected to the bitstream demultiplexer and adapted to receive the background bitstream; and iii) a background merging unit connected in the downstream direction to the bitstream demultiplexer and the background decoder. The background merging unit is adapted to receive the salient video from the core decoder and to merge the geometry bitstream and the background bitstream with the salient video, thereby producing an enhanced video of the scene.

In yet another embodiment, the background decoder comprises an entropy decoder, an update prediction unit, and an update storage unit.
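These decoder components can be pictured as a predict-and-correct loop, sketched below under the assumption that each update arrives as a residual against a prediction made from stored state (the entropy coding itself is elided):

```python
# Background decoder sketch: the update-storage unit keeps the current
# background state, the update-prediction unit predicts the next value
# from it, and the entropy-decoded residual corrects the prediction.

class BackgroundDecoder:
    def __init__(self, size):
        self.stored = [0.0] * size          # update storage unit

    def predict(self):
        return list(self.stored)            # update prediction unit

    def apply_update(self, residuals):
        """Apply one entropy-decoded residual update to the model."""
        predicted = self.predict()
        self.stored = [p + r for p, r in zip(predicted, residuals)]
        return self.stored

dec = BackgroundDecoder(3)
print(dec.apply_update([11.0, 21.0, 31.0]))   # [11.0, 21.0, 31.0]
print(dec.apply_update([0.0, 0.0, 1.0]))      # [11.0, 21.0, 32.0]
```

The same predict/store pair inside the background encoder keeps the encoder's state in lockstep with the decoder, so residuals stay small.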

In a further embodiment, the outer decoder further comprises a virtual pan-tilt-zoom unit capable of receiving control input to produce the enhanced video.

According to another embodiment, the core encoder in the system of the present application is an H.264/H.265 video encoder, and the background and geometry bitstreams are carried in the network abstraction layer of the H.264/H.265 video encoder. In yet another embodiment, the core decoder in the system of the present application is an H.264/H.265 video decoder, and the background and geometry bitstreams are carried in the network abstraction layer of the H.264/H.265 video decoder.

In a further embodiment, the core encoder is in a multimedia container format, and the background and geometry bitstreams are carried in an auxiliary data channel of the core encoder. In another embodiment, the core decoder is in a multimedia container format, and the background and geometry bitstreams are carried in an auxiliary data channel of the core decoder.

According to yet another embodiment, the core encoder in the system of the present application is a standard video encoder, and the background and geometry bitstreams are carried in an auxiliary data channel of the core encoder. In a further embodiment, the core decoder is a standard video decoder, and the background and geometry bitstreams are carried in an auxiliary data channel of the core decoder.

According to another embodiment of the present application, a method for transmitting and presenting a video of a scene from multiple fields of view is provided, the method comprising: (1) initializing a three-dimensional background model by determining the static background of the scene from the video; (2) transmitting the background of the scene as the background model by encoding the background model independently of the video, wherein the background model is incrementally updated, and wherein the updates are further encoded and transmitted independently of the video; and (3) presenting an enhanced video at a receiver by merging the background with the video.

In yet another embodiment, the receiver is a VR/AR device. In a further embodiment, the method further comprises: self-learning a region of interest from the gaze direction of the VR/AR receiver; and transmitting a high-resolution video of the region of interest, wherein the enhanced video is created by merging the high-resolution video of the region of interest with the background.

According to another embodiment of the present application, a system for transmitting and presenting a video of a scene from multiple fields of view is provided, the system comprising: (1) a transmitter comprising an outer encoder and a core encoder, wherein the outer encoder is adapted to receive the video and to output, separately, a salient video and the three-dimensional background and geometry bitstreams to the core encoder, and wherein the core encoder is adapted to output an encoded bitstream; and (2) a VR/AR receiver comprising a core decoder and an outer decoder, wherein the core decoder is adapted to receive the encoded bitstream and to output, separately, the salient video and the background and geometry bitstreams to the outer decoder, and wherein the outer decoder is adapted to merge the salient video with the background and geometry bitstreams, thereby presenting an enhanced video of the scene. In another embodiment, the three-dimensional background model is updated incrementally.

In yet another embodiment, the outer encoder comprises a background estimation unit adapted to initialize a three-dimensional background model by determining the static background of the scene from the video and to update the background model incrementally at a bit rate lower than that of the video.

In a further embodiment, the system further comprises a video source for capturing the scene. In another embodiment, the video source comprises one or more cameras with partially overlapping fields of view. In yet another embodiment, the cameras are moving cameras. In a further embodiment, the system is adapted to estimate the moving and stationary parts of the scene. In another embodiment, the outer encoder comprises a background estimation unit adapted to generate a three-dimensional background model based on the stationary parts of the scene and to update the background model incrementally at a bit rate lower than that of the video.

In a further embodiment, the moving cameras are PTZ cameras. In another embodiment, the VR/AR receiver is adapted to self-learn a region of interest from its gaze direction, and the one or more PTZ cameras are adapted to capture high-resolution video of the region of interest.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a video transmission system according to an embodiment of the present application.

FIG. 2 shows an outer encoder of a video transmission system according to another embodiment.

FIG. 3 shows an outer decoder of a video transmission system according to another embodiment.

FIG. 4 shows an H.264/H.265 core encoder of a video transmission system according to another embodiment.

FIG. 5 shows an H.264/H.265 core decoder of a video transmission system according to another embodiment.

FIG. 6 shows a multimedia container format core encoder of a video transmission system according to another embodiment.

FIG. 7 shows a multimedia container format core decoder of a video transmission system according to another embodiment.

FIG. 8 shows a standard video encoder with an auxiliary data channel as the core encoder of a video transmission system according to another embodiment.

FIG. 9 shows a standard video decoder with an auxiliary data channel as the core decoder of a video transmission system according to another embodiment.

FIG. 10 shows a background encoder in a video transmission system according to another embodiment.

FIG. 11 shows a background decoder in a video transmission system according to another embodiment.

DETAILED DESCRIPTION

The methods and systems according to various embodiments of the present application employ a background model based on which the background of the scene in a video is encoded and incrementally updated. The encoded background and updates are transmitted independently of the video. At the receiver, the background can then be merged with the video to produce an enhanced, high-resolution video.

Method Overview

In one embodiment, for example, a video of a scene including both foreground and background is transmitted. A standard video codec such as H.264 compresses it. The static background of the scene is transmitted as a background model, which is incrementally updated at a bit rate lower than that of the video. The background model is generated and initialized from the static background of the video based on established surveillance-system techniques.

In an alternative embodiment, multiple cameras with partially overlapping fields of view are deployed as the video source, generating one or more synchronized and coordinated video streams for transmission and presentation. In some embodiments, such a video source includes moving cameras. The moving and stationary parts of the scene are estimated from the video streams, and a three-dimensional background model is then generated from the stationary parts of the images.

In another embodiment, the field of view of the transmitted video is automatically restricted by scene analysis, for example, limiting it to human subjects, in order to make better use of the resolution of the video format. According to this embodiment, the exact spatial relationship between the video and the background is transmitted for each frame.

In yet another embodiment, the background model is used to suppress spurious noise in the background of the video. The background model data and other related information are sent in an auxiliary data channel defined by a video standard such as H.264. This background and related data can be ignored and bypassed by decoders that are not configured to interpret the data carried in the auxiliary data channel. The system according to this embodiment therefore offers the flexibility to integrate with legacy and existing systems.

In some embodiments, at the receiver, the output of the background model is merged with the video to produce an enhanced video. In particular embodiments, PTZ operations are simulated on the enhanced video at the receiver. According to one embodiment, the simulated PTZ operations are controlled at the transmitter or at the receiver. According to alternative embodiments, this control is exercised by a user or by automatic processing at the transmitter or receiver.

Background Processing

Some existing video encoders apply foreground-background segmentation, in which the background is subtracted from the video before encoding and is sent separately. According to one embodiment of the present application, the video of both foreground and background is encoded using a standard video encoder such as H.264 or H.265. In this embodiment, spurious noise in the background is suppressed by comparing the input video pixels with the predicted pixel states of the background model. The video encoder is thus presented with a nearly still image in the background regions. The background model is sent, and incrementally updated, in an auxiliary channel of the standard codec. The background transmission method according to this embodiment therefore relaxes the bandwidth requirements for video transmission and enables high-resolution video to be presented at the receiver by merging the background updates with the video.
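The noise-suppression step described here can be sketched as a per-pixel comparison against the background model's predicted value: pixels within a threshold of the prediction are snapped to it, so the encoder sees a nearly still background, while genuine foreground changes pass through. The threshold value is an arbitrary choice for illustration.

```python
# Suppress spurious background noise: pixels close to the background
# model's prediction are replaced by the prediction, so the encoder
# sees a nearly still image in background regions. Threshold is
# illustrative, not from the patent.

def suppress_noise(frame, background, threshold=4):
    return [
        b if abs(f - b) <= threshold else f
        for f, b in zip(frame, background)
    ]

background = [100, 100, 100, 100]
frame = [102, 97, 140, 100]   # small flicker vs. a real foreground change
print(suppress_noise(frame, background))  # [100, 100, 140, 100]
```

Stabilized background pixels cost the core encoder almost nothing, which is what frees bandwidth for the salient video.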

According to one embodiment, the video is decoded by a standard decoder that has no knowledge of the background model data. The standard decoder ignores the unknown auxiliary fields and bypasses the background model data. The system of this embodiment reuses an existing core video codec, which keeps implementation cost low. The system of this embodiment thus provides backward compatibility with legacy and existing systems.

In another embodiment, the systems and methods of the present application send the background at an enhanced level of representation relative to the foreground. In particular embodiments, the background data is sent at a higher resolution and a higher dynamic range. This is advantageous for several reasons. For example, while a conventional hybrid video codec could be modified to send high-resolution intraframes and low-resolution predicted frames, intraframes may require many bits to encode and therefore cannot be transferred in a low-latency implementation without interrupting the video stream. With background transmission in the outer layer according to this embodiment, the core video transmission proceeds normally, without interruption, while the background transmission is being completed.

Compared to sending high-resolution intraframes, background transmission in the outer layer according to this embodiment also allows the core encoder to be simpler. This provides cost savings and broad system compatibility.

Simulated Pan-Tilt-Zoom

According to another embodiment, as described above, the system of the present application simulates PTZ operations. In this embodiment, the view is determined by the simulated PTZ processing on the receiving side rather than being fixed on the transmitting side. All receiving users can therefore access different views of the far side. Because the simulated PTZ is not constrained by mechanics, in other embodiments it is open to many additional transitions and transformations. In particular, in one embodiment, instantaneous switching between views and scrolling of views are provided.

Compared with existing PTZ telepresence solutions, these non-mechanical, simulated PTZ systems according to the present application also provide cost savings and further enhance the reliability of telepresence.

Devices and Components

Referring to FIG. 1, in one embodiment, the system of the present application comprises a video source, a transmitter, and a receiver. In a particular embodiment, each of the video source, the transmitter, and the receiver is panoramic.

A panoramic video source according to one embodiment is a device that provides a wide-angle or panoramic digital video stream. In this embodiment, it provides high-bit-rate uncompressed video suitable for further processing. In one embodiment, the video source is a single lens and image sensor assembly; in another embodiment, it comprises multiple lenses and sensors together with suitable image-stitching software or hardware that can emulate the operation of a single lens and sensor. In yet another embodiment, the video source comprises a graphics rendering device that simulates the geometric projection of a three-dimensional (3D) scene onto a surface. The system of this embodiment can therefore be advantageously deployed for computer video games.

In one embodiment, the geometric projection of the panoramic video source may differ from the desired presentation projection. It can therefore be calibrated during the design, manufacture, or setup of the video source device in a form suitable for embedding into the video transmitter, or forwarded to the video transmitter as side information. The transmitter in turn provides this information to the receiver, which can then present the video using another projection. The system of this embodiment thus provides considerable flexibility in presenting the video at the receiver under the desired control, which can be built in by design or supplied by user input. In alternative embodiments, such control can be exercised from the transmitter or the receiver.

The transmitter of the system according to one embodiment includes an external encoder. Referring to Figure 2, in one embodiment, the external encoder receives the panoramic digital video stream and outputs a salient video stream, an encoded background-model update sequence, and geometric projection data. According to one embodiment, this data from the external encoder is then passed to the core encoder of the system. The video stream in a certain embodiment is in uncompressed form and is suitable for compression by a standard video encoder. The encoded background model data and geometric projection data according to another embodiment are multiplexed and framed into a format suitable for transmission in an auxiliary data frame of a standard video encoder. The core encoder of the system in this embodiment outputs an encoded bitstream.

As shown in Figure 4, the core encoder in one embodiment is an H.264/H.265 encoder. The H.264/H.265 core encoder uses the standard's network abstraction layer to send the auxiliary data in an SEI header marked as user data. In a certain embodiment, this data is ignored by receivers that are not configured to receive such SEI headers. As described above, the system thereby provides backward compatibility and facilitates integration into existing telepresence systems.
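As a rough illustration of this carriage mechanism, the sketch below wraps an opaque auxiliary payload in an H.264 SEI NAL unit of the user_data_unregistered kind (payload type 5), including the 16-byte UUID the payload type requires and the emulation-prevention escaping applied to NAL payloads. The function names and the choice of a 4-byte start code are illustrative assumptions; a real transmitter would reuse its encoder's bitstream writer rather than this standalone helper.

```python
def escape_rbsp(rbsp: bytes) -> bytes:
    """Insert emulation-prevention bytes: 00 00 0x becomes 00 00 03 0x (x <= 3)."""
    out = bytearray()
    zeros = 0
    for b in rbsp:
        if zeros >= 2 and b <= 3:
            out.append(3)          # emulation_prevention_three_byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

def sei_user_data_nal(uuid16: bytes, payload: bytes) -> bytes:
    """Build one SEI NAL unit (type 6) carrying a user_data_unregistered message."""
    assert len(uuid16) == 16
    body = uuid16 + payload
    msg = bytearray([5])           # payload_type 5: user_data_unregistered
    size = len(body)
    while size >= 255:             # ff-chained size coding
        msg.append(255)
        size -= 255
    msg.append(size)
    msg += body
    msg.append(0x80)               # rbsp_stop_one_bit + byte alignment
    return b"\x00\x00\x00\x01\x06" + escape_rbsp(bytes(msg))
```

A receiver that does not recognize the UUID simply skips the message, which matches the backward-compatibility behavior described above.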

According to one embodiment, the background model employed in the system of the present application is a parametric model. In such a parametric background model, a number of statistics are determined for each pixel based on samples from past video frames. According to another embodiment, the background model is a non-parametric model. In such a non-parametric background model, multiple samples from past video frames are stored or aggregated for each pixel; no statistics or parameters in a finite-dimensional space are determined. According to one embodiment, the non-parametric background model is the Visual Background Extractor (ViBe). In another embodiment, the parametric background model is a Mixture of Gaussians (MOG). In certain embodiments of the present application, the background model of the system is a three-dimensional model and supports VR/AR applications. For the purposes of the various embodiments of the present application, the term "three-dimensional" covers the case in which the model is an image from a single viewpoint with a depth for each point in the image, sometimes referred to as "2.5-dimensional".
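As a minimal illustration of a parametric per-pixel model, the sketch below maintains a single running Gaussian (mean and variance) per pixel — a one-mode simplification of the MOG model named above; ViBe, by contrast, would keep a set of raw past samples per pixel. The learning rate, initial variance, and k-sigma threshold are assumed values, not parameters from the patent.

```python
import numpy as np

class GaussianBackground:
    """Per-pixel running Gaussian: a single-mode simplification of MOG."""
    def __init__(self, first_frame, alpha=0.05, k=2.5):
        self.mean = first_frame.astype(np.float64)
        self.var = np.full(first_frame.shape, 15.0 ** 2)  # assumed initial variance
        self.alpha, self.k = alpha, k

    def apply(self, frame):
        """Classify each pixel and update the model on background pixels only."""
        frame = frame.astype(np.float64)
        d2 = (frame - self.mean) ** 2
        foreground = d2 > (self.k ** 2) * self.var        # outside k sigma
        bg = ~foreground
        a = self.alpha
        self.mean[bg] += a * (frame - self.mean)[bg]
        self.var[bg] += a * (d2 - self.var)[bg]
        return foreground
```

After a few static frames the variance shrinks toward sensor noise, so a new object stands out as foreground while gradual illumination drift is absorbed into the model.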

According to one embodiment, the background model of the system is initialized from pixels in video frames known to be background, either by controlling the scene or by bootstrapping with a simpler background model. In an alternative embodiment, the system assumes that all pixels are part of the background when the background model is initialized.

After initialization, in one embodiment, the background model is updated based on changes in the background observed in new samples that are determined, according to the model, to be or to likely be background.

According to one embodiment, the updates are encoded by predicting each update from previously reconstructed updates and sending only the difference between the predicted update and the actual update, i.e., the residual. In another embodiment, the bit rate of the residual is further reduced by quantization and entropy coding.

Referring to Figures 10 and 11, according to certain embodiments of the present application, updates are reconstructed by the same process in both the background encoder and the background decoder. The residual is first decoded by reversing the entropy coding and quantization; each update or group of updates is then predicted from previous updates, and the actual update is reconstructed by adding the residual to the predicted update.
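The predict/quantize/reconstruct loop described here can be sketched as follows. The copy-last-reconstruction predictor and the fixed quantization step are simplifying assumptions, and entropy coding is omitted; the point of the sketch is that the encoder reconstructs exactly what the decoder will, so the two stay in lockstep.

```python
import numpy as np

Q = 8  # quantization step (assumed value for illustration)

def encode_update(update, prev_recon):
    """Predict from the previous reconstruction and quantize the residual."""
    pred = prev_recon                        # simplest predictor: last reconstruction
    q_res = np.round((update - pred) / Q).astype(np.int64)
    recon = pred + q_res * Q                 # encoder mirrors the decoder exactly
    return q_res, recon

def decode_update(q_res, prev_recon):
    """Rebuild the update from the transmitted quantized residual."""
    return prev_recon + q_res * Q
```

Because both sides derive the prediction from reconstructed (not original) updates, quantization error never accumulates beyond half a step.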

According to one embodiment, the transmitter of the system includes an external encoder and a core encoder as shown in Figure 1. In this embodiment, the transmitter and its components are implemented in the same physical device. For example, the transmitter in one embodiment is a mobile system-on-chip (SoC). In certain embodiments, the external encoder is implemented in software for the GPU or CPU cores, and the core encoder is implemented using the hardware accelerator for video encoding provided in such an SoC. This SoC implementation of the transmitter favors telepresence systems in which a mobile phone or tablet device provides the transmitter utility.

In another embodiment, the transmitter is implemented in an SoC customized for cameras. Besides the accelerator for video encoding, other functions are implemented as software running on DSP cores. The transmitter of this particular embodiment favors telepresence systems employing stand-alone cameras.

As described above, the video receiver of the present application includes a core decoder. Referring to Figures 5, 7, and 9, in certain embodiments the core decoder receives the encoded bitstream and outputs uncompressed video in addition to the auxiliary data. According to these embodiments, the auxiliary data includes background model data and geometric mapping data. As shown in Figure 3, this data is passed to an external decoder, which according to one embodiment merges the salient video and the background model output, producing an enhanced panoramic video stream. In yet another embodiment, the external decoder changes the geometric mapping of the video, thereby simulating the effect of an optical PTZ camera.

In the event that the auxiliary data channel between the transmitter and the receiver suffers packet loss or other reliability problems, the system in another embodiment of the present application provides the capability to send a request to the transmitter to retransmit the lost packets. These may include portions of the background model data and of other transmitted metadata.
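One way to realize such retransmission is a simple NACK scheme: the sender buffers recently sent auxiliary packets by sequence number, and the receiver tracks gaps in the sequence and requests the missing numbers. The window size, class names, and sequence-number protocol below are illustrative assumptions, not details from the patent.

```python
class AuxChannelSender:
    """Buffers recently sent auxiliary packets so the receiver can NACK them."""
    def __init__(self, window=64):
        self.window = window
        self.sent = {}              # seq -> payload
        self.next_seq = 0

    def send(self, payload):
        seq = self.next_seq
        self.sent[seq] = payload
        self.next_seq += 1
        if len(self.sent) > self.window:   # forget the oldest beyond the window
            del self.sent[min(self.sent)]
        return seq, payload

    def retransmit(self, seq):
        return self.sent.get(seq)          # None if no longer buffered

class AuxChannelReceiver:
    """Tracks sequence gaps; 'missing' holds the seq numbers to NACK."""
    def __init__(self):
        self.expected = 0
        self.missing = set()

    def receive(self, seq, payload):
        if seq > self.expected:
            self.missing.update(range(self.expected, seq))
        self.missing.discard(seq)
        self.expected = max(self.expected, seq + 1)
        return payload
```

Because background updates are differential, a packet that falls out of the sender's window would in practice be replaced by a fresh model refresh rather than retransmitted.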

According to one embodiment, the video receiver of the system is implemented in a cloud service running in a general-purpose data center or on media processors. In another embodiment, the receiver is implemented in the web browser of an end-user device such as a smartphone, tablet, or personal computer. In the web browser, the receiver functionality is in a particular embodiment implemented by a browser extension, or by using standardized web components such as WebRTC (for the core decoder) and WebGL (for the external decoder). In yet another embodiment, the receiver is implemented as a native application in the operating system of an end-user device such as a smartphone, tablet, or personal computer. In yet another embodiment, the receiver is implemented in an appliance dedicated to video communication.

In another embodiment, the receiver is implemented as part of a virtual reality (VR) or augmented reality (AR) system, together with an immersive eyeglass display, head-mounted tracking, or alternative technologies that project selected images onto the user's retina. According to this embodiment, the apparatus and methods of the present application can alleviate the bandwidth limitations of VR/AR-enabled video conferencing systems, in which a distant live image is projected onto the near-end view.

In yet another embodiment, information about the eye gaze and view direction of the VR/AR receiver is relayed back to the camera system of the present application. High-resolution video from that particular view direction is sent accordingly, allowing some additional margin around that particular view direction. In yet another embodiment, the system of the present application adapts through self-learning to map out regions of interest. Specifically, the VR/AR receiver analyzes the eye gaze directions over time, and the regions that receive the most views, or "hits", are encoded at a higher resolution for transmission and presentation.
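The self-learning step can be sketched as accumulating gaze samples into a per-pixel hit map and then selecting the window with the most hits as the region to encode at higher resolution. The exhaustive window search and the fixed square window are simplifying assumptions for illustration; a production system would track hits at tile granularity.

```python
import numpy as np

def accumulate_gaze(shape, gaze_points):
    """Build a hit-count map from (row, col) gaze samples."""
    heat = np.zeros(shape, dtype=np.int64)
    for r, c in gaze_points:
        heat[r, c] += 1
    return heat

def top_roi(heat, size):
    """Return the top-left corner of the size x size window with the most hits."""
    best, best_rc = -1, (0, 0)
    rows, cols = heat.shape
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            s = heat[r:r + size, c:c + size].sum()
            if s > best:
                best, best_rc = s, (r, c)
    return best_rc
```

The chosen window (plus the margin mentioned above) would then drive the encoder's per-region quality allocation.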

According to one embodiment, the system of the present application includes a video source. In certain embodiments, the video source includes one or more moving PTZ cameras. These moving PTZ cameras capture high-resolution video of particular regions of interest ("ROI"), and according to one embodiment, the high-resolution video is merged with a background. In this embodiment, the background is a still image and is rendered at a resolution higher than that of the ROI video, enhancing the VR/AR experience.

According to one embodiment, the moving cameras are synchronized in time and coordinated in position, allowing efficient blending between the ROI videos collected from the multiple cameras.

In another embodiment using spatially moving camera systems as the video source, a three-dimensional model of the background is pre-generated using multiple fixed high-resolution cameras with partially overlapping fields of view (FOV). In one embodiment, these cameras also include background/foreground segmentation filters that distinguish the moving parts of the scene from the non-moving parts. Only the background (stationary) parts of the scene are used to generate the 3D model of the scene. In an alternative embodiment, super-resolution imaging techniques are applied before the 3D model is generated, increasing the resolution of the 3D model.

In yet another embodiment, a combination of gyroscopes and accelerometers for spatial and angular positioning, together with visual information for fine adjustment, is applied to the moving camera video source. Employing simultaneous localization and mapping (SLAM) techniques allows the system of the present application to estimate which parts of the scene are moving and which are not, and thereby to generate a 3D model of the scene.

As an example, when the camera video source is moving, the system in one embodiment determines the moving parts of the scene according to the following steps. First, for each successive video frame, Harris corner feature points (or other types of feature points) are estimated; for each pair of video frames (both adjacent in time, and some pairs with a larger time separation), the rotation and translation of the camera between the frames (with six free axes) are estimated; and outliers are removed. Some outliers are due to noise, while others reflect objects that have moved between the frames. Second, for the Harris corners that are outliers, 3D motion vectors are introduced for the parts of the scene containing the outliers; the movement of these points is estimated; and, for feature points that keep moving together, 3D motion vectors are estimated. A 3D model based on the stationary parts of the scene, taking the pointing of the camera into account, is thus generated.
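The first step above — fitting global camera motion to feature displacements and flagging features that disagree with it — can be sketched as follows. A robust median translation stands in for the full six-degree-of-freedom rotation/translation estimate described in the text, and the deviation tolerance is an assumed value.

```python
import numpy as np

def split_static_moving(prev_pts, curr_pts, tol=1.0):
    """Flag feature points whose motion deviates from the global camera motion.

    prev_pts, curr_pts: (N, 2) arrays of matched feature coordinates.
    Returns the estimated global displacement and a boolean outlier mask;
    outliers correspond to objects that moved between the frames.
    """
    disp = curr_pts - prev_pts
    global_motion = np.median(disp, axis=0)        # robust to a minority of movers
    err = np.linalg.norm(disp - global_motion, axis=1)
    outliers = err > tol
    return global_motion, outliers
```

The inlier (static) features would then feed the 3D background model, while clusters of co-moving outliers receive their own motion vectors as the text describes.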

According to certain embodiments, the receiver and the transmitter in the system of the present application are implemented in the same device for two-way video communication.

Application Areas

According to various embodiments, the system of the present application can be advantageously deployed in real-time video communication (video conferencing and telepresence), live streaming (sports, concerts, event sharing, and competitive computer gaming), traffic monitoring (dashboard cameras, road monitoring, parking lot monitoring and billing), virtual reality, surveillance, home monitoring, storytelling, film, news, social and traditional media, and art installations, among other applications and industries.

In live streaming and two-way communication VR/AR applications in which the bandwidth is not large enough to transmit high-resolution video of the entire scene, according to one embodiment, high-resolution still images (stills) of the entire field of view are sent periodically, while high-resolution video of the selected regions of interest is sent at the regular frequency. In yet another embodiment, the video and the stills are blended locally at the VR/AR receiver, achieving fast AR/VR rendering and low latency. In this context, a typical latency is 20 ms or lower.
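The local blending step amounts to pasting the most recent high-rate ROI frame into the most recent full-view still at the ROI's position (given by the geometric mapping). The hard-edged paste below is a deliberate simplification; a real renderer would feather the seam and resample through the projection.

```python
import numpy as np

def composite(background_still, roi_frame, top, left):
    """Blend a high-rate ROI video frame into the latest full-view still."""
    out = background_still.copy()          # keep the cached still intact
    h, w = roi_frame.shape[:2]
    out[top:top + h, left:left + w] = roi_frame
    return out
```

Because only the small ROI frame arrives at video rate, the receiver can re-composite on every display refresh, which is what keeps the end-to-end latency in the 20 ms range mentioned above.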

The description of the various embodiments provided in the present application, including the various figures and examples, illustrates the present application and its various embodiments and is not intended to be limiting.

Claims (45)

1.一种用于传输和呈现来自多个视场的场景的视频的方法,包括:通过从所述视频确定所述场景的静态背景来初始化三维背景模型;通过与所述视频独立地对所述背景模型进行编码来将所述场景的背景作为所述背景模型进行发送,其中,所述背景模型被增量地更新,并且其中,所述更新被进一步与所述视频独立地编码和发送,并且其中,所述背景模型的增量更新通过辅助数据信道发送;以及通过将所述背景与所述视频合并来在接收器处呈现增强的视频。1. A method for transmitting and presenting video of a scene from multiple fields of view, comprising: initializing a three-dimensional background model by determining a static background of the scene from the video; transmitting the background of the scene as the background model by encoding the background model independently of the video, wherein the background model is incrementally updated, and wherein the updates are further encoded and transmitted independently of the video, and wherein the incremental updates of the background model are transmitted via an auxiliary data channel; and presenting an enhanced video at a receiver by merging the background with the video. 2.根据权利要求1所述的方法,其中,所述接收器是VR/AR设备。2. The method according to claim 1, wherein the receiver is a VR/AR device. 3.根据权利要求2所述的方法,还包括:对来自所述VR/AR接收器的视线方向的关注区域进行自学习;以及发送所述关注区域的高分辨率视频,其中,通过将所述关注区域的所述高分辨率视频与所述背景合并来创建所述增强的视频。3. The method of claim 2, further comprising: self-learning a region of interest from the gaze direction of the VR/AR receiver; and transmitting a high-resolution video of the region of interest, wherein the enhanced video is created by merging the high-resolution video of the region of interest with the background. 4.一种用于传输和呈现来自多个视场的场景的视频的系统,包括:i)发送器,所述发送器包括外部编码器和核心编码器,其中,所述外部编码器适于接收所述视频并分别地将显著视频以及增量地更新的三维背景和几何比特流输出到所述核心编码器中,其中,所述核心编码器适于输出编码比特流,并且其中,所述背景的增量更新通过辅助数据信道发送;以及ii)VR/AR接收器,所述VR/AR接收器包括核心解码器和外部解码器,其中,所述核心解码器适于接收所述编码比特流并且分别地将所述显著视频以及所述增量地更新的背景和几何比特流输出到所述外部解码器中,其中,所述背景的增量更新通过辅助数据信道接收,并且其中,所述外部解码器适于合并所述显著视频以及所述增量地更新的背景和几何比特流,从而呈现所述场景的增强的视频。4. 
A system for transmitting and presenting video of a scene from multiple fields of view, comprising: i) a transmitter including an external encoder and a core encoder, wherein the external encoder is adapted to receive the video and output salient video and incrementally updated 3D background and geometry bitstreams to the core encoder, wherein the core encoder is adapted to output an encoded bitstream, and wherein the incremental updates of the background are transmitted via an auxiliary data channel; and ii) a VR/AR receiver including a core decoder and an external decoder, wherein the core decoder is adapted to receive the encoded bitstream and output the salient video and the incrementally updated background and geometry bitstreams to the external decoder, wherein the incremental updates of the background are received via an auxiliary data channel, and wherein the external decoder is adapted to merge the salient video and the incrementally updated background and geometry bitstreams to present an enhanced video of the scene. 5.根据权利要求4所述的系统,其中,所述外部编码器包括背景估计单元,所述背景估计单元适于通过从所述视频确定所述场景的静态背景来初始化三维背景模型,并且以比所述视频的比特率更低的比特率增量地更新所述背景模型。5. The system of claim 4, wherein the external encoder includes a background estimation unit adapted to initialize a three-dimensional background model by determining a static background of the scene from the video, and to incrementally update the background model at a bit rate lower than that of the video. 6.根据权利要求4所述的系统,还包括:用于捕获所述场景的视频源。6. The system according to claim 4 further includes: a video source for capturing the scene. 7.根据权利要求6所述的系统,其中,所述视频源包括具有部分重叠的视场的一个或多个摄像机。7. The system of claim 6, wherein the video source comprises one or more cameras having partially overlapping fields of view. 8.根据权利要求7所述的系统,其中,所述摄像机是移动摄像机。8. The system according to claim 7, wherein the camera is a mobile camera. 9.根据权利要求8所述的系统,还适于估计所述场景的移动部分和静止部分。9. 
The system according to claim 8 is further adapted to estimate the moving and stationary portions of the scene. 10.根据权利要求9所述的系统,其中,所述外部编码器包括背景估计单元,所述背景估计单元适于基于所述场景的所述静止部分生成三维背景模型,并且以比所述视频的比特率更低的比特率增量地更新所述背景模型。10. The system of claim 9, wherein the external encoder includes a background estimation unit adapted to generate a three-dimensional background model based on the static portion of the scene, and to update the background model incrementally at a bit rate lower than that of the video. 11.根据权利要求8所述的系统,其中,所述移动摄像机是云台变焦(PTZ)摄像机。11. The system according to claim 8, wherein the mobile camera is a pan-tilt-zoom (PTZ) camera. 12.根据权利要求11所述的系统,其中,所述VR/AR接收器适于对来自其视线方向的关注区域进行自学习,并且其中,所述一个或多个PTZ摄像机适于捕获所述关注区域的高分辨率视频。12. The system of claim 11, wherein the VR/AR receiver is adapted to self-learn an area of interest from its line of sight, and wherein the one or more PTZ cameras are adapted to capture high-resolution video of the area of interest. 13.一种用于传输场景的视频的方法,包括:通过从所述视频确定所述场景的静态背景来初始化背景模型;以及通过与所述视频独立地对所述背景模型进行编码来将所述场景的背景作为所述背景模型进行发送,其中,所述背景模型被增量地更新,并且其中,所述更新被进一步与所述视频独立地编码和发送,并且其中,所述背景模型的增量更新通过辅助数据信道发送。13. A method for transmitting video of a scene, comprising: initializing a background model by determining a static background of the scene from the video; and transmitting the background of the scene as the background model by encoding the background model independently of the video, wherein the background model is incrementally updated, and wherein the update is further encoded and transmitted independently of the video, and wherein the incremental update of the background model is transmitted via an auxiliary data channel. 14.根据权利要求13所述的方法,还包括:通过将所述背景与所述视频合并来在接收器处产生增强的视频。14. The method of claim 13, further comprising: generating an enhanced video at the receiver by merging the background with the video. 15.根据权利要求14所述的方法,其中,以比所述视频的比特率更低的比特率对所述背景模型进行更新和发送。15. 
The method of claim 14, wherein the background model is updated and transmitted at a lower bit rate than that of the video. 16.根据权利要求13所述的方法,还包括:针对每个帧发送所述背景和所述视频之间的几何映射。16. The method of claim 13, further comprising: sending a geometric mapping between the background and the video for each frame. 17.根据权利要求16所述的方法,还包括:通过场景分析来确定所述视频的视场。17. The method of claim 16, further comprising: determining the field of view of the video through scene analysis. 18.根据权利要求13所述的方法,其中,所述背景模型抑制所述视频的所述背景中的噪声变化。18. The method of claim 13, wherein the background model suppresses noise variations in the background of the video. 19.根据权利要求13所述的方法,还包括:通过标准视频编解码器来压缩所述视频。19. The method of claim 13, further comprising: compressing the video using a standard video codec. 20.根据权利要求19所述的方法,其中,所述视频编解码器是H.264、H.265、VP8、和VP9之一。20. The method of claim 19, wherein the video codec is one of H.264, H.265, VP8, and VP9. 21.根据权利要求20所述的方法,其中,所述背景在由H.264、H.265、VP8、和VP9之一定义的辅助数据信道中发送。21. The method of claim 20, wherein the background is transmitted in an auxiliary data channel defined by one of H.264, H.265, VP8, and VP9. 22.根据权利要求13所述的方法,其中,所述背景模型是参数模型。22. The method according to claim 13, wherein the background model is a parametric model. 23.根据权利要求22所述的方法,其中,所述参数模型是高斯混合(MOG)。23. The method of claim 22, wherein the parameter model is a Gaussian mixture (MOG). 24.根据权利要求13所述的方法,其中,所述背景模型是非参数模型。24. The method of claim 13, wherein the background model is a nonparametric model. 25.根据权利要求24所述的方法,其中,所述非参数模型是视觉背景提取器(ViB)。25. The method of claim 24, wherein the nonparametric model is a visual background extractor (ViB). 26.一种用于在场景的视频上模拟云台变焦操作的方法,包括:通过从所述视频确定所述场景的静态背景来初始化背景模型;通过与所述视频独立地对所述背景模型进行编码来将所述场景的背景作为所述背景模型进行发送,其中,所述背景模型被增量地更新,其中,所述更新被进一步与所述视频独立地编码和发送,其中,所述背景模型的增量更新通过辅助数据信道发送,并且其中,针对每个帧发送所述背景和所述视频之间的几何映射;通过场景分析来选择所述视频的一个或多个视场;以及通过将所述背景与所述视频合并来在接收器处产生增强的视频。26. 
A method for simulating gimbal zoom operation on a video of a scene, comprising: initializing a background model by determining a static background of the scene from the video; transmitting the background of the scene as the background model by encoding the background model independently of the video, wherein the background model is incrementally updated, wherein the updates are further encoded and transmitted independently of the video, wherein the incremental updates of the background model are transmitted via an auxiliary data channel, and wherein a geometric mapping between the background and the video is transmitted for each frame; selecting one or more fields of view of the video by scene analysis; and generating an enhanced video at a receiver by merging the background with the video. 27.根据权利要求26所述的方法,其中,在所述接收器处控制所述模拟的云台变焦操作。27. The method of claim 26, wherein the simulated gimbal zoom operation is controlled at the receiver. 28.根据权利要求26所述的方法,其中,在所述视频的发送器处控制所述模拟的云台变焦操作。28. The method of claim 26, wherein the simulated gimbal zoom operation is controlled at the transmitter of the video. 29.一种用于传输场景的视频的系统,包括:i)发送器,所述发送器包括外部编码器和核心编码器,其中,所述外部编码器适于接收所述视频并分别地将显著视频以及增量地更新的背景和几何比特流输出到所述核心编码器中,其中,所述背景的增量更新通过辅助数据信道发送,并且其中,所述核心编码器适于输出编码比特流;以及ii)接收器,所述接收器包括核心解码器,其中,所述背景的增量更新通过辅助数据信道接收,并且其中,所述核心解码器适于接收所述编码比特流并且输出所述显著视频。29. 
A system for transmitting video of a scene, comprising: i) a transmitter including an external encoder and a core encoder, wherein the external encoder is adapted to receive the video and output salient video and incrementally updated background and geometry bitstreams to the core encoder, wherein the incremental updates of the background are transmitted via an auxiliary data channel, and wherein the core encoder is adapted to output an encoded bitstream; and ii) a receiver including a core decoder, wherein the incremental updates of the background are received via the auxiliary data channel, and wherein the core decoder is adapted to receive the encoded bitstream and output the salient video. 30.一种用于传输场景的视频的系统,包括:i)发送器,所述发送器包括外部编码器和核心编码器,其中,所述外部编码器适于接收所述视频并分别地将显著视频以及增量地更新的背景和几何比特流输出到所述核心编码器中,其中,所述背景的增量更新通过辅助数据信道发送,并且其中,所述核心编码器适于输出编码比特流;以及ii)接收器,所述接收器包括核心解码器和外部解码器,其中,所述核心解码器适于接收所述编码比特流并且分别地将所述显著视频以及所述增量地更新的背景和几何比特流输出到所述外部解码器中,其中,所述背景的增量更新通过辅助数据信道接收,并且其中,所述外部解码器适于合并所述显著视频以及所述增量地更新的背景和几何比特流,从而输出所述场景的增强的视频。30. 
A system for transmitting video of a scene, comprising: i) a transmitter including an external encoder and a core encoder, wherein the external encoder is adapted to receive the video and output salient video and incrementally updated background and geometry bitstreams respectively to the core encoder, wherein the incremental updates of the background are transmitted via an auxiliary data channel, and wherein the core encoder is adapted to output an encoded bitstream; and ii) a receiver including a core decoder and an external decoder, wherein the core decoder is adapted to receive the encoded bitstream and output the salient video and the incrementally updated background and geometry bitstreams respectively to the external decoder, wherein the incremental updates of the background are received via an auxiliary data channel, and wherein the external decoder is adapted to merge the salient video and the incrementally updated background and geometry bitstreams to output an enhanced video of the scene. 31.根据权利要求30所述的系统,其中,所述外部编码器包括背景估计单元,所述背景估计单元适于通过从所述视频确定所述场景的静态背景来初始化背景模型,并且以比所述视频的比特率更低的比特率增量地更新所述背景模型。31. The system of claim 30, wherein the external encoder includes a background estimation unit adapted to initialize a background model by determining a static background of the scene from the video, and to incrementally update the background model at a bit rate lower than that of the video. 32.根据权利要求31所述的系统,其中,所述外部编码器还包括连接到所述背景估计单元的背景编码器,所述背景编码器适于与所述视频独立地对所述背景模型和所述更新进行编码。32. The system of claim 31, wherein the external encoder further comprises a background encoder connected to the background estimation unit, the background encoder being adapted to encode the background model and the update independently of the video. 33.根据权利要求32所述的系统,其中,所述背景编码器包括熵编码器、熵解码器、更新预测单元、和更新存储单元。33. The system of claim 32, wherein the background encoder comprises an entropy encoder, an entropy decoder, an update prediction unit, and an update storage unit. 
34.根据权利要求33所述的系统,其中,所述背景编码器在下游方向连接到比特流复用器。34. The system of claim 33, wherein the background encoder is connected to the bitstream multiplexer in the downstream direction. 35.根据权利要求34所述的系统,其中,所述外部编码器还包括显著性成帧单元,所述显著性成帧单元适于将几何比特流输出到所述比特流复用器中,其中,所述比特流复用器适于合并所述几何比特流和所述背景比特流,从而输出背景和几何比特流。35. The system of claim 34, wherein the external encoder further comprises a saliency framing unit adapted to output a geometric bitstream to the bitstream multiplexer, wherein the bitstream multiplexer is adapted to merge the geometric bitstream and the background bitstream to output a background and geometric bitstream. 36.根据权利要求35所述的系统,其中,所述外部编码器还包括能够对所述视频进行缩放和裁剪的缩减单元,所述缩减单元在下游方向连接到噪声抑制单元,所述噪声抑制单元适于基于所述背景模型来抑制所述显著视频中的噪声。36. The system of claim 35, wherein the external encoder further includes a reduction unit capable of scaling and cropping the video, the reduction unit being connected downstream to a noise suppression unit adapted to suppress noise in the significant video based on the background model. 37.根据权利要求36所述的系统,其中,所述外部解码器还包括:i)比特流解复用器,适于从所述核心编码器接收所述背景和几何比特流并分别地输出所述几何比特流和所述背景比特流;ii)背景解码器,连接到所述比特流解复用器并适于接收所述背景比特流;以及iii)背景合并单元,在下游方向连接到所述比特流解复用器和所述背景解码器,其中,所述背景合并单元适于从所述核心解码器接收所述显著视频,并且将所述几何比特流和所述背景比特流与所述显著视频合并,从而产生所述场景的增强的视频。37. The system of claim 36, wherein the external decoder further comprises: i) a bitstream demultiplexer adapted to receive the background and geometry bitstreams from the core encoder and output the geometry bitstream and the background bitstream respectively; ii) a background decoder connected to the bitstream demultiplexer and adapted to receive the background bitstream; and iii) a background merging unit connected downstream to the bitstream demultiplexer and the background decoder, wherein the background merging unit is adapted to receive the salient video from the core decoder and merge the geometry bitstream and the background bitstream with the salient video to produce an enhanced video of the scene. 
38. The system of claim 37, wherein the background decoder comprises an entropy decoder, an update prediction unit, and an update storage unit.

39. The system of claim 37, wherein the external decoder further comprises a virtual pan-tilt-zoom unit capable of receiving control input, thereby producing the enhanced video.

40. The system of claim 37, wherein the core encoder is an H.264/H.265 video encoder, and wherein the background and geometry bitstream is carried through the network abstraction layer of the core encoder.

41. The system of claim 37, wherein the core decoder is an H.264/H.265 video decoder, and wherein the background and geometry bitstream is carried through the network abstraction layer of the core decoder.

42. The system of claim 37, wherein the core encoder is in a multimedia container format, and wherein the background and geometry bitstream is carried through an auxiliary data channel of the core encoder.

43. The system of claim 37, wherein the core decoder is in a multimedia container format, and wherein the background and geometry bitstream is carried through an auxiliary data channel of the core decoder.

44. The system of claim 37, wherein the core encoder is a standard video encoder, and wherein the background and geometry bitstream is carried through an auxiliary data channel of the core encoder.

45. The system of claim 37, wherein the core decoder is a standard video decoder, and wherein the background and geometry bitstream is carried through an auxiliary data channel of the core decoder.
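The bitstream multiplexer of claim 35 and the demultiplexer of claim 37 i) can be sketched with a simple tag-and-length framing scheme. The tag values and the length-prefixed wire format are illustrative assumptions; the claims prescribe no particular framing (in claims 40-45 the combined stream would instead ride in NAL units or an auxiliary data channel of the core codec).

```python
import struct

# Hypothetical tags for the two sub-streams named in claims 34-37.
TAG_BACKGROUND = 0
TAG_GEOMETRY = 1

def mux(background_bits: bytes, geometry_bits: bytes) -> bytes:
    """Merge the background and geometry bitstreams into a single
    stream: each payload is prefixed with a 1-byte tag and a
    big-endian 4-byte length."""
    out = b""
    for tag, payload in ((TAG_BACKGROUND, background_bits),
                         (TAG_GEOMETRY, geometry_bits)):
        out += struct.pack(">BI", tag, len(payload)) + payload
    return out

def demux(stream: bytes) -> dict:
    """Split the combined stream back into its tagged parts."""
    parts, pos = {}, 0
    while pos < len(stream):
        tag, length = struct.unpack_from(">BI", stream, pos)
        pos += 5  # tag (1 byte) + length field (4 bytes)
        parts[tag] = stream[pos:pos + length]
        pos += length
    return parts
```

A round trip through `mux` and `demux` recovers both sub-streams, which is the only property the claimed multiplexer/demultiplexer pair requires.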
HK17110917.0A 2015-01-22 2016-01-22 Video transmission based on independently encoded background updates HK1237168B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US14/603,212 2015-01-22

Publications (2)

Publication Number Publication Date
HK1237168A1 HK1237168A1 (en) 2018-04-06
HK1237168B (en) 2022-02-11


Similar Documents

Publication Publication Date Title
AU2016209079B2 (en) Video transmission based on independently encoded background updates
CN107211081B (en) Video transmission based on independent coding of background updates
US11184584B2 (en) Method for image decoding, method for image encoding, apparatus for image decoding, apparatus for image encoding
US20100238264A1 (en) Three dimensional video communication terminal, system, and method
KR102214085B1 (en) Method and apparatus for transmitting and receiving metadata for a plurality of viewpoints
KR20190004280A (en) Hybrid graphics and pixel domain architecture for 360-degree video
Ahmad Multi-view video: get ready for next-generation television
CA3018600C (en) Method, apparatus and stream of formatting an immersive video for legacy and immersive rendering devices
CN107835435B (en) A wide-angle live broadcast device for events and an associated live broadcast system and method
KR101941789B1 (en) Virtual reality video transmission based on viewport and tile size
KR102183895B1 (en) Indexing of tiles for region of interest in virtual reality video streaming
CN111726598A (en) Image processing method and device
Shafiei et al. Jiku live: a live zoomable video streaming system
HK1237168B (en) Video transmission based on independently encoded background updates
CN116016961A (en) VR content live broadcast method, device and storage medium
HK1237168A1 (en) Video transmission based on independently encoded background updates
Fautier VR video ecosystem for live distribution
CN117440176A (en) Methods, devices, equipment and media for video transmission
Balaji et al. A Low Delay Scalable HEVC Encoder with Multi-chip Configuration for HFR Video.
CN117440175A (en) Methods, devices, systems, equipment and media for video transmission