WO2020048617A1 - Latency efficient streaming of video frames for machine vision over an IP network - Google Patents

Latency efficient streaming of video frames for machine vision over an IP network

Info

Publication number
WO2020048617A1
WO2020048617A1 (PCT/EP2018/074190)
Authority
WO
WIPO (PCT)
Prior art keywords
encoded video
video frame
frames
encoded
indication
Application number
PCT/EP2018/074190
Other languages
French (fr)
Inventor
Bence FORMANEK
Peter Vaderna
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Application filed by Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2018/074190
Publication of WO2020048617A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643 Communication protocols
    • H04N21/64322 IP
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/238 Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H04N21/23805 Controlling the feeding rate to the network, e.g. by controlling the video pump
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/238 Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H04N21/2381 Adapting the multiplex stream to a specific network, e.g. an Internet Protocol [IP] network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44004 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving video buffer management, e.g. video decoder buffer or video display buffer

Definitions

  • Figure 6 schematically presents a system 600 that is capable of latency efficient streaming of video frames for machine vision over an IP network.
  • the system comprises a video recording device 602 and a server 612 located within a cloud.
  • the video recording device 602 and the server 612 are adapted to be in communication with each other over said IP network.
  • the video recording device is adapted to capture and encode video frames, forming a stream of video frames.
  • the video recording device 602 may comprise a high definition (HD) camera 604.
  • the video recording device 602 further comprises a processor circuit 606 and a memory 608.
  • the memory has instructions executable by said processor circuit 606, wherein said processor circuit 606, when executing the instructions, is configured, as soon as each video frame is encoded, to add to each encoded video frame an indication of the end of said each encoded video frame.
  • the processor circuit 606 is also configured, as soon as each video frame is encoded, to send the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication.
  • the server 612 within the system 600 comprises another processor circuit 614 and another memory 616.
  • Said another memory 616 has other instructions executable by said another processor circuit 614, wherein said another processor circuit 614 when executing said other instructions is configured to receive the bursts of singular encoded video frames having said indication.
  • the processor circuit 614 when executing said other instructions is configured to, for each singular encoded video frame of the stream of encoded video frames, forward, to a decoder, said each encoded video frame at a moment, according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server 612.
  • the decoder may be implemented as software as a part of the instructions of the memory 616.
  • the processor circuit 606, of the video recording device, when executing the instructions may be configured to add information about the size of each encoded video frame, and where said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
  • the processor circuit 606, of the video recording device 602, when executing the instructions may be configured to add the indication to an end part of each encoded video frame, and wherein said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
  • the processor circuit 606, of the video recording device 602, when executing the instructions, may be configured to add an access unit delimiter (AuDL) to an end of a network abstraction layer (NAL) unit following said each encoded video frame, and wherein said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
  • the encoded video frames may be encoded with any one of: H.264 and H.265 codecs.
  • the encoded video frames may comprise intra and inter coded video frames. These intra and inter coded video frames may comprise I-frames and P-frames, respectively.
  • each encoded video frame in the stream of encoded video frames may comprise a timestamp of its capture by the video recording device.
  • Examples and/or embodiments of the present disclosure carry the following advantages: Latency efficient streaming of video frames for machine vision over an IP network is provided. It is an advantage that commodity hardware and software video encoders, as well as decoders, available in generic computers, mobile phones and IoT devices, may be utilized.
  • the full available bandwidth of the network may be used for the transmission of video frames, as they are transmitted in bursts.
  • An advantage is the optional attachment to video frames of a timestamp of the capture of the picture, enabling a recovery of the relative timing of video frames.

Abstract

The present disclosure relates to a method and system (600) capable of latency efficient streaming of video frames for machine vision over an IP network. The system comprises a server (612) and a video recording device (602) that is adapted to capture and encode video frames, forming a stream of encoded video frames. An indication of the end of the video frame is added (52) to each encoded video frame. These encoded video frames are sent (54) in bursts of singular encoded video frames, over the IP network to the server. The server forwards (58) each encoded video frame to a decoder at a moment according to the indication of the end of said each encoded video frame, whereby each encoded video frame is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.

Description

LATENCY EFFICIENT STREAMING OF VIDEO FRAMES FOR MACHINE VISION
OVER AN IP NETWORK
TECHNICAL FIELD
This disclosure relates to latency efficient streaming. More particularly, it relates to a method and a system capable of latency efficient streaming of video frames for machine vision over an Internet protocol (IP) network.
BACKGROUND
Mixed reality (MR) is a technology that allows virtual imagery or objects to be mixed with a real-world physical environment, enabling physical and virtual objects to coexist in real-time. Mixed reality generally encompasses augmented reality (AR), wherein elements in a real-world environment are augmented in a live direct or indirect view of that real-world environment. MR applications today are usually device centric, i.e. it is in the device itself, such as an actual headset, smart glasses or smartphone, where data processing and the applications themselves - detecting physical objects, creating and rendering virtual objects on top of the view of the real world - run.
There are many advantages to moving such an MR environment to the cloud, where applications are implemented e.g. as interconnected microservices, and to AR/MR devices that connect to and view this MR environment. An AR/MR system in the cloud, however, needs sensor information to build a virtual view of the real world. To achieve real-time operation, transmission of sensor information, e.g. video streams taken by cameras, to the cloud needs to have low latency.
Current industrial robot deployments often comprise a robotic arm, sensors and a controller deployed very close to each other. The controller, the sensors and the arm are typically connected via proprietary industrial protocols. Moving robotics and automation algorithms into the cloud involves problems similar to those of AR/MR technology. Control commands and sensor information, such as video data from camera sensors, need to be transmitted between robot clusters and the cloud with low latency.
The output of imaging systems is traditionally used merely for visualization for humans. With the evolved technology of computer vision, automated two-dimensional (2D) and three-dimensional (3D) image processing can now extract additional information. Examples of problems that can be targeted by computer vision are object recognition, motion estimation, simultaneous localization and mapping (SLAM), etc. Due to the complexity of these problems, computer vision and image processing algorithms often require extra hardware resources such as central processing unit (CPU), graphics processing unit (GPU), and memory. It is a common use-case to collect images from multiple sources and perform image processing centrally in the cloud. Because of the large size of such captured image data, typically some form of video compression is used to reduce the required bandwidth.
Real-time machine vision systems require low latency video transmission and processing. Existing technologies in this field either use specialized video encoders and decoders, or special buffering techniques, to achieve low latency video transmission.
US7421508B2 discloses a buffering mechanism for improving playback of streamed media suffering from packet delay variation due to encoding and packetization.
Patent document US8005149B2 concerns transmission of streaming video at low latency, by slicing each frame or field into a predetermined number of slices, which are compressed separately using standard video compression technologies. Compressed slices are compounded together before being transmitted.
From US9077774B2, low latency is achieved by splitting audio and video data streams from a given audio-video conversation using two different transport protocols to send separate streams over a network, and re-syncing them at the other end.
Video streaming systems are traditionally used for visualization for human viewers. It is noted that human sight is sensitive to variations in frame rate. In order to maintain a correct playback rate, video streaming systems often use multiple stages of buffering between video encoder and decoder, especially on packet-switched IP networks.
However, multiple stage buffering often becomes the main source of latency in video streaming.
For low latency streaming, specialized low latency video encoders/decoders may be used. However, these are not available on generic devices, e.g. Internet of things (IoT) devices, mobile phones, etc.
SUMMARY
It is an object of embodiments of the disclosure to address at least some of the issues outlined above, and this object and others are achieved by a system, a method performed therein, a computer program and a computer-readable storage medium for latency efficient streaming of video frames for machine vision over an Internet protocol network.
According to an aspect, the disclosure provides a method of latency efficient streaming of video frames for machine vision over an Internet protocol network. The method is performed in a system comprising a video recording device and a server in communication with each other over said Internet protocol network, where the video recording device captures and encodes video frames, forming a stream of encoded video frames. The method comprises, within the video recording device, as soon as each video frame is encoded, adding to each encoded video frame an indication of the end of said each encoded video frame. The method also comprises sending the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the Internet protocol network to the server, where each encoded video frame comprises said indication. Within the server, the method comprises receiving the bursts of singular encoded video frames having said indication. In addition, the method comprises, for each singular encoded video frame of the stream of encoded video frames, forwarding to a decoder said each encoded video frame at a moment according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.
According to another aspect, the disclosure provides a computer program for latency efficient streaming of video frames for machine vision over an Internet protocol network, comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out the method according to the aspect above.
According to another aspect, the disclosure provides a computer-readable storage medium, having thereon said computer program.
According to yet another aspect, the disclosure provides a system that is capable of latency efficient streaming of video frames for machine vision over an Internet protocol network. The system comprises a video recording device and a server adapted to be in communication with each other over said Internet protocol network. The video recording device is adapted to capture and encode video frames, forming a stream of video frames. The video recording device further comprises a processor circuit and a memory having instructions executable by said processor circuit. The processor circuit, when executing the instructions, is configured to, as soon as each video frame is encoded, add to each encoded video frame an indication of the end of said each encoded video frame. The processor circuit is also configured to, when executing the instructions, send the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the Internet protocol network to the server, where each encoded video frame comprises said indication. The server comprises another processor circuit and another memory that has other instructions executable by said another processor circuit. When executing said other instructions, said another processor circuit is configured to receive the bursts of singular encoded video frames having said indication. For each singular encoded video frame of the stream of encoded video frames, said another processor circuit is further configured to forward, to a decoder, said each encoded video frame at a moment, according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.
Embodiments of the present disclosure comprise the following advantages:
Latency efficient, i.e. essentially real-time, image processing is enabled in the cloud by using existing commodity hardware and software video encoders, as well as decoders, available in generic computers, mobile phones and Internet of things devices.
By removing, or at least reducing, buffering of video frames in a bit stream between an encoder and a decoder, low latency streaming from Internet of things devices or mobile phones is achieved.
It is further an advantage that image processing on a server receiving the bit stream of video frames may be performed in real-time, by using a generic server or a server with a generic graphics processing unit (GPU).
The latency efficient streaming thus achieved is typically a streaming with a latency that is lower than a full frame time, i.e. sub-frame time latency, without the need to use specialized low latency video codec pairs.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments will now be described in more detail, and with reference to the accompanying drawings, in which:
- Figure 1 illustrates a processing path for latency efficient streaming for machine vision over an IP network, related to embodiments of the present disclosure;
- Figure 2 presents a schematic illustration of a bit stream, related to embodiments of the present disclosure;
- Figure 3 schematically illustrates transfer of video frames related to an embodiment of the present disclosure;
- Figure 4 schematically illustrates transfer of video frames related to the prior art;
- Figure 5 presents a flow chart of actions according to embodiments of the present disclosure;
- Figure 6 presents a system, according to embodiments of the present disclosure.
DETAILED DESCRIPTION
In the following description, different embodiments of the disclosure will be described in more detail, with reference to accompanying drawings. For the purpose of explanation and not limitation, specific details are set forth, such as particular examples and techniques in order to provide a thorough understanding.
In order to maintain a correct playback rate, video streaming equipment typically uses several stages of buffering between video encoding and video decoding, especially on packet-switched IP networks.
It is noted that buffering of frames is often the main source of latency in video streaming.
Further, buffering is often used because modern video compression algorithms typically create a varying number of bits for each input image. The size of each encoded image depends on the type of frame. Frames may comprise intra-coded "I"-frames or predictively, inter-coded "P"-frames. Intra-coded frames may have 5-10 times the size in bits of predictively coded frames.
There is also a variation among frames depending on the content of each frame, especially if variable bitrate (VBR) encoding is used, but it also exists in constant bitrate (CBR) mode.
A video streaming device may comprise an output buffer at the output of a video encoder in order to smooth out the variations described above, and to create a CBR stream, in CBR mode, or an average bitrate stream, in VBR mode.
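As a rough, illustrative calculation (the numbers below are assumptions chosen for the example, not taken from the disclosure), smoothing a large intra-coded frame to a constant bitrate is exactly what forces buffering, and hence latency, of several frame times:

```python
# Illustrative numbers only: a 400 kbit/s CBR stream at 25 fps.
BITRATE = 400_000                      # bits per second
FPS = 25.0
frame_slot_ms = 1000.0 / FPS           # 40 ms per frame slot
avg_frame_bits = BITRATE / FPS         # 16,000 bits per average frame
i_frame_bits = 8 * avg_frame_bits      # intra frames are ~5-10x larger (8x here)

# A CBR link drains one average frame per slot, so a large I-frame needs
# several slots to clear the encoder's output buffer:
drain_ms = i_frame_bits / BITRATE * 1000.0
print(f"I-frame drain: {drain_ms:.0f} ms vs. a {frame_slot_ms:.0f} ms frame slot")
# -> I-frame drain: 320 ms vs. a 40 ms frame slot
```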
Data packets containing video data are typically transmitted using a best-effort network protocol, which cannot guarantee a constant transmission delay. Rather, the delay may vary from packet to packet. A varying delay creates delay jitter, and to avoid stalls in playback caused by this delay jitter, a delay jitter buffer may be used at the input of a streaming video receiver.
In addition, there is a buffer at the input of the video decoder. This buffer receives constant (or average) bitrate video from the delay jitter buffer and ensures that there is always a full coded picture in the buffer at every frame time, when the decoder has to decompress the next picture to display.
An accumulated latency of a streaming system may be as large as a few seconds in non-conversational video streams, mainly caused by the size of the delay jitter buffer. In conversational video streams, the accumulated latency may be around 150-200 ms.
A task of the present disclosure is to provide latency efficient streaming, i.e. a streaming that is efficient in terms of latency, and thus provide streaming with a latency that typically is reduced when compared to many other alternative streaming techniques.
In order to achieve this task, it is herein proposed to remove a significant cause of latency in video streaming. This significant cause, or contribution, comprises the buffering between a video encoder output and a decoder input. The idea of the present disclosure is that encoded video frames are transmitted as bursts, as soon as they are ready, by using all available network bandwidth.
In short, a frame end signal is for this reason placed at the end of the transmitted burst by the transmitting part. As soon as the frame end signal is received by the receiving part, the video frame is handed over to a decoder to produce, for instance, a decoded image for an image processing function. In addition, a capture timestamp may be attached to every frame to assist in recovering the correct timing of the frames, if needed.
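A minimal sender-side sketch of this idea is given below. The `camera` and `encoder` objects, the marker bytes and the 8-byte timestamp header are illustrative assumptions, not the patent's exact interfaces (the disclosure multiplexes frames and timestamps into an MPEG transport stream, described further down):

```python
import socket
import struct
import time

# End-of-frame indication appended after each encoded frame; here an H.264
# access unit delimiter (AUD) NAL unit is used, as discussed further below.
END_OF_FRAME = b"\x00\x00\x00\x01\x09\xf0"

def stream_frames(camera, encoder, sock: socket.socket, server_addr) -> None:
    """Capture, encode and burst one frame at a time, as soon as it is ready."""
    while True:
        image = camera.capture()              # one full picture
        capture_ts = time.monotonic_ns()      # optional capture timestamp
        frame = encoder.encode(image)         # one encoded access unit
        # Timestamp header, frame data, then the explicit end indication;
        # the whole frame is sent at once rather than paced to a bitrate.
        burst = struct.pack("!Q", capture_ts) + frame + END_OF_FRAME
        sock.sendto(burst, server_addr)       # real code splits into MTU-sized datagrams
```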
The disclosure comprises a video encoder-decoder pair connected over an IP network.
Figure 1 schematically illustrates a processing path for latency efficient streaming for machine vision over an IP network, related to embodiments of the present disclosure. The processing path involves a video recording device and, located in the cloud, one or more servers.
The video recording device may comprise a high definition (HD) camera, a video encoder and a packetizer. The HD camera captures a video and transfers a stream of frames to the video encoder for encoding of video frames. Once encoded, the video frames are transferred to the packetizer, where they are packetized into IP packets and transmitted to a server in the cloud. The server located in the cloud may comprise a de-packetizer, a video decoder and an image processor.
The present disclosure will be described further below.
It is noted that Figure 1 may be regarded as an illustration of a video encoder-decoder pair connected by an IP network.
As mentioned above, this disclosure proposes to remove a significant cause of latency in video streaming, i.e. the buffering between a video encoder output and a video decoder input.
It is common that hardware video encoders available on a system on a chip (SoC) within mobile phones or Internet of things (IoT) devices, as well as software encoders and decoders available on different systems, can typically only handle a single full image at a time.
Within the video recording device of Figure 1, the video encoder may be adapted to receive a full image captured by the high definition (HD) camera. The video encoder is further adapted to process, i.e. encode, the picture asynchronously, and to forward the encoded video frame to the packetizer as soon as the encoding of the frame is completed. The components in the server, i.e. the de-packetizer, the video decoder and the image processor, are all adapted to process received video frames asynchronously, which for instance means that the de-packetizer may be ready to receive the next video frame while the decoder is busy decompressing the preceding video frame. As soon as the encoded video frame is decoded, the decoded video frame may thus be handed over to an image processing function, which may also be adapted to process video frames asynchronously.
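One way to realize such asynchronous stages is sketched below, under the assumption of a simple thread-per-stage design; the stage functions are placeholders standing in for the Figure 1 server-side components:

```python
import queue
import threading

def run_stage(fn, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """One pipeline stage on its own thread: it can accept frame n+1 while
    the stage downstream is still busy with frame n."""
    while True:
        outbox.put(fn(inbox.get()))

# Placeholders for the server-side components of Figure 1.
depacketize = decode = image_process = lambda frame: frame

rx_q, dec_q, img_q, out_q = (queue.Queue() for _ in range(4))
for fn, q_in, q_out in ((depacketize, rx_q, dec_q),
                        (decode, dec_q, img_q),
                        (image_process, img_q, out_q)):
    threading.Thread(target=run_stage, args=(fn, q_in, q_out), daemon=True).start()
```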
It is noted that in up-to-date compressed video streams, e.g. H.264 and H.265, the beginning of each encoded video frame is signaled. By signaling the beginning of each encoded video frame, a video decoder may detect when each encoded video frame begins, and thus when to start decoding, for instance, a full image.
However, the end of an encoded video frame may only be detected indirectly, by waiting for the beginning of the next video frame and concluding that the end must have passed, since the next, later video frame has arrived. Even though information may thus be gained that the first video frame has arrived, there is no explicit information about the moment in time when the video frame was fully received. Depending on the bitrate mode being used, there may also be residual waiting time between the end of a video frame and the beginning of the next video frame.
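This conventional, indirect delimiting can be pictured as scanning the buffered bit stream for the next Annex B start code; a simplified sketch (a real parser must also inspect NAL unit types, since one frame usually spans several NAL units):

```python
START_CODE = b"\x00\x00\x01"   # H.264/H.265 Annex B start code prefix

def frame_end_by_next_start(buf: bytes, search_from: int) -> int:
    """Indirect end detection: frame n is only known to be complete once the
    start code of frame n+1 appears, costing up to an extra frame time."""
    return buf.find(START_CODE, search_from)   # -1 while frame n+1 is still in flight
```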
In generic use cases, where typically multiple video frames are buffered at the input of a video decoder, adding an extra frame time as required for the buffering will seldom cause problems in terms of latency.
However, increasing the latency by an extra frame time will negatively impact latency-sensitive cases.
Thus, in order to remove, or at least diminish, this extra frame time latency, the packetizer may be adapted to add an end-of-frame signal to the bit stream of encoded video frames.
Generally, an end-of-frame signal may be added by adding to each encoded video frame an indication of the end of said each encoded video frame.
In one embodiment, maintaining compatibility with H.264 video compression, an indication may be added in the form of an access unit delimiter (AuDL) following the network abstraction layer (NAL) unit(s) containing the encoded video frame in question.
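Sketched concretely below; the byte values are the standard H.264 Annex B encoding of an access unit delimiter, while appending it after the frame, rather than before the next one, is the repurposing proposed here:

```python
# An H.264 access unit delimiter: 4-byte Annex B start code, NAL header with
# nal_unit_type = 9 (0x09), and one payload byte 0xF0 carrying
# primary_pic_type = 7 ("any slice type") plus the RBSP trailing bits.
AUD = b"\x00\x00\x00\x01\x09\xf0"

def mark_frame_end(access_unit: bytes) -> bytes:
    """Append the AUD after the frame's NAL units so that it doubles as an
    explicit end-of-frame indication for the receiver."""
    return access_unit + AUD
```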
Figure 2 presents an illustration of a piece of a bit stream comprising encoded video frames. An H.264-encoded video frame is presented, with an access unit delimiter (AuDL) at its beginning. Within the H.264 standard, the AuDL NAL unit is intended to signal the beginning of each video frame.
However, within this disclosure, the AuDL is adapted to signal the end of a video frame. When buffering the bit stream of video frames, a decoder that receives a bit stream of video frames, one after the other, may use the AuDL as a pointer to the beginning of an H.264 access unit, or use the AuDL as an indicator of the end of each video frame. In this case of buffered video frames, the decoder will therefore not experience a difference between the two interpretations of the AuDL NAL unit. The difference may thus be considered to disappear.
The illustration of the piece of a bit stream comprising the encoded video frames of Figure 2 also presents one picture parameter set (PPS) per H.264-encoded video frame, or video frame burst. Further, the singular encoded video frame is here denoted a video coded layer (VCL) video frame.
A packetizer may be adapted to create a Moving Picture Experts Group (MPEG) transport stream (TS) multiplex and to place time information of the capture of the picture, i.e. a camera capture time, as a timestamp into the bit stream.
Subsequently, the created user datagram protocol (UDP)/Internet protocol (IP) packets are transmitted as bursts, using all available bandwidth of the network.
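A sketch of such burst transmission, splitting one marked frame into back-to-back datagrams; the 1400-byte payload size and the example address are assumptions, and the MPEG TS layer is omitted for brevity:

```python
import socket

MTU_PAYLOAD = 1400   # assumed safe UDP payload size; tune to the actual path MTU

def send_burst(sock: socket.socket, addr, frame: bytes) -> None:
    """Send one marked frame as a back-to-back burst of UDP datagrams,
    using whatever bandwidth the path offers instead of pacing to a bitrate."""
    for offset in range(0, len(frame), MTU_PAYLOAD):
        sock.sendto(frame[offset:offset + MTU_PAYLOAD], addr)

# Example usage with a hypothetical server address:
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# send_burst(sock, ("192.0.2.10", 5004), marked_frame)
```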
When the UDP packet burst has been received by the de-packetizer at the receiving end, the de-packetizer is adapted to unpack the video frame from the multiplex, packet by packet. The de-packetizer is further adapted to forward, or hand over, a received video frame, optionally together with a timestamp, to the decoder instantly, without having to await the next video frame.
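The receiver-side counterpart to the sender sketch above, again with the marker bytes and timestamp header as assumptions; each frame is forwarded the moment its end indication is seen:

```python
import socket
import struct

AUD = b"\x00\x00\x00\x01\x09\xf0"   # same end-of-frame marker as in the sender sketch

def receive_frames(sock: socket.socket, decoder_inbox) -> None:
    """Reassemble burst datagrams and forward each frame to the decoder the
    instant its end indication arrives - no waiting for the next frame."""
    buf = b""
    while True:
        datagram, _ = sock.recvfrom(2048)           # one datagram of a burst
        buf += datagram
        while (end := buf.find(AUD)) != -1:
            chunk, buf = buf[:end + len(AUD)], buf[end + len(AUD):]
            ts = struct.unpack("!Q", chunk[:8])[0]  # capture timestamp header
            decoder_inbox.put((ts, chunk[8:]))      # hand over immediately
```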
Figure 3 schematically illustrates receipt of transferred video frames in bursts, related to an embodiment of the present disclosure. This illustration presents each video frame being transferred as a burst, together with its indication of the end of each video frame, here located at the end of each video burst, at times "tF1", "tF2", "tF3", etc.
Here "t" is the time of receipt at the receiving side. At "t0" the first frame starts to arrive, and at "tF1" the first frame end is detected and the frame is passed to the decoder. After decoding, the first frame, now decoded, is presented to the image processing function. Decoding of a frame is here, for simplicity, considered to be an instant process. At time "tF2" the end of the second frame is detected and the second frame is presented to the image processing function, and so on.
The y-axis of Figure 3 represents bandwidth (Bw). The bursts are transferred at the maximum channel bandwidth available. At the time of each end indication, a de-packetizer receiving the video frames may be adapted to hand over, or forward, the video frame in question to a decoder. The end indication herein indicates the end of each video frame, as an indicator located at the end of each video frame. Alternatively, the end of each video frame may be indicated by the size of each video frame.
When the de-packetizer identifies the indication of the end of each video frame, said each video frame is handed to the decoder for decoding of the encoded video frame.
Further, in Figure 3, bursts, or video frames, are indicated as either "I" or "P", where "I"-frames are longer in time compared to "P"-frames. Intra-coded frames are denoted "I"-frames. While an I-frame is intra-coded, i.e. based on the very same I-frame only, "P"-frames are predictively, inter-coded, i.e. based on further frames as well.
Figure 4 schematically illustrates transfer of video frames using a constant bitrate according to prior art techniques. The notation of "I"- and "P"-frames is the same here as for the preceding Figure.
The bitrate of Figure 4 is a constant bitrate, which is lower than the one used in the preceding Figure, in which the video frames are transferred in bursts. For this reason, the video frames are presented wider in Figure 4, as compared to the ones in the preceding Figure.
Similar to Figure 3, "t" is herein the time of receipt at the receiving side. At "t0" the first frame starts to fill a buffer. When the buffer is filled, at "tF1", the first frame, which is not transferred as a burst, is passed to a decoder and typically presented to a human viewer. As above, decoding is herein for simplicity considered to be instant. At time "tF2", the second frame may be presented to the human viewer.
The y-axis of Figure 4 represents bandwidth (Bw), as for Figure 3. The bandwidth in Figure 4 is constant, and since the time positions "tFi" are equidistant, the frame rate of the frames presented to the human viewer is hence constant.
An indication of the end of each video frame, as presented in Figure 3, may instruct when each video frame may be played out by the server, for machine vision or other related services.
Further, it is noted that there may be a significant fluctuation in frame rate at the output of the video decoder. However, image processing functions are typically not sensitive to variations in video frame rate. Optionally, a timestamp of the capture of each picture may be attached to every video frame, enabling recovery of the correct timing of output video frames, if required. Alternatively, the timestamp may be attached to the output of an image processing function within the receiving server, for further usage.
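If a downstream consumer does care about pacing, the attached capture timestamps allow the original frame spacing to be re-created; a sketch, assuming nanosecond monotonic timestamps as in the earlier sketches:

```python
import time

def recover_timing(frames):
    """Re-space (timestamp, frame) pairs by their capture timestamps,
    yielding each frame at its original relative time."""
    first_capture = first_local = None
    for ts, frame in frames:
        if first_capture is None:
            first_capture, first_local = ts, time.monotonic_ns()
        due = first_local + (ts - first_capture)
        time.sleep(max(0, due - time.monotonic_ns()) / 1e9)
        yield frame
```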
The transfer of video frames as illustrated in Figure 3 thus uses transfer in bursts. By using a low-end hardware encoder and a software decoder, an average latency of 25 ms was achieved for a 720p resolution video stream. The latency of 25 ms is well under the 40 ms frame time of a 25 frames per second (fps) video stream.
Figure 5 presents a flow chart of actions within a method of latency efficient streaming of video frames for machine vision over an Internet protocol (IP) network. The method is performed in a system that comprises a video recording device 602 and a server 612 in communication with each other over said IP network. The video recording device captures and encodes video frames, forming a stream of encoded video frames. Within the video recording device, the method comprises, as soon as each video frame is encoded:
Action 52: Adding to each encoded video frame an indication of the end of said each encoded video frame.
Action 54: Sending the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication.
In addition, the method comprises within the server:
Action 56: Receiving the bursts of singular encoded video frames having said indication.
Action 58: For each singular encoded video frame of the stream of encoded video frames, forwarding to a decoder, said each encoded video frame at a moment according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.
The action 52 of adding the indication of the end of each encoded video frame, may comprise adding information about the size of each encoded video frame, and wherein the action 58 of forwarding of said each encoded video frame may comprise for each encoded video frame, forwarding to the decoder, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
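A minimal sketch of this size-based indication, using a hypothetical 4-byte big-endian length prefix (the prefix width is an assumption made for the sketch):

```python
import struct

def add_size_indication(encoded_frame: bytes) -> bytes:
    # Sender side: the length prefix tells the receiver where the frame ends.
    return struct.pack(">I", len(encoded_frame)) + encoded_frame

def frames_from_stream(read):
    """Receiver side; `read(n)` is assumed to return exactly n bytes."""
    while True:
        (size,) = struct.unpack(">I", read(4))
        yield read(size)   # the frame end is reached after `size` bytes
```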
The action 52 of adding the indication of the end of each encoded video frame, may comprise adding the indication to an end part of each encoded video frame, and wherein action 58 of forwarding of said each encoded video frame may comprise, for each encoded video frame, forwarding to the decoder, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
The action 52 of adding the indication to the end part of each encoded video frame may comprise adding an access unit delimiter (AuDL), to a network abstraction layer (NAL) unit following said each encoded video frame, and wherein action 58 of forwarding of said each encoded video frame to the decoder may comprise, for each encoded video frame, forwarding said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
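In an H.264 Annex B byte stream, an access unit delimiter is a NAL unit with nal_unit_type equal to 9, preceded by a 3- or 4-byte start code; a sketch of detecting it follows (H.265 uses a different NAL header, with AUD type 35, not shown here):

```python
import re

AUD_NAL_TYPE_H264 = 9  # nal_unit_type of an H.264 access unit delimiter

def nal_units(annexb: bytes):
    # Split on 4- or 3-byte start codes; the chunk before the first
    # start code is dropped.
    for unit in re.split(b"\x00\x00\x00\x01|\x00\x00\x01", annexb)[1:]:
        if unit:
            yield unit

def is_aud(nal_unit: bytes) -> bool:
    return (nal_unit[0] & 0x1F) == AUD_NAL_TYPE_H264
```

On seeing the AuDL, a de-packetizer would forward the access unit buffered so far to the decoder.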
Within the method, the encoded video frames may be encoded with any one of: H.264 and H.265 codecs.
Within the method, the encoded video frames may comprise intra and inter coded video frames. These intra and inter coded video frames may comprise I-frames and P-frames, respectively.
Optionally, within the method, each encoded video frame in the stream of encoded video frames, may comprise a timestamp of its capture by the video recording device.
The present disclosure also comprises a computer program for latency efficient streaming of video frames for machine vision over an Internet protocol (IP) network. The computer program comprises instructions which, when executed on at least one processor, cause the at least one processor to carry out the method of the actions above.
The present disclosure also comprises a computer-readable storage medium, having stored thereon the computer program above.
Figure 6 schematically presents a system 600 that is capable of latency efficient streaming of video frames for machine vision over an IP network. The system comprises a video recording device 602 and a server 612 located within a cloud. The video recording device 602 and the server 612 are adapted to be in communication with each other over said IP network. The video recording device is adapted to capture and encode video frames, forming a stream of video frames. For this purpose the video recording device 602 may comprise a high definition camera 604.
The video recording device 602 further comprises a processor circuit 606 and a memory 608. The memory has instructions executable by said processor circuit 606, wherein said processor circuit 606, when executing the instructions, is configured, as soon as each video frame is encoded, to add to each encoded video frame an indication of the end of said each encoded video frame. Also, the processor circuit 606 is configured, as soon as each video frame is encoded, to send the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication.
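A sender-side sketch of the burst transmission, again under the assumption of UDP transport and a size-based end indication; packet loss and reordering are not handled in this illustration:

```python
import socket
import struct

MTU_PAYLOAD = 1400  # assumed payload bytes per packet

def send_frame_as_burst(sock: socket.socket, dest, encoded_frame: bytes) -> None:
    # The size prefix is the end indication; the packets of one frame are
    # sent back-to-back, i.e. at the full available channel bandwidth.
    stamped = struct.pack(">I", len(encoded_frame)) + encoded_frame
    for off in range(0, len(stamped), MTU_PAYLOAD):
        sock.sendto(stamped[off:off + MTU_PAYLOAD], dest)
```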
The server 612 within the system 600 comprises another processor circuit 614 and another memory 616. Said another memory 616 has other instructions executable by said another processor circuit 614, wherein said another processor circuit 614 when executing said other instructions is configured to receive the bursts of singular encoded video frames having said indication. In addition, when executing said other instructions the processor circuit 614 is configured to, for each singular encoded video frame of the stream of encoded video frames, forward, to a decoder, said each encoded video frame at a moment, according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server 612.
It is noted that the decoder may be implemented as software as a part of the instructions of the memory 616.
The processor circuit 606, of the video recording device, when executing the instructions may be configured to add information about the size of each encoded video frame, and where said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
The processor circuit 606, of the video recording device 602, when executing the instructions may be configured to add the indication to an end part of each encoded video frame, and wherein said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
The processor circuit 606, of the video recording device 602, when executing the instructions, may be configured to add an access unit delimiter (AuDL) to an end of a network abstraction layer (NAL) unit following said each encoded video frame, and wherein said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
Within the system 600, the encoded video frames may be encoded with any one of: H.264 and H.265 codecs. Within said system 600, the encoded video frames may comprise intra and inter coded video frames. These intra and inter coded video frames may comprise I-frames and P-frames, respectively.
Within the system 600, each encoded video frame in the stream of encoded video frames, may comprise a timestamp of its capture by the video recording device.
Examples and/or embodiments of the present disclosure carry the following advantages: Latency efficient streaming of video frames for machine vision over an IP network is provided. It is an advantage that commodity hardware and software video encoders, as well as decoders, available in generic computers, mobile phones and IoT devices, may be utilized.
By removing buffering between video encoder and video decoder, an advantage of reduced latency in streaming is achieved.
It is further an advantage that a full available bandwidth of the network may be used for the transmission of video frames, as they are transmitted in bursts.
It is an advantageous feature that an indication of the end of each video frame is signaled, since this is used to recognize or detect the end of each video frame transmitted in a burst.
A further advantage is the optional attachment, to video frames, of a timestamp of the picture capture, enabling recovery of the relative timing of video frames.
ABBREVIATIONS
2D two-dimensional
3D three-dimensional
AR augmented reality
AuDL access unit delimiter
CBR constant bitrate
CPU central processing unit
GPU graphics processing unit
IoT Internet of things
IP Internet protocol
MPEG moving picture experts group
MR mixed reality
NAL network abstraction layer
PPS picture parameter set
SLAM simultaneous localization and mapping
SoC system on a chip
TS transport stream
UDP user datagram protocol
VBR variable bitrate
VCL video coding layer
VR virtual reality

Claims

1. A method of latency efficient streaming of video frames for machine vision over an
Internet protocol, IP, network, the method being performed in a system (600) comprising a video recording device (602) and a server (612) in communication with each other over said IP network, where the video recording device captures and encodes video frames, forming a stream of encoded video frames, the method comprising:
- within the video recording device:
- as soon as each video frame is encoded:
- adding (52) to each encoded video frame an indication of the end of said each encoded video frame; and
- sending (54) the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication, and
- within the server:
- receiving (56) the bursts of singular encoded video frames having said indication; and
- for each singular encoded video frame of the stream of encoded video frames, forwarding (58) to a decoder, said each encoded video frame at a moment according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.
2. The method according to claim 1, wherein adding (52) the indication of the end of each encoded video frame, comprises adding information about the size of each encoded video frame, and wherein forwarding (58) said each encoded video frame comprises, for each encoded video frame, forwarding to the decoder, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
3. The method according to claim 1, wherein adding (52) the indication of the end of each encoded video frame, comprises adding the indication to an end part of each encoded video frame, and wherein forwarding (58) said each encoded video frame comprises, for each encoded video frame, forwarding to the decoder, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
4. The method according to claim 3, wherein adding (52) the indication to the end part of each encoded video frame comprises adding an access unit delimiter, AuDL, to a network abstraction layer, NAL, unit following said each encoded video frame, and wherein forwarding (58) said each encoded video frame to the decoder comprises, for each encoded video frame, forwarding said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
5. The method according to claim 3 or 4, wherein the encoded video frames are encoded with any one of: H.264 and H.265 codecs.
6. The method according to any one of claims 1 to 5, wherein the encoded video frames comprise intra and inter coded video frames.
7. The method according to claim 6, wherein the intra and inter coded video frames
comprise l-frames and P-frames, respectively.
8. The method according to any one of claims 1 to 7, wherein each encoded video frame in the stream of encoded video frames, comprises a timestamp of its capture by the video recording device.
9. A computer program for latency efficient streaming of video frames for machine vision over an Internet protocol, IP, network, the computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out the method according to claims 1 to 8.
10. A computer-readable storage medium, having thereon a computer program according to claim 9.
11. A system (600) capable of latency efficient streaming of video frames for machine vision over an Internet protocol, IP, network, the system comprising a video recording device (602) and a server (612) adapted to be in communication with each other over said IP network, wherein the video recording device is adapted to capture and encode video frames, forming a stream of video frames, wherein the video recording device (602) further comprises a processor circuit (606) and a memory (608) having instructions executable by said processor circuit (606),
wherein said processor circuit (606) when executing the instructions is configured to:
- as soon as each video frame is encoded:
- add to each encoded video frame an indication of the end of said each encoded video frame; and
- send the encoded video frames, of the stream of encoded video frames, in
bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication; and
wherein the server (612) comprises another processor circuit (614) and another memory (616) having other instructions executable by said another processor circuit (614), wherein said another processor circuit (614) when executing said other instructions is configured to:
- receive the bursts of singular encoded video frames having said indication; and
- for each singular encoded video frame of the stream of encoded video frames, forward, to a decoder, said each encoded video frame at a moment, according to the indication of the end of said each encoded video frame,
whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server (612).
12. The system (600) according to claim 11, wherein said processor circuit (606) when
executing the instructions is configured to add information about the size of each encoded video frame, and wherein said another processor circuit (614) when executing said other instructions is configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
13. The system (600) according to claim 11, wherein said processor circuit (606) when
executing the instructions is configured to add the indication to an end part of each encoded video frame, and wherein said another processor circuit (614) when executing said other instructions is configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
14. The system (600) according to claim 13, wherein said processor circuit (606) when executing the instructions, is configured to add an access unit delimiter, AuDL, to an end of a network abstraction layer, NAL, unit following said each encoded video frame, and wherein said another processor circuit (614) when executing said other instructions is configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
15. The system (600) according to claim 13 or 14, wherein the encoded video frames are encoded with any one of: H.264 and H.265 codecs.
16. The system (600) according to any one of claims 11 to 15, wherein the encoded video frames comprise intra and inter coded video frames.
17. The system (600) according to claim 16, wherein the intra and inter coded video frames comprise l-frames and P-frames, respectively.
18. The system (600) according to any one of claims 11 to 17, wherein each encoded video frame in the stream of encoded video frames, comprises a timestamp of its capture by the video recording device.
PCT/EP2018/074190 2018-09-07 2018-09-07 Latency efficient streaming of video frames for machine vision over an ip network WO2020048617A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/074190 WO2020048617A1 (en) 2018-09-07 2018-09-07 Latency efficient streaming of video frames for machine vision over an ip network


Publications (1)

Publication Number Publication Date
WO2020048617A1 true WO2020048617A1 (en) 2020-03-12

Family

ID=63524299

Country Status (1)

Country Link
WO (1) WO2020048617A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7421508B2 (en) 2001-02-08 2008-09-02 Nokia Corporation Playback of streamed media
US8005149B2 (en) 2006-07-03 2011-08-23 Unisor Design Services Ltd. Transmission of stream video in low latency
US9077774B2 (en) 2010-06-04 2015-07-07 Skype Ireland Technologies Holdings Server-assisted video conversation
US20120147973A1 (en) * 2010-12-13 2012-06-14 Microsoft Corporation Low-latency video decoding
US20160100196A1 (en) * 2014-10-06 2016-04-07 Microsoft Technology Licensing, Llc Syntax structures indicating completion of coded regions

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEMIRCIN M U ET AL: "Delay-Constrained and R-D Optimized Transrating for High-Definition Video Streaming Over WLANs", IEEE TRANSACTIONS ON MULTIMEDIA, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 10, no. 6, 1 October 2008 (2008-10-01), pages 1155 - 1168, XP011346541, ISSN: 1520-9210, DOI: 10.1109/TMM.2008.2001383 *
WU Y ET AL: "Indication of the end of coded data for pictures and partial-picture regions", 19. JCT-VC MEETING; 17-10-2014 - 24-10-2014; STRASBOURG; (JOINT COLLABORATIVE TEAM ON VIDEO CODING OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ); URL: HTTP://WFTP3.ITU.INT/AV-ARCH/JCTVC-SITE/,, no. JCTVC-S0148, 7 October 2014 (2014-10-07), XP030116917 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11178446B2 (en) 2020-03-09 2021-11-16 Haworth, Inc. Synchronous video content collaboration across multiple clients in a distributed collaboration system
US11910048B2 (en) 2020-03-09 2024-02-20 Haworth, Inc. Synchronizing video content among clients in a collaboration system

Similar Documents

Publication Publication Date Title
EP3806477B1 (en) Video transcoding system and method, apparatus, and storage medium
CN108810636B (en) Video playing method, virtual reality equipment, server, system and storage medium
RU2518383C2 (en) Method and device for reordering and multiplexing multimedia packets from multimedia streams belonging to interrelated sessions
KR102077556B1 (en) System and method for encoding video content using virtual intra-frames
TWI533677B (en) Method, system, and computer-readable media for reducing latency in video encoding and decoding
US20160234522A1 (en) Video Decoding
US20150373075A1 (en) Multiple network transport sessions to provide context adaptive video streaming
US8837605B2 (en) Method and apparatus for compressed video bitstream conversion with reduced-algorithmic-delay
US20220078396A1 (en) Immersive media content presentation and interactive 360° video communication
US10862940B1 (en) Low latency live video on a communication session
US9253063B2 (en) Bi-directional video compression for real-time video streams during transport in a packet switched network
US20100161716A1 (en) Method and apparatus for streaming multiple scalable coded video content to client devices at different encoding rates
US20220329883A1 (en) Combining Video Streams in Composite Video Stream with Metadata
EP3560205A1 (en) Synchronizing processing between streams
KR20150106351A (en) Method and system for playback of motion video
US20190364087A1 (en) Protocol conversion of a video stream
US20120236927A1 (en) Transmission apparatus, transmission method, and recording medium
JP4358129B2 (en) TV conference apparatus, program, and method
US20170347112A1 (en) Bit Stream Switching In Lossy Network
WO2020048617A1 (en) Latency efficient streaming of video frames for machine vision over an ip network
US9363574B1 (en) Video throttling based on individual client delay
US8165161B2 (en) Method and system for formatting encoded video data
US20240098130A1 (en) Mixed media data format and transport protocol
Loonstra Videostreaming with Gstreamer
US20240007603A1 (en) Method, an apparatus and a computer program product for streaming of immersive video

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18765889; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18765889; Country of ref document: EP; Kind code of ref document: A1)