WO2020048617A1 - Latency efficient streaming of video frames for machine vision over an IP network - Google Patents

Latency efficient streaming of video frames for machine vision over an IP network

Info

Publication number
WO2020048617A1
WO2020048617A1 (PCT/EP2018/074190)
Authority
WO
WIPO (PCT)
Prior art keywords
encoded video
video frame
frames
encoded
indication
Application number
PCT/EP2018/074190
Other languages
French (fr)
Inventor
Bence FORMANEK
Peter Vaderna
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Application filed by Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2018/074190
Publication of WO2020048617A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/63 Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643 Communication protocols
    • H04N21/64322 IP
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/238 Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H04N21/23805 Controlling the feeding rate to the network, e.g. by controlling the video pump
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/238 Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H04N21/2381 Adapting the multiplex stream to a specific network, e.g. an Internet Protocol [IP] network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44004 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving video buffer management, e.g. video decoder buffer or video display buffer

Definitions

  • Figure 6 schematically presents a system 600 that is capable of latency efficient streaming of video frames for machine vision over an IP network.
  • the system comprises a video recording device 602 and a server 612 located within a cloud.
  • the video recording device 602 and the server 612 are adapted to be in communication with each other over said IP network.
  • the video recording device is adapted to capture and encode video frames, forming a stream of video frames.
  • the video recording device 602 may comprise a high definition (HD) camera 604.
  • the video recording device 602 further comprises a processor circuit 606 and a memory 608.
  • the memory has instructions executable by said processor circuit 606, wherein said processor circuit 606, when executing the instructions, is configured, as soon as each video frame is encoded, to add to each encoded video frame an indication of the end of said each encoded video frame.
  • the processor circuit 606 is also configured, as soon as each video frame is encoded, to send the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication.
  • the server 612 within the system 600 comprises another processor circuit 614 and another memory 616.
  • Said another memory 616 has other instructions executable by said another processor circuit 614, wherein said another processor circuit 614 when executing said other instructions is configured to receive the bursts of singular encoded video frames having said indication.
  • the processor circuit 614 when executing said other instructions is configured to, for each singular encoded video frame of the stream of encoded video frames, forward, to a decoder, said each encoded video frame at a moment, according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server 612.
  • the decoder may be implemented as software as a part of the instructions of the memory 616.
  • the processor circuit 606, of the video recording device, when executing the instructions may be configured to add information about the size of each encoded video frame, and where said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
  • the processor circuit 606, of the video recording device 602, when executing the instructions may be configured to add the indication to an end part of each encoded video frame, and wherein said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
  • the processor circuit 606, of the video recording device 602, when executing the instructions, may be configured to add an access unit delimiter (AuDL) to an end of a network abstraction layer (NAL) unit following said each encoded video frame, and wherein said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
  • the encoded video frames may be encoded with any one of: H.264 and H.265 codecs.
  • the encoded video frames may comprise intra and inter coded video frames. These intra and inter coded video frames may comprise I-frames and P-frames, respectively.
  • each encoded video frame in the stream of encoded video frames may comprise a timestamp of its capture by the video recording device.
  • Examples and/or embodiments of the present disclosure carry the following advantages: Latency efficient streaming of video frames for machine vision over an IP network is provided. It is an advantage that commodity hardware and software video encoders, as well as decoders, available in generic computers, mobile phones and IoT devices, may be utilized.
  • the full available bandwidth of the network may be used for the transmission of video frames, as they are transmitted in bursts.
  • An advantage is the optional attachment to video frames of a timestamp of the capture of the picture, enabling a recovery of the relative timing of video frames.

Abstract

The present disclosure relates to a method and system (600) capable of latency efficient streaming of video frames for machine vision over an IP network. The system comprises a server (612) and a video recording device (602) that is adapted to capture and encode video frames, forming a stream of encoded video frames. An indication of the end of the video frame is added (52) to each encoded video frame. These encoded video frames are sent (54) in bursts of singular encoded video frames, over the IP network to the server. The server forwards (58) each encoded video frame to a decoder at a moment according to the indication of the end of said each encoded video frame, whereby each encoded video frame is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.

Description

LATENCY EFFICIENT STREAMING OF VIDEO FRAMES FOR MACHINE VISION
OVER AN IP NETWORK
TECHNICAL FIELD
This disclosure relates to latency efficient streaming. More particularly, it relates to a method and a system capable of latency efficient streaming of video frames for machine vision over an Internet protocol (IP) network.
BACKGROUND
Mixed reality (MR) is a technology that allows virtual imagery or objects to be mixed with a real-world physical environment, enabling physical and virtual objects to coexist in real-time. Mixed reality generally encompasses augmented reality (AR), wherein elements in a real-world environment are augmented in a live direct or indirect view of that real-world environment. MR applications today are usually device centric, i.e. it is in the device itself, such as an actual headset, smart glasses or smartphone, where data processing and the applications themselves - detecting physical objects, creating and rendering virtual objects on top of the view of the real world - run.
There are many advantages to moving such an MR environment to the cloud, where applications are implemented e.g. as interconnected microservices, and to AR/MR devices that connect to and view this MR environment. An AR/MR system in the cloud, however, needs sensor information to build a virtual view of the real world. To achieve real-time operation, transmission of sensor information, e.g. video streams taken by cameras, to the cloud needs to have low latency.
Current industrial robot deployments often comprise a robotic arm, sensors and a controller deployed very close to each other. The controller, the sensors and the arm are typically connected via proprietary industrial protocols. Moving robotics and automation algorithms into the cloud involves problems similar to those of AR/MR technology. Control commands and sensor information, such as video data from camera sensors, need to be transmitted between robot clusters and the cloud with low latency.
The output of imaging systems is traditionally used merely for visualization for humans. With the evolved technology of computer vision, automated two-dimensional (2D) and three-dimensional (3D) image processing can now extract additional information. Examples of problems that can be targeted by computer vision are object recognition, motion estimation, simultaneous localization and mapping (SLAM), etc. Due to the complexity of these problems, computer vision and image processing algorithms often require extra hardware resources such as central processing unit (CPU), graphics processing unit (GPU), and memory. It is a common use-case to collect images from multiple sources and perform image processing centrally in the cloud. Because of the large size of such captured image data, typically some form of video compression is used to reduce the required bandwidth.
Real-time machine vision systems require low latency video transmission and processing. Existing technologies in this field either use specialized video encoders and decoders, or special buffering techniques, to achieve low latency video transmission.
US7421508B2 discloses a buffering mechanism for improving playback of streamed media suffering from packet delay variation due to encoding and packetization.
Patent document US8005149B2 concerns transmission of streaming video at low latency, by slicing each frame or field into a predetermined number of slices, which are compressed separately using standard video compression technologies. Compressed slices are compounded together before being transmitted.
From US9077774B2, low latency is achieved by splitting audio and video data streams from a given audio-video conversation using two different transport protocols to send separate streams over a network, and re-syncing them at the other end.
Video streaming systems are traditionally used for visualization for human viewers. It is noted that human sight is sensitive to variations in frame rate. In order to maintain a correct playback rate, video streaming systems often use multiple stages of buffering between video encoder and decoder, especially on packet-switched IP networks.
However, multiple stage buffering often becomes the main source of latency in video streaming.
For low latency streaming, specialized low latency video encoders/decoders may be used. However, these are not available on generic devices, e.g. Internet of things (IoT) devices, mobile phones, etc.
SUMMARY
It is an object of embodiments of the disclosure to address at least some of the issues outlined above, and this object and others are achieved by a system, a method performed therein, a computer program and a computer-readable storage medium for latency efficient streaming of video frames for machine vision over an Internet protocol network.
According to an aspect, the disclosure provides a method of latency efficient streaming of video frames for machine vision over an Internet protocol network. The method is performed in a system comprising a video recording device and a server in communication with each other over said Internet protocol network, where the video recording device captures and encodes video frames, forming a stream of encoded video frames. The method comprises, within the video recording device, as soon as each video frame is encoded, adding to each encoded video frame an indication of the end of said each encoded video frame. The method also comprises sending the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the Internet protocol network to the server, where each encoded video frame comprises said indication. Within the server, the method comprises receiving the bursts of singular encoded video frames having said indication. In addition, the method comprises, for each singular encoded video frame of the stream of encoded video frames, forwarding to a decoder said each encoded video frame at a moment according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.
According to another aspect, the disclosure provides a computer program for latency efficient streaming of video frames for machine vision over an Internet protocol network, comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out the method according to the aspect above.
According to another aspect, the disclosure provides a computer-readable storage medium, having thereon said computer program.
According to yet another aspect, the disclosure provides a system that is capable of latency efficient streaming of video frames for machine vision over an Internet protocol network. The system comprises a video recording device and a server adapted to be in communication with each other over said Internet protocol network. The video recording device is adapted to capture and encode video frames, forming a stream of video frames. The video recording device further comprises a processor circuit and a memory having instructions executable by said processor circuit. The processor circuit, when executing the instructions, is configured to, as soon as each video frame is encoded, add to each encoded video frame an indication of the end of said each encoded video frame. The processor circuit is also configured to, when executing the instructions, send the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the Internet protocol network to the server, where each encoded video frame comprises said indication. The server comprises another processor circuit and another memory that has other instructions executable by said another processor circuit. When executing said other instructions, said another processor circuit is configured to receive the bursts of singular encoded video frames having said indication. For each singular encoded video frame of the stream of encoded video frames, said another processor circuit is further configured to forward, to a decoder, said each encoded video frame at a moment, according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.
Embodiments of the present disclosure comprise the following advantages:
Latency efficient, i.e. essentially real-time, image processing is enabled in the cloud by using existing commodity hardware and software video encoders, as well as decoders, available in generic computers, mobile phones and Internet of things devices.
By removing, or at least reducing, buffering of video frames in a bit stream between an encoder and a decoder, low latency streaming from Internet of things devices or mobile phones is achieved.
It is further an advantage that image processing on a server receiving the bit stream of video frames may be performed in real-time, by using a generic server or a server with a generic graphics processing unit (GPU).
The latency efficient streaming thus achieved is typically a streaming with a latency that is lower than a full frame time, i.e. sub-frame time latency, without the need to use specialized low latency video codec pairs.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments will now be described in more detail, and with reference to the accompanying drawings, in which:
- Figure 1 illustrates a processing path for latency efficient streaming for machine vision over an IP network, related to embodiments of the present disclosure;
- Figure 2 presents a schematic illustration of a bit stream, related to embodiments of the present disclosure;
- Figure 3 schematically illustrates transfer of video frames related to an embodiment of the present disclosure;
- Figure 4 schematically illustrates transfer of video frames related to the prior art;
- Figure 5 presents a flow chart of actions according to embodiments of the present disclosure;
- Figure 6 presents a system, according to embodiments of the present disclosure.
DETAILED DESCRIPTION
In the following description, different embodiments of the disclosure will be described in more detail, with reference to accompanying drawings. For the purpose of explanation and not limitation, specific details are set forth, such as particular examples and techniques in order to provide a thorough understanding.
In order to maintain a correct playback rate, video streaming equipment typically uses several stages of buffering between video encoding and video decoding, especially on packet-switched IP networks.
It is noted that buffering of frames is often the main source of latency in video streaming.
Further, buffering is often used because modern video compression algorithms typically create a varying number of bits for each input image. The size of each encoded image depends on the type of frame. Frames may comprise intra-coded "I"-frames or predictively, inter-coded "P"-frames. Intra-coded frames may have 5-10 times the size in bits of predictively coded frames.
There is also a variation among frames depending on the content of each frame, especially if variable bitrate (VBR) encoding is used, but it also exists in constant bitrate (CBR) mode.
A video streaming device may comprise an output buffer at the output of a video encoder in order to smooth out the variations described above, and to create a CBR stream, in CBR mode, or an average bitrate stream, in VBR mode.
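As a rough, illustrative calculation (the numbers below are assumptions chosen for the example, not taken from the disclosure), smoothing a large intra-coded frame to a constant bitrate is exactly what forces buffering, and hence latency, of several frame times:

```python
# Illustrative numbers only: a 400 kbit/s CBR stream at 25 fps.
BITRATE = 400_000                      # bits per second
FPS = 25.0
frame_slot_ms = 1000.0 / FPS           # 40 ms per frame slot
avg_frame_bits = BITRATE / FPS         # 16,000 bits per average frame
i_frame_bits = 8 * avg_frame_bits      # intra frames are ~5-10x larger (8x here)

# A CBR link drains one average frame per slot, so a large I-frame needs
# several slots to clear the encoder's output buffer:
drain_ms = i_frame_bits / BITRATE * 1000.0
print(f"I-frame drain: {drain_ms:.0f} ms vs. a {frame_slot_ms:.0f} ms frame slot")
# -> I-frame drain: 320 ms vs. a 40 ms frame slot
```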
Data packets containing video data are typically transmitted using a best-effort network protocol, which cannot guarantee a constant transmission delay. Rather, the delay may vary from packet to packet. A varying delay creates delay jitter, and to avoid stalls in playback caused by this delay jitter, a delay jitter buffer may be used at the input of a streaming video receiver.
In addition, there is a buffer at the input of the video decoder. This buffer receives constant (or average) bitrate video from the delay jitter buffer and ensures that there is always a full coded picture in the buffer at every frame time, when the decoder has to decompress the next picture to display.
An accumulated latency of a streaming system may be as large as a few seconds in non-conversational video streams, mainly caused by the size of the delay jitter buffer. In conversational video streams, the accumulated latency may be around 150-200 ms.
A task of the present disclosure is to provide latency efficient streaming, i.e. a streaming that is efficient in terms of latency, and thus provide streaming with a latency that typically is reduced when compared to many other alternative streaming techniques.
In order to achieve this task, it is herein proposed to remove a significant cause of latency in video streaming. This significant cause, or contribution, comprises the buffering between a video encoder output and a decoder input. The idea of the present disclosure is that encoded video frames are transmitted as bursts, as soon as they are ready, by using all available network bandwidth.
In short, a frame end signal is for this reason placed at the end of the transmitted burst by the transmitting part. As soon as the frame end signal is received by the receiving part, the video frame is handed over to a decoder to produce, for instance, a decoded image for an image processing function. In addition, a capture timestamp may be attached to every frame to assist in recovering the correct timing of the frames, if needed.
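A minimal sender-side sketch of this idea is given below. The `camera` and `encoder` objects, the marker bytes and the 8-byte timestamp header are illustrative assumptions, not the patent's exact interfaces (the disclosure multiplexes frames and timestamps into an MPEG transport stream, described further down):

```python
import socket
import struct
import time

# End-of-frame indication appended after each encoded frame; here an H.264
# access unit delimiter (AUD) NAL unit is used, as discussed further below.
END_OF_FRAME = b"\x00\x00\x00\x01\x09\xf0"

def stream_frames(camera, encoder, sock: socket.socket, server_addr) -> None:
    """Capture, encode and burst one frame at a time, as soon as it is ready."""
    while True:
        image = camera.capture()              # one full picture
        capture_ts = time.monotonic_ns()      # optional capture timestamp
        frame = encoder.encode(image)         # one encoded access unit
        # Timestamp header, frame data, then the explicit end indication;
        # the whole frame is sent at once rather than paced to a bitrate.
        burst = struct.pack("!Q", capture_ts) + frame + END_OF_FRAME
        sock.sendto(burst, server_addr)       # real code splits into MTU-sized datagrams
```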
The disclosure comprises a video encoder-decoder pair connected over an IP network.
Figure 1 schematically illustrates a processing path for latency efficient streaming for machine vision over an IP network, related to embodiments of the present disclosure. The processing path involves a video recording device and, located in the cloud, one or more servers.
The video recording device may comprise a high definition (HD) camera, a video encoder and a packetizer. The HD camera captures a video and transfers a stream of frames to the video encoder for encoding of video frames. Once encoded, the video frames are transferred to the packetizer, where they are packetized into IP packets and transmitted to a server in the cloud. The server located in the cloud may comprise a de-packetizer, a video decoder and an image processor.
The present disclosure will be described further below.
It is noted that Figure 1 may be regarded as an illustration of a video encoder-decoder pair connected by an IP network.
As mentioned above, this disclosure proposes to remove a significant cause of latency in video streaming, i.e. the buffering between a video encoder output and a video decoder input.
It is common that hardware video encoders available on a system on a chip (SoC) within mobile phones or Internet of things (IoT) devices, as well as software encoders and decoders available on different systems, can typically only handle a single full image at a time.
Within the video recording device of Figure 1, the video encoder may be adapted to receive a full image captured by the high definition (HD) camera. The video encoder is further adapted to process, i.e. encode, the picture asynchronously, and to forward the encoded video frame to the packetizer as soon as the encoding of the frame is completed. The components in the server, i.e. the de-packetizer, the video decoder and the image processor, are all adapted to process received video frames asynchronously, which for instance means that the de-packetizer may be ready to receive the next video frame while the decoder is busy decompressing the preceding video frame. As soon as the encoded video frame is decoded, the decoded video frame may thus be handed over to an image processing function, which may also be adapted to process video frames asynchronously.
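One way to realize such asynchronous stages is sketched below, under the assumption of a simple thread-per-stage design; the stage functions are placeholders standing in for the Figure 1 server-side components:

```python
import queue
import threading

def run_stage(fn, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """One pipeline stage on its own thread: it can accept frame n+1 while
    the stage downstream is still busy with frame n."""
    while True:
        outbox.put(fn(inbox.get()))

# Placeholders for the server-side components of Figure 1.
depacketize = decode = image_process = lambda frame: frame

rx_q, dec_q, img_q, out_q = (queue.Queue() for _ in range(4))
for fn, q_in, q_out in ((depacketize, rx_q, dec_q),
                        (decode, dec_q, img_q),
                        (image_process, img_q, out_q)):
    threading.Thread(target=run_stage, args=(fn, q_in, q_out), daemon=True).start()
```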
It is noted that in up-to-date compressed video streams, e.g. H.264 and H.265, the beginning of each encoded video frame is signaled. By signaling the beginning of each encoded video frame, a video decoder may detect when each encoded video frame begins, and thus when to start decoding, for instance, a full image.
However, the end of an encoded video frame may only be detected indirectly, by waiting for the beginning of the next video frame and concluding that the end must have passed, since the next, later video frame has arrived. Even though information may thus be gained that the first video frame has arrived, there is no explicit information about the moment in time when the video frame was fully received. Depending on the bitrate mode being used, there may also be residual waiting time between the end of a video frame and the beginning of the next video frame.
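This conventional, indirect delimiting can be pictured as scanning the buffered bit stream for the next Annex B start code; a simplified sketch (a real parser must also inspect NAL unit types, since one frame usually spans several NAL units):

```python
START_CODE = b"\x00\x00\x01"   # H.264/H.265 Annex B start code prefix

def frame_end_by_next_start(buf: bytes, search_from: int) -> int:
    """Indirect end detection: frame n is only known to be complete once the
    start code of frame n+1 appears, costing up to an extra frame time."""
    return buf.find(START_CODE, search_from)   # -1 while frame n+1 is still in flight
```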
In generic use cases, where typically multiple video frames are buffered at the input of a video decoder, adding an extra frame time as required for the buffering will seldom cause problems in terms of latency.
However, increasing the latency by an extra frame time will negatively impact latency-sensitive cases.
Thus, in order to remove, or at least diminish, this extra frame time latency, the packetizer may be adapted to add an end-of-frame signal to the bit stream of encoded video frames.
Generally, an end-of-frame signal may be added by adding to each encoded video frame an indication of the end of said each encoded video frame.
In one embodiment, maintaining compatibility with H.264 video compression, an indication may be added in the form of an access unit delimiter (AuDL) following the network abstraction layer (NAL) unit(s) containing the encoded video frame in question.
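Sketched concretely below; the byte values are the standard H.264 Annex B encoding of an access unit delimiter, while appending it after the frame, rather than before the next one, is the repurposing proposed here:

```python
# An H.264 access unit delimiter: 4-byte Annex B start code, NAL header with
# nal_unit_type = 9 (0x09), and one payload byte 0xF0 carrying
# primary_pic_type = 7 ("any slice type") plus the RBSP trailing bits.
AUD = b"\x00\x00\x00\x01\x09\xf0"

def mark_frame_end(access_unit: bytes) -> bytes:
    """Append the AUD after the frame's NAL units so that it doubles as an
    explicit end-of-frame indication for the receiver."""
    return access_unit + AUD
```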
Figure 2 presents an illustration of a piece of a bit stream comprising encoded video frames. An H.264-encoded video frame is presented, with an access unit delimiter (AuDL) at its beginning. Within the H.264 standard, the AuDL NAL unit is intended to signal the beginning of each video frame.
However, within this disclosure, the AuDL is adapted to signal the end of a video frame. When buffering the bit stream of video frames, a decoder that receives a bit stream of video frames, one after the other, may use the AuDL as a pointer to the beginning of an H.264 access unit, or use the AuDL as an indicator of the end of each video frame. In this case of buffered video frames, the decoder will therefore not experience a difference between the two interpretations of the AuDL NAL unit. The difference may thus be considered to disappear.
The illustration of the piece of a bit stream comprising the encoded video frames of Figure 2 also presents one picture parameter set (PPS) per H.264-encoded video frame, or video frame burst. Further, the singular encoded video frame is here denoted a video coded layer (VCL) video frame.
A packetizer may be adapted to create a Moving Picture Experts Group (MPEG) transport stream (TS) multiplex and to place time information of the capture of the picture, i.e. a camera capture time, as a timestamp into the bit stream.
Subsequently, the created user datagram protocol (UDP)/Internet protocol (IP) packets are transmitted as bursts, using all available bandwidth of the network.
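A sketch of such burst transmission, splitting one marked frame into back-to-back datagrams; the 1400-byte payload size and the example address are assumptions, and the MPEG TS layer is omitted for brevity:

```python
import socket

MTU_PAYLOAD = 1400   # assumed safe UDP payload size; tune to the actual path MTU

def send_burst(sock: socket.socket, addr, frame: bytes) -> None:
    """Send one marked frame as a back-to-back burst of UDP datagrams,
    using whatever bandwidth the path offers instead of pacing to a bitrate."""
    for offset in range(0, len(frame), MTU_PAYLOAD):
        sock.sendto(frame[offset:offset + MTU_PAYLOAD], addr)

# Example usage with a hypothetical server address:
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# send_burst(sock, ("192.0.2.10", 5004), marked_frame)
```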
When the UDP packet burst has been received by the de-packetizer at the receiving end, the de-packetizer is adapted to unpack the video frame from the multiplex, packet by packet. The de-packetizer is further adapted to forward, or hand over, a received video frame, optionally together with a timestamp, to the decoder instantly, without having to await the next video frame.
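The receiver-side counterpart to the sender sketch above, again with the marker bytes and timestamp header as assumptions; each frame is forwarded the moment its end indication is seen:

```python
import socket
import struct

AUD = b"\x00\x00\x00\x01\x09\xf0"   # same end-of-frame marker as in the sender sketch

def receive_frames(sock: socket.socket, decoder_inbox) -> None:
    """Reassemble burst datagrams and forward each frame to the decoder the
    instant its end indication arrives - no waiting for the next frame."""
    buf = b""
    while True:
        datagram, _ = sock.recvfrom(2048)           # one datagram of a burst
        buf += datagram
        while (end := buf.find(AUD)) != -1:
            chunk, buf = buf[:end + len(AUD)], buf[end + len(AUD):]
            ts = struct.unpack("!Q", chunk[:8])[0]  # capture timestamp header
            decoder_inbox.put((ts, chunk[8:]))      # hand over immediately
```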
Figure 3 schematically illustrates receipt of transferred video frames in bursts, related to an embodiment of the present disclosure. This illustration presents each video frame being transferred as a burst, together with its indication of the end of each video frame, here located at the end of each video burst, at times "tF1", "tF2", "tF3", etc.
Here "t" is the time of receipt at the receiving side. At "t0" the first frame starts to arrive, and at "tF1" the first frame end is detected and the frame is passed to the decoder. After decoding, the first frame, now decoded, is presented to the image processing function. Decoding of a frame is here, for simplicity, considered to be an instant process. At time "tF2" the end of the second frame is detected and the second frame is presented to the image processing function, and so on.
The y-axis of Figure 3 represents bandwidth (Bw). The bursts are transferred at the maximum channel bandwidth available. At the time of each end indication, a de-packetizer receiving the video frames may be adapted to hand over, or forward, the video frame in question to a decoder. The end indication herein indicates the end of each video frame, as an indicator located at the end of each video frame. Alternatively, the end of each video frame may be indicated by the size of each video frame.
When the de-packetizer identifies the indication of the end of each video frame, said each video frame is handed to the decoder for decoding of the encoded video frame.
Further, in Figure 3, bursts, or video frames, are indicated as either "I" or "P", where "I"-frames are longer in time compared to "P"-frames. Intra-coded frames are denoted "I"-frames. While an I-frame is intra-coded, i.e. based on the very same I-frame only, "P"-frames are predictively, inter-coded, i.e. based on further frames as well.
Figure 4 schematically illustrates transfer of video frames using a constant bitrate according to prior art techniques. The notation of "I"- and "P"-frames is the same here as for the preceding Figure.
The bitrate of Figure 4 is a constant bitrate, which is lower than the one used in the preceding Figure, in which the video frames are transferred in bursts. For this reason, the video frames are presented wider in Figure 4, as compared to the ones in the preceding Figure.
Similar to Figure 3, "t" is herein the time of receipt at the receiving side. At "t0" the first frame starts to fill a buffer. When the buffer is filled, at "tF1", the first frame, which is not transferred as a burst, is passed to a decoder and typically presented to a human viewer. As above, decoding is herein for simplicity considered to be instant. At time "tF2", the second frame may be presented to the human viewer.
The y-axis of Figure 4 represents bandwidth (Bw), as for Figure 3. The bandwidth in Figure 4 is constant, and since the time positions "tFi" are equidistant, the frame rate of the frames presented to the human viewer is hence constant.
An indication of the end of each video frame, as presented in Figure 3, may instruct when each video frame may be played out by the server, for machine vision or other related services.
Further, it is noted that there may be a significant fluctuation in frame rate at the output of the video decoder. However, image processing functions are typically not sensitive to variations in video frame rate. Optionally, a timestamp of the capture of each picture may be attached to every video frame, enabling recovery of the correct timing of output video frames, if required. Alternatively, the timestamp may be attached to the output of an image processing function within the receiving server, for further usage.
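If a downstream consumer does care about pacing, the attached capture timestamps allow the original frame spacing to be re-created; a sketch, assuming nanosecond monotonic timestamps as in the earlier sketches:

```python
import time

def recover_timing(frames):
    """Re-space (timestamp, frame) pairs by their capture timestamps,
    yielding each frame at its original relative time."""
    first_capture = first_local = None
    for ts, frame in frames:
        if first_capture is None:
            first_capture, first_local = ts, time.monotonic_ns()
        due = first_local + (ts - first_capture)
        time.sleep(max(0, due - time.monotonic_ns()) / 1e9)
        yield frame
```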
The transfer of video frames as illustrated in Figure 3 thus uses transfer in bursts. By using a low-end hardware encoder and a software decoder, an average latency of 25 ms was achieved for a 720p resolution video stream. The latency of 25 ms is well under the 40 ms frame time of a 25 frames per second (fps) video stream.
Figure 5 presents a flow chart of actions within a method of latency efficient streaming of video frames for machine vision over an Internet protocol (IP) network. The method is performed in a system that comprises a video recording device 602 and a server 612 in communication with each other over said IP network. The video recording device captures and encodes video frames, forming a stream of encoded video frames. Within the video recording device, the method comprises, as soon as each video frame is encoded:
Action 52: Adding to each encoded video frame an indication of the end of said each encoded video frame.
Action 54: Sending the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication.
In addition, the method comprises within the server:
Action 56: Receiving the bursts of singular encoded video frames having said indication.
Action 58: For each singular encoded video frame of the stream of encoded video frames, forwarding to a decoder, said each encoded video frame at a moment according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.
The action 52 of adding the indication of the end of each encoded video frame, may comprise adding information about the size of each encoded video frame, and wherein the action 58 of forwarding of said each encoded video frame may comprise for each encoded video frame, forwarding to the decoder, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
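A minimal sketch of this size-based indication, using a hypothetical 4-byte big-endian length prefix (the prefix width is an assumption made for the sketch):

```python
import struct

def add_size_indication(encoded_frame: bytes) -> bytes:
    # Sender side: the length prefix tells the receiver where the frame ends.
    return struct.pack(">I", len(encoded_frame)) + encoded_frame

def frames_from_stream(read):
    """Receiver side; `read(n)` is assumed to return exactly n bytes."""
    while True:
        (size,) = struct.unpack(">I", read(4))
        yield read(size)   # the frame end is reached after `size` bytes
```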
The action 52 of adding the indication of the end of each encoded video frame, may comprise adding the indication to an end part of each encoded video frame, and wherein action 58 of forwarding of said each encoded video frame may comprise, for each encoded video frame, forwarding to the decoder, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
The action 52 of adding the indication to the end part of each encoded video frame may comprise adding an access unit delimiter (AuDL), to a network abstraction layer (NAL) unit following said each encoded video frame, and wherein action 58 of forwarding of said each encoded video frame to the decoder may comprise, for each encoded video frame, forwarding said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
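In an H.264 Annex B byte stream, an access unit delimiter is a NAL unit with nal_unit_type equal to 9, preceded by a 3- or 4-byte start code; a sketch of detecting it follows (H.265 uses a different NAL header, with AUD type 35, not shown here):

```python
import re

AUD_NAL_TYPE_H264 = 9  # nal_unit_type of an H.264 access unit delimiter

def nal_units(annexb: bytes):
    # Split on 4- or 3-byte start codes; the chunk before the first
    # start code is dropped.
    for unit in re.split(b"\x00\x00\x00\x01|\x00\x00\x01", annexb)[1:]:
        if unit:
            yield unit

def is_aud(nal_unit: bytes) -> bool:
    return (nal_unit[0] & 0x1F) == AUD_NAL_TYPE_H264
```

On seeing the AuDL, a de-packetizer would forward the access unit buffered so far to the decoder.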
Within the method, the encoded video frames may be encoded with any one of: H.264 and H.265 codecs.
Within the method, the encoded video frames may comprise intra and inter coded video frames. These intra and inter coded video frames may comprise I-frames and P-frames, respectively.
Optionally, within the method, each encoded video frame in the stream of encoded video frames, may comprise a timestamp of its capture by the video recording device.
The present disclosure also comprises a computer program for latency efficient streaming of video frames for machine vision over an Internet protocol (IP) network. The computer program comprises instructions which, when executed on at least one processor, cause the at least one processor to carry out the method of the actions above.
The present disclosure also comprises a computer-readable storage medium, having stored thereon the computer program above.
Figure 6 schematically presents a system 600 that is capable of latency efficient streaming of video frames for machine vision over an IP network. The system comprises a video recording device 602 and a server 612 located within a cloud. The video recording device 602 and the server 612 are adapted to be in communication with each other over said IP network. The video recording device is adapted to capture and encode video frames, forming a stream of video frames. For this purpose the video recording device 602 may comprise a high definition camera 604.
The video recording device 602 further comprises a processor circuit 606 and a memory 608. The memory has instructions executable by said processor circuit 606, wherein said processor circuit 606, when executing the instructions, is configured, as soon as each video frame is encoded, to add to each encoded video frame an indication of the end of said each encoded video frame. Also, the processor circuit 606 is configured, as soon as each video frame is encoded, to send the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication.
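A sender-side sketch of the burst transmission, again under the assumption of UDP transport and a size-based end indication; packet loss and reordering are not handled in this illustration:

```python
import socket
import struct

MTU_PAYLOAD = 1400  # assumed payload bytes per packet

def send_frame_as_burst(sock: socket.socket, dest, encoded_frame: bytes) -> None:
    # The size prefix is the end indication; the packets of one frame are
    # sent back-to-back, i.e. at the full available channel bandwidth.
    stamped = struct.pack(">I", len(encoded_frame)) + encoded_frame
    for off in range(0, len(stamped), MTU_PAYLOAD):
        sock.sendto(stamped[off:off + MTU_PAYLOAD], dest)
```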
The server 612 within the system 600 comprises another processor circuit 614 and another memory 616. Said another memory 616 has other instructions executable by said another processor circuit 614, wherein said another processor circuit 614 when executing said other instructions is configured to receive the bursts of singular encoded video frames having said indication. In addition, when executing said other instructions the processor circuit 614 is configured to, for each singular encoded video frame of the stream of encoded video frames, forward, to a decoder, said each encoded video frame at a moment, according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server 612.
It is noted that the decoder may be implemented as software as a part of the instructions of the memory 616.
The processor circuit 606, of the video recording device, when executing the instructions may be configured to add information about the size of each encoded video frame, and where said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
The processor circuit 606, of the video recording device 602, when executing the instructions may be configured to add the indication to an end part of each encoded video frame, and wherein said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
The processor circuit 606, of the video recording device 602, when executing the instructions, may be configured to add an access unit delimiter (AuDL) to an end of a network abstraction layer (NAL) unit following said each encoded video frame, and wherein said another processor circuit 614, of the server 612, when executing said other instructions may be configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
Within the system 600, the encoded video frames may be encoded with any one of: H.264 and H.265 codecs. Within said system 600, the encoded video frames may comprise intra and inter coded video frames. These intra and inter coded video frames may comprise I-frames and P-frames, respectively.
Within the system 600, each encoded video frame in the stream of encoded video frames, may comprise a timestamp of its capture by the video recording device.
Examples and/or embodiments of the present disclosure carry the following advantages: Latency efficient streaming of video frames for machine vision over an IP network is provided. It is an advantage that commodity hardware and software video encoders, as well as decoders, available in generic computers, mobile phones and IoT devices, may be utilized.
By removing buffering between video encoder and video decoder, an advantage of reduced latency in streaming is achieved.
It is further an advantage that a full available bandwidth of the network may be used for the transmission of video frames, as they are transmitted in bursts.
It is an advantageous feature that an indication of the end of each video frame is signaled, since this is used to recognize or detect the end of each video frame transmitted in a burst.
A further advantage is the optional attachment, to video frames, of a timestamp of the picture capture, enabling recovery of the relative timing of video frames.
ABBREVIATIONS
2D two-dimensional
3D three-dimensional
AR augmented reality
AuDL access unit delimiter
CBR constant bitrate
CPU central processing unit
GPU graphics processing unit
IoT Internet of things
IP Internet protocol
MPEG moving picture experts group
MR mixed reality
NAL network abstraction layer
PPS picture parameter set
SLAM simultaneous localization and mapping
SoC system on a chip
TS transport stream
UDP user datagram protocol
VBR variable bitrate
VCL video coding layer
VR virtual reality

Claims

1. A method of latency efficient streaming of video frames for machine vision over an
Internet protocol, IP, network, the method being performed in a system (600) comprising a video recording device (602) and a server (612) in communication with each other over said IP network, where the video recording device captures and encodes video frames, forming a stream of encoded video frames, the method comprising:
- within the video recording device:
- as soon as each video frame is encoded:
- adding (52) to each encoded video frame an indication of the end of said each encoded video frame; and
- sending (54) the encoded video frames, of the stream of encoded video frames, in bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication, and
- within the server:
- receiving (56) the bursts of singular encoded video frames having said indication; and
- for each singular encoded video frame of the stream of encoded video frames, forwarding (58) to a decoder, said each encoded video frame at a moment according to the indication of the end of said each encoded video frame, whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server.
2. The method according to claim 1, wherein adding (52) the indication of the end of each encoded video frame, comprises adding information about the size of each encoded video frame, and wherein forwarding (58) said each encoded video frame comprises, for each encoded video frame, forwarding to the decoder, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
3. The method according to claim 1, wherein adding (52) the indication of the end of each encoded video frame, comprises adding the indication to an end part of each encoded video frame, and wherein forwarding (58) said each encoded video frame comprises, for each encoded video frame, forwarding to the decoder, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
4. The method according to claim 3, wherein adding (52) the indication to the end part of each encoded video frame comprises adding an access unit delimiter, AuDL, to a network abstraction layer, NAL, unit following said each encoded video frame, and wherein forwarding (58) said each encoded video frame to the decoder comprises, for each encoded video frame, forwarding said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
5. The method according to claim 3 or 4, wherein the encoded video frames are encoded with any one of: H.264 and H.265 codecs.
6. The method according to any one of claims 1 to 5, wherein the encoded video frames comprise intra and inter coded video frames.
7. The method according to claim 6, wherein the intra and inter coded video frames
comprise l-frames and P-frames, respectively.
8. The method according to any one of claims 1 to 7, wherein each encoded video frame in the stream of encoded video frames, comprises a timestamp of its capture by the video recording device.
9. A computer program for latency efficient streaming of video frames for machine vision over an Internet protocol, IP, network, the computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out the method according to claims 1 to 8.
10. A computer-readable storage medium, having thereon a computer program according to claim 9.
11. A system (600) capable of latency efficient streaming of video frames for machine vision over an Internet protocol, IP, network, the system comprising a video recording device (602) and a server (612) adapted to be in communication with each other over said IP network, wherein the video recording device is adapted to capture and encode video frames, forming a stream of video frames, wherein the video recording device (602) further comprises a processor circuit (606) and a memory (608) having instructions executable by said processor circuit (606),
wherein said processor circuit (606) when executing the instructions is configured to:
- as soon as each video frame is encoded:
- add to each encoded video frame an indication of the end of said each encoded video frame; and
- send the encoded video frames, of the stream of encoded video frames, in
bursts, where each burst comprises a singular encoded video frame, over the IP network to the server, where each encoded video frame comprises said indication; and
wherein the server (612) comprises another processor circuit (614) and another memory (616) having other instructions executable by said another processor circuit (614), wherein said another processor circuit (614) when executing said other instructions is configured to:
- receive the bursts of singular encoded video frames having said indication; and
- for each singular encoded video frame of the stream of encoded video frames, forward, to a decoder, said each encoded video frame at a moment, according to the indication of the end of said each encoded video frame,
whereby each encoded video frame, in the stream of encoded video frames, is forwarded to the decoder as soon as the end of each encoded video frame has been received by the server (612).
12. The system (600) according to claim 11, wherein said processor circuit (606) when
executing the instructions is configured to add information about the size of each encoded video frame, and wherein said another processor circuit (614) when executing said other instructions is configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the size of each encoded video frame.
13. The system (600) according to claim 11, wherein said processor circuit (606) when
executing the instructions is configured to add the indication to an end part of each encoded video frame, and wherein said another processor circuit (614) when executing said other instructions is configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the end of each encoded video frame has been received, according to the indication at the end part of each encoded video frame.
14. The system (600) according to claim 13, wherein said processor circuit (606) when executing the instructions, is configured to add an access unit delimiter, AuDL, to an end of a network abstraction layer, NAL, unit following said each encoded video frame, and wherein said another processor circuit (614) when executing said other instructions is configured to forward to the decoder, for each encoded video frame, said each encoded video frame at the moment when the AuDL of each encoded video frame has been received.
15. The system (600) according to claim 13 or 14, wherein the encoded video frames are encoded with any one of: H.264 and H.265 codecs.
16. The system (600) according to any one of claims 11 to 15, wherein the encoded video frames comprise intra and inter coded video frames.
17. The system (600) according to claim 16, wherein the intra and inter coded video frames comprise l-frames and P-frames, respectively.
18. The system (600) according to any one of claims 11 to 17, wherein each encoded video frame in the stream of encoded video frames, comprises a timestamp of its capture by the video recording device.
PCT/EP2018/074190 2018-09-07 2018-09-07 Latency efficient streaming of video frames for machine vision over an ip network WO2020048617A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/074190 WO2020048617A1 (en) 2018-09-07 2018-09-07 Latency efficient streaming of video frames for machine vision over an ip network


Publications (1)

Publication Number Publication Date
WO2020048617A1 true WO2020048617A1 (en) 2020-03-12

Family

ID=63524299

Country Status (1)

Country Link
WO (1) WO2020048617A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7421508B2 (en) 2001-02-08 2008-09-02 Nokia Corporation Playback of streamed media
US8005149B2 (en) 2006-07-03 2011-08-23 Unisor Design Services Ltd. Transmission of stream video in low latency
US9077774B2 (en) 2010-06-04 2015-07-07 Skype Ireland Technologies Holdings Server-assisted video conversation
US20120147973A1 (en) * 2010-12-13 2012-06-14 Microsoft Corporation Low-latency video decoding
US20160100196A1 (en) * 2014-10-06 2016-04-07 Microsoft Technology Licensing, Llc Syntax structures indicating completion of coded regions

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEMIRCIN M U ET AL: "Delay-Constrained and R-D Optimized Transrating for High-Definition Video Streaming Over WLANs", IEEE TRANSACTIONS ON MULTIMEDIA, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 10, no. 6, 1 October 2008 (2008-10-01), pages 1155 - 1168, XP011346541, ISSN: 1520-9210, DOI: 10.1109/TMM.2008.2001383 *
WU Y ET AL: "Indication of the end of coded data for pictures and partial-picture regions", 19. JCT-VC MEETING; 17-10-2014 - 24-10-2014; STRASBOURG; (JOINT COLLABORATIVE TEAM ON VIDEO CODING OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ); URL: HTTP://WFTP3.ITU.INT/AV-ARCH/JCTVC-SITE/,, no. JCTVC-S0148, 7 October 2014 (2014-10-07), XP030116917 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11178446B2 (en) 2020-03-09 2021-11-16 Haworth, Inc. Synchronous video content collaboration across multiple clients in a distributed collaboration system
US11910048B2 (en) 2020-03-09 2024-02-20 Haworth, Inc. Synchronizing video content among clients in a collaboration system

Similar Documents

Publication Publication Date Title
EP3806477B1 (en) Video transcoding system and method, apparatus, and storage medium
CN108810636B (en) Video playing method, virtual reality equipment, server, system and storage medium
RU2518383C2 (en) Method and device for reordering and multiplexing multimedia packets from multimedia streams belonging to interrelated sessions
KR102077556B1 (en) System and method for encoding video content using virtual intra-frames
TWI533677B (en) Method, system, and computer-readable media for reducing latency in video encoding and decoding
US20160234522A1 (en) Video Decoding
US20150373075A1 (en) Multiple network transport sessions to provide context adaptive video streaming
US8837605B2 (en) Method and apparatus for compressed video bitstream conversion with reduced-algorithmic-delay
US20220078396A1 (en) Immersive media content presentation and interactive 360° video communication
US10862940B1 (en) Low latency live video on a communication session
US9253063B2 (en) Bi-directional video compression for real-time video streams during transport in a packet switched network
US20100161716A1 (en) Method and apparatus for streaming multiple scalable coded video content to client devices at different encoding rates
US20220329883A1 (en) Combining Video Streams in Composite Video Stream with Metadata
EP3560205A1 (en) Synchronizing processing between streams
KR20150106351A (en) Method and system for playback of motion video
US20190364087A1 (en) Protocol conversion of a video stream
US20120236927A1 (en) Transmission apparatus, transmission method, and recording medium
JP4358129B2 (en) TV conference apparatus, program, and method
US20170347112A1 (en) Bit Stream Switching In Lossy Network
WO2020048617A1 (en) Latency efficient streaming of video frames for machine vision over an ip network
US9363574B1 (en) Video throttling based on individual client delay
US8165161B2 (en) Method and system for formatting encoded video data
US20240098130A1 (en) Mixed media data format and transport protocol
Loonstra Videostreaming with Gstreamer
US20240007603A1 (en) Method, an apparatus and a computer program product for streaming of immersive video

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18765889; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18765889; Country of ref document: EP; Kind code of ref document: A1)