CN115989530A - Generating and processing video data - Google Patents

Generating and processing video data

Info

Publication number
CN115989530A
Authority
CN
China
Prior art keywords
video
video frame
video data
model
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080103572.6A
Other languages
Chinese (zh)
Inventor
塞利姆·伊金
加纳·海奇拉
大卫·林德罗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Telefonaktiebolaget LM Ericsson AB
Original Assignee
Telefonaktiebolaget LM Ericsson AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget LM Ericsson AB
Publication of CN115989530A

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/647Control signaling between network components and server or clients; Network processes for video distribution between server and clients, e.g. controlling the quality of the video stream, by dropping packets, protecting content from unauthorised alteration within the network, monitoring of network load, bridging between two different networks, e.g. between IP and wireless
    • H04N21/64723Monitoring of network processes or resources, e.g. monitoring of network load
    • H04N21/6473Monitoring network processes errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/98Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G06V10/993Evaluation of the quality of the acquired pattern
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/154Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/587Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/89Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving methods or arrangements for detection of transmission errors at the decoder
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/94Vector quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44209Monitoring of downstream path of the transmission network originating from a server, e.g. bandwidth variations of a wireless network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/4425Monitoring of client processing errors or hardware failure

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Probability & Statistics with Applications (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments disclosed herein relate to a method and apparatus for generating video frames when there is a change in the rate of received video data. In one embodiment, a method of processing video data is provided, the method comprising: generating a video frame using the received video data (510); encoding the video frame into a latent vector using an encoder portion of a generative model (520); modifying the latent vector (525); and, in response to determining a reduction in the generation of video frames using the received video data (515), decoding the modified latent vector using a decoder portion of the generative model to generate a new video frame (530).

Description

Generating and processing video data
Technical Field
Embodiments disclosed herein relate to a method and apparatus for generating a video frame when there is a change in the rate of received video data.
Background
Temporal interference in video streams on smartphones, virtual reality (VR) headsets, smart glasses, and other devices is a potential contributing factor that negatively impacts end-user quality of experience (QoE). Temporal interference is particularly critical in the context of augmented reality and virtual reality (AR/VR) applications due to the stringent requirements related to motion-to-photon (MTP) time, i.e., the delay between a user action and its effect on the display. For a head-mounted display (HMD) attached to the user's head, the MTP time can be as short as 10 to 20 ms, and the content being displayed on the device needs to adapt to the head motion accordingly and almost instantaneously.
NVIDIA's DLSS (Deep Learning Super Sampling) 2.0 uses an image upscaling algorithm (e.g., from 1080p to 4K). Where the target application is a game, it uses artificial intelligence (AI) to improve the image quality. NVIDIA built a single optimized universal neural network, allowing more upscaling options, and used a fully synthetic training set for the deep neural network. The approach integrates real-time motion vector information and re-projects the previous frame. These motion vectors must be provided by game developers to NVIDIA's DLSS platform; the approach relies on prior frames already received at the device and improves those frames by upscaling their quality. This allows a high image quality to be simulated even if the reception of video or image data is reduced due to poor connectivity or other problems resulting in temporal interference in the video stream.
Other solutions exist, such as caching content for reuse within a local Content Delivery Network (CDN), and audio content generation using a recurrent neural network; see, for example, Sabet, S., Schmidt, S., Zadtootaghaj, S., Griwodz, C., and Möller, S., "Delay Sensitivity Classification: Towards a Deeper Understanding of the Influence of Delay on Cloud Gaming QoE", https://arxiv.org/ftp/arxiv/papers/2004/2004.05609.pdf
However, these approaches can only handle short temporary interruptions or degradations in the content stream and require fast and complex immediate freeze-handling mechanisms. What is needed is a mechanism that can handle temporary interruptions of content streams of longer duration.
Disclosure of Invention
According to some embodiments described herein, a method of processing video data is provided. The method comprises: generating a video frame using the received video data; and, in response to determining a reduction in the generation of video frames using the received video data, encoding the video frame into a latent vector using an encoder portion of a generative model. The latent vector is modified and decoded using a decoder portion of the generative model to generate a new video frame.
This allows video frames to continue to be generated even if the connection streaming the video is severely interrupted or degraded. By avoiding video freezing and instead displaying synthesized or artificially generated video frames, the user's quality of experience in real-time video streaming is improved in various applications such as multiplayer video games and AR.
According to some embodiments described herein, there is provided an apparatus for processing video data. The apparatus comprises a processor and a memory containing instructions executable by the processor, whereby the apparatus is operative to generate a video frame using received video data and, in response to determining a reduction in the generation of video frames using the received video data, encode the video frame into a latent vector using an encoder portion of a generative model. The latent vector is modified and decoded using a decoder portion of the generative model to generate a new video frame.
According to some embodiments described herein, a method of processing video data is provided. The method comprises: receiving a video frame from a first device (640), encoding the video frame into a latent vector using an encoder portion of a first generative model, modifying the latent vector, and decoding the modified latent vector using a decoder portion of the first generative model to generate a new video frame. The new video frame is forwarded to the first device.
According to some embodiments described herein, there is provided an apparatus for processing video data. The apparatus comprises a processor and a memory containing instructions executable by the processor, whereby the apparatus is operable to: receive a video frame from a first device (640), encode the video frame into a latent vector using an encoder portion of a first generative model, modify the latent vector, and decode the modified latent vector using a decoder portion of the first generative model to generate a new video frame. The new video frame is forwarded to the first device.
According to some embodiments described herein, there is provided a computer program comprising instructions which, when executed on a processor, cause the processor to perform the method described herein. The computer program may be stored on a non-transitory computer readable medium.
Drawings
For a better understanding of embodiments of the present disclosure, and to show how the same may be carried into effect, reference will now be made, by way of example only, to the accompanying drawings in which:
FIG. 1 is a schematic diagram of a system for transmitting video data according to some embodiments;
fig. 2 is a schematic diagram of an apparatus for receiving processed video data according to some embodiments;
FIG. 3 is a schematic diagram illustrating a generative model for processing video data according to an embodiment;
FIG. 4 is a schematic diagram illustrating the use of the generative model of FIG. 3 to generate a new video frame;
fig. 5 is a flow diagram of a method of processing video data according to an embodiment;
fig. 6 is a flow diagram of signaling and events for a method of processing video data according to an embodiment;
fig. 7 is a flow diagram of signaling and events for a method of processing video data according to another embodiment; and
fig. 8 is a schematic diagram illustrating an architecture of an apparatus according to an embodiment.
Detailed Description
Generally, all terms used herein are to be interpreted according to their ordinary meaning in the relevant art, unless a different meaning is explicitly given and/or is implied by the context. All references to "a/an/the element, device, component, means, step, etc." are to be interpreted openly as referring to at least one instance of the element, device, component, means, step, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless a step is explicitly described as following or preceding another step, and/or it is implicit that a step must follow or precede another step. Any feature of any embodiment disclosed herein may be applied to any other embodiment, where appropriate. Likewise, any advantage of any embodiment may apply to any other embodiment, and vice versa. Other objects, features, and advantages of the appended embodiments will become apparent from the following description.
Specific details are set forth below, such as particular embodiments or examples for purposes of explanation rather than limitation. It will be understood by those skilled in the art that other examples may be employed other than these specific details. In some instances, detailed descriptions of well-known methods, nodes, interfaces, circuits, and devices are omitted so as not to obscure the description with unnecessary detail. Those skilled in the art will appreciate that the functions described may be implemented in one or more nodes using hardware circuitry (e.g., analog and/or discrete logic gates interconnected to perform a specialized function, ASICs, PLAs, etc.) and/or using software programs and data in conjunction with one or more digital microprocessors or general purpose computers. Nodes that communicate using the air interface also have appropriate radio communication circuitry. Moreover, the techniques may also be considered to be embodied entirely within any form of computer readable memory (e.g., solid-state memory, magnetic or optical disk) containing an appropriate set of computer instructions that would cause a processor to carry out the techniques described herein, where appropriate.
Hardware implementations may include, but are not limited to, digital signal processor (DSP) hardware, reduced instruction set processors, hardware (e.g., digital or analog) circuits (including, but not limited to, application-specific integrated circuits (ASICs) and/or field-programmable gate arrays (FPGAs)), and, where applicable, state machines capable of performing such functions. The memory may be used for storing temporary variables, storing and transferring data between processes, non-volatile configuration settings, standard message passing formats, and the like. Volatile and non-volatile storage may take any suitable form, including random access memory (RAM) implemented as metal oxide semiconductor (MOS) or integrated circuits (ICs), and storage devices implemented as hard disk drives and flash memory.
Embodiments described herein relate to methods and apparatus for processing video data, including handling an interruption to an incoming video data stream used to generate video frames by generating new frames using a generative model such as a variational autoencoder (VAE). A video frame generated using the received video data may be represented as a latent vector in a latent space by encoding the video frame using an encoder portion of the generative model. By modifying the latent vector representing the video frame, the decoder portion of the generative model may be used to generate a new video frame by decoding the modified latent vector.
This process may be triggered by an actual or predicted degradation in generating frames using received video data, e.g., due to problems with the connection over which the video data is received. However, in some embodiments, latent vectors representing video frames generated using the received data may be encoded and decoded before any such trigger, e.g., for continuous training of the generative model. When the video data stream is interrupted or degraded below a threshold, a device employing this method may then switch to artificially generated video frames, i.e., video frames generated by modifying the latent vectors.
Fig. 1 is a schematic diagram of a system for transmitting video or other content data according to an embodiment. The system 100 includes a transmitter 105 (e.g., a third generation partnership project (3GPP) base station or a WiFi access point) coupled to a receiver 110 (e.g., a smartphone) via a wireless connection 115a, the receiver 110 being coupled via a second wireless connection 115b (e.g., Bluetooth™) to a head-mounted display (HMD) or headset 120. Another headset 160 may alternatively be connected directly to the transmitter 105. The headsets 120, 160 may be used to display video frames generated from video data received from the transmitter 105. Alternatively or additionally, other devices may be used to display video frames generated from video data received from the transmitter, such as the smartphone 110, a larger non-head-mounted display monitor, or a television, camera, or other display device. Similarly, alternative connections such as cable, visible light communication, power line, or satellite may be employed.
Each headset 120, 160 or other device for generating video frames is associated with a generative model 145, 165 (e.g., a variational autoencoder). In some embodiments, video data may be received and used to generate a video frame in an intermediary device 110, which forwards the frame to a second device 120, in which case the intermediary device is associated with the generative model.
A video or content server 125 coupled to the transmitter is arranged to transmit video data to the headsets 120, 160 or other device 110 and may include a content library 130, the content library 130 including video data for video games, video programs, and other video content that may be pre-recorded or generated by the server 125. A user of the video content may be able to interact with and change the content; for example, a user moving their headset 120 while watching a video game may cause the video content to change as a result of the movement. The actions of other users playing the same game may also cause the first user's video content to change.
According to an embodiment, the video server 125 further comprises a processor 127 and a memory 128, the memory 128 containing instructions 129 for operating the server. The server 125 may also include one or more generative models 135, each generative model 135 including an encoder 140a and a decoder 140b. Generative models 135 may be associated with individual users and/or with video content such as individual games, and may be used to generate video frames. A generative model 135 may be used to generate video frames for a particular user who sends video frames to the server, or the generative model 135 may be forwarded to the user's device so that the user's device or headset 120 stores and uses the generative model 145 to generate the video frames at the device. Generative models 135, 145 on a server or device will initially be pre-trained, but may be further trained using video frames from a game or other content. For example, where the generative model is an autoencoder, frames are encoded and decoded by the autoencoder, and the decoded frames are compared to the original frames in order to provide feedback to the autoencoder to continue improving its operation.
The device 110 or headset 120 receives video data in a stream at a data rate sufficient to generate video frames for display at a particular resolution and frame rate. The connections 115a, 115b between the transmitter 105 and the devices or headsets 110, 120, 160 need to provide sufficient bandwidth below a threshold time delay to achieve this. However, some connections, such as certain wireless connections, may be subject to degradation or even interruption, which may affect the video stream and may make it difficult to generate video frames. Some methods for mitigating this situation include upscaling video frame images based on more limited video data; however, these methods can only accommodate limited degradation or brief interruptions of the connection.
Embodiments allow the video frames to continue to be generated even if the connection is severely interrupted or degraded. This may be accomplished using generative models 135, 145, 165 as described in more detail below.
A schematic diagram of an apparatus according to an embodiment is shown in fig. 2 and may correspond to the device 110 or headset 120, 160 of fig. 1. The apparatus 200 comprises a receiver 223 for receiving video data over the connections 115a, 115b and coupled to a video frame generator 227 for generating video frames using the received video data. The receiver 223 may be a 3GPP receiver such as LTE or 5G, a WiFi or bluetooth receiver, or any suitable circuit or software component. The video frame generator 227 may be an MPEG2/4 codec, or any suitable circuit or software component. The video frame generator 227 is coupled to a display driver 233, which display driver 233 drives one or more display screens 237 to display the generated video frames to a user.
The receiver 220 also includes a generative model 245, which is coupled to the video frame generator 227 and the display driver 233. The generative model 245 may be a model having a convolutional network as an encoder and a deconvolutional network as a decoder, such as a variational autoencoder (VAE) having an encoder 250a and a decoder 250b. Other types of generative models may alternatively be used, such as generative adversarial networks (GANs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), and other types of machine learning. The generative model 245 may be used to generate a composite video frame using the video frames generated by the video frame generator 227. The composite video frame may be forwarded to the display driver for display on the display screen 237.
The receiver 220 also includes circuitry and/or software 243 for determining a reduction in generating video frames using the received video data. This may be determined using performance metrics of the connection used to receive the video data (e.g., received packet delay variation or jitter, signal strength, received throughput, e.g., in bits/second, and other known communication performance metrics), which may be provided by the receiver 223. Alternatively or additionally, it may be determined from a reduction in a performance metric associated with generating the video frames (e.g., an inter-frame delay or the size of a temporal gap between successive frames). In another alternative, time intervals or other performance metrics associated with the display of successive frames on the display screen 237 may be monitored. These performance parameters may be monitored by the degradation detector 243, and if they exceed a threshold, the controller 247 is notified to switch from displaying the video frames generated by the video frame generator 227 to video frames generated by the generative model 245. The degradation detector 243 may also predict the degradation of generated video frames, for example by predicting an increase in packet delay and/or video frame delay. If the detected/predicted inter-frame time interval or other metric exceeds a threshold (e.g., 20 ms for VR/AR applications), a switch to generating composite video frames is triggered.
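By way of illustration, a minimal sketch of such a threshold-based trigger is given below. The 20 ms threshold, the window length, and all class and method names are illustrative assumptions rather than values or interfaces taken from this disclosure; a learned predictor could replace the simple trend estimate.

```python
from collections import deque


class DegradationDetector:
    """Sketch of a degradation detector (cf. 243): flags a switch to composite
    frame generation when the observed or extrapolated inter-frame interval
    exceeds a threshold. All names and values are illustrative."""

    def __init__(self, threshold_ms: float = 20.0, window: int = 10):
        self.threshold_ms = threshold_ms
        self.intervals = deque(maxlen=window)  # recent inter-frame intervals (ms)

    def observe(self, inter_frame_ms: float) -> None:
        self.intervals.append(inter_frame_ms)

    def should_switch_to_composite(self) -> bool:
        if not self.intervals:
            return False
        latest = self.intervals[-1]
        # Crude linear extrapolation of the next interval from the window.
        predicted_next = latest + (latest - self.intervals[0]) / max(len(self.intervals) - 1, 1)
        return latest > self.threshold_ms or predicted_next > self.threshold_ms
```

In use, the controller (cf. 247) would call observe() for each displayed frame and switch the display source when should_switch_to_composite() returns True.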
The degradation detector 243 may be implemented as a supervised learning model. The supervised learning model may be pre-trained. The supervised learning model may, for example, receive as input a time series of inter-frame time intervals between displayed frames. Various models may be employed, such as recurrent neural networks (RNNs) (e.g., Long Short-Term Memory (LSTM) networks or the WaveNet architecture), and random forest or other learning algorithms may also be implemented.
In an embodiment, the input metric of the inter-frame delay prediction model may be a time series of consecutive inter-frame time delays observed in the period prior to the time of prediction or detection. The input metrics are fed into the model with a sliding window, as defined in the WaveNet RNN, and the label is set to a discretized version of the next slot value based on the minimum required inter-frame time: if the value is above the threshold it is set to 0, otherwise it is set to 1. Once the model is trained on many devices, it can be deployed on the device, where the prediction model will begin to execute and run inference. A sketch of this labelling and a simple predictor is given below.
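The following sketch (PyTorch) shows the sliding-window labelling and an LSTM-based classifier standing in for the WaveNet-style network; the window length, threshold, and network size are illustrative assumptions.

```python
import torch
import torch.nn as nn


def make_dataset(delays_ms, window=16, threshold_ms=20.0):
    """Build (sliding window, label) pairs from a series of inter-frame delays:
    label 0 if the next delay exceeds the threshold, else 1 (as in the text)."""
    xs, ys = [], []
    for i in range(len(delays_ms) - window):
        xs.append(delays_ms[i:i + window])
        ys.append(0.0 if delays_ms[i + window] > threshold_ms else 1.0)
    x = torch.tensor(xs, dtype=torch.float32).unsqueeze(-1)  # (N, window, 1)
    y = torch.tensor(ys, dtype=torch.float32)
    return x, y


class DelayPredictor(nn.Module):
    """LSTM classifier predicting whether the next inter-frame delay stays
    below the threshold (1) or breaches it (0)."""

    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        _, (h, _) = self.lstm(x)                      # h: (layers, batch, hidden)
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)
```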
In an embodiment, the inputs to the VAE model may be all possible video frames that can be displayed on the user device, and the model is trained as follows. The input video frames are embedded in a matrix representation (MxN) and scaled; the model then compresses each image into a 3x1 latent representation using a convolutional network, and regenerates the image from a noisy (with some epsilon standard deviation) latent space through a deconvolutional network. The aim is then to minimize the loss between the original video frame and the regenerated video frame. The generative portion of the model with the deconvolutional network (e.g., the VAE decoder) is extracted and deployed as a generator model, which then has the capability to regenerate an image from the latent space. Neighbors in the latent space represent images that are similar to each other, which allows temporal continuity of the video frames, since a high degree of dependency between successive frames is expected. The next video frame can then be generated by a random or more systematic walk (no jumps, and with continuity), as sketched below.
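A minimal sketch of such a convolutional VAE is given below (PyTorch). The 64x64 input size, channel counts, and three-dimensional latent space are illustrative assumptions rather than values taken from this disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvVAE(nn.Module):
    """Sketch of the generative model: a convolutional encoder mapping a frame
    to a small latent vector, and a deconvolutional decoder regenerating the
    frame from a noisy sample of that latent space."""

    def __init__(self, latent_dim=3):
        super().__init__()
        # Encoder: 3x64x64 frame -> feature map -> (mu, log_var)
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # -> 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # -> 16x16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # -> 8x8
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)
        # Decoder: latent vector -> deconvolutions -> reconstructed frame
        self.fc_dec = nn.Linear(latent_dim, 128 * 8 * 8)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # -> 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # -> 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # -> 64x64
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def decode(self, z):
        h = self.fc_dec(z).view(-1, 128, 8, 8)
        return self.dec(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # noisy latent sample
        recon = self.decode(z)
        # Reconstruction loss plus KL term that regularises the latent space.
        loss = F.mse_loss(recon, x, reduction="mean") \
            - 0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, z, loss
```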
During the operational phase, the last displayed video frame is converted to a latent representation using the pre-trained VAE model; a random walk or other stepping algorithm is then started in the latent space with successive steps of size s, and an image is generated from each latent representation using the deconvolutional decoder model.
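A minimal sketch of this operational phase is given below, assuming a trained VAE with encode/decode methods as in the sketch above; the step size and frame count are illustrative.

```python
import torch


@torch.no_grad()
def generate_composite_frames(vae, last_frame, n_frames=4, step_size=0.05):
    """Encode the last displayed frame to its latent representation, then take
    small random steps in latent space and decode each new latent vector into
    a composite frame."""
    vae.eval()
    mu, _ = vae.encode(last_frame.unsqueeze(0))   # latent coordinates of last frame
    z = mu
    frames = []
    for _ in range(n_frames):
        z = z + step_size * torch.randn_like(z)   # random walk, no jumps
        frames.append(vae.decode(z).squeeze(0))
    return frames
```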
The various components 223, 243, 247, 227, 233, 245 of the apparatus 220 may be implemented using a processor and memory containing processor instructions and a generative model 245 and/or a predictive model 243. However, dedicated circuitry may be used for some of these components.
Fig. 3 illustrates a generative model according to an embodiment. The generative model 345 may be a variational autoencoder (VAE) including an encoder 350a and a decoder 350b. VAEs are conceptually similar to autoencoders, which are artificial neural networks that learn to copy their inputs to their outputs via transitions to and from corresponding latent variables. The VAE uses a modified learning approach to constrain the distribution of the latent variables. Changing these latent variables produces new outputs.
The encoder 350a includes layers of perceptrons or other nodes, including an input layer and hidden layers, to convert the input into coordinates 375 in the latent space 370, which may be represented by a latent vector 345. In this embodiment, the input is a video frame 230a, which may be represented by an input vector 340a. The dimensionality of the latent vector 345 is reduced compared to the input vector 340a. Although the latent space is shown as having only two dimensions for simplicity, it should be understood that there may be more dimensions, albeit fewer than are required for the video frames themselves.
The decoder 350b includes layers of perceptrons, neurons, or other nodes, including hidden layers and an output layer, to convert the coordinates 375 in the latent space 370 into an output 330b. The output 330b may be represented by an output vector 340b. The output layer of the decoder has the same number of nodes as the input layer of the encoder 350a, so that the input vector 340a and the output vector 340b have the same dimensions. By training the network 345, the output 330b becomes an identical or approximate copy of the input 330a, where the latent coordinates of each input represent the input in the reduced dimensions of the latent space sufficiently well that an output 330b very similar to the input 230a can be extracted from the latent vector 345.
Once the VAE is trained well enough, it can be used to generate an output 330b that is slightly different from the input 230a. For example, in the event that the input stream 230a of video frames ceases, the VAE may be used to continue generating outputs 330b corresponding to composite video frames that differ from the last received video frame, in effect predicting the original stream. This may be accomplished by repositioning the coordinates 375 in the latent space 370, in other words modifying the latent vector 345 and decoding the modified latent vector to generate a new video frame. By changing the latent vector only slightly, the generated video will differ only slightly from the last received video frame. By making a series of changes to the latent vector, a corresponding sequence of video frames may be decoded, as described in more detail below with respect to Fig. 4.
Fig. 4 illustrates the generation of a new video frame using the generative model of Fig. 3. A video stream comprising video data may be associated with a sequence of video frames 430a1 through 430a6, which may be generated from the received video data or estimated using a generative model 445. The video frames 430a1-430a6 shown on the left, corresponding to the video data stream, may be considered the original frames of a video game or video sequence that is forwarded to a device or apparatus for reproduction via the video data stream. The first two video frames 430a1, 430a2 may be rendered using the received video data to generate the video frames 430b1, 430b2, e.g., using an MPEG4 decoder or other known device or software. However, an interruption of the video data transfer channel, for example due to a loss of wireless signal caused by user movement, interference, or other reasons, prevents the use of this mechanism to render further video frames 430a3 through 430a5.
When this reduction in video frame generation using the received video data occurs, the generative model 445 is used to generate new video frames 430c3 through 430c6. The reduction in generating video frames using received video data may be due to a complete loss of connection or a reduction in data rate or bandwidth such that there is insufficient information to generate video frames using the received video data.
When sufficient video data is received again, a new video frame 430b6 corresponding to the video data of the video frame 430a6 may be generated again using the received video data. At this point, the corresponding video frame 430c6 may also be generated using the generative model 445. The displayed video frames 430b6, 430c6 may switch back to the video frame 430b6 generated using the received video data, or a combination of the two video frames 430b6 and 430c6 may be used.
The generative model 445 includes an encoder 450a and a decoder 450b. When a new video frame needs to be generated using the generative model, the last generated video frame 430b2 is input to the encoder 450a to find coordinates 475b-2 in the latent space 470 of the model. Alternatively, another previously generated video frame 430b1 may be used; for example, the last reference video frame generated from the video data may be used, or, when the system detects a change of scene at the point of interruption, a stored reference frame corresponding to the new scene may be used.
Each set of coordinates 475b-1 through 475c-6 may be represented by a latent vector in a computer processing system. A change in the position of the coordinates, represented by a change in the latent vector values, corresponds to a change in the video frame they represent. For example, the change in position between coordinates 475b-1 and 475b-2 corresponds to the change in content between video frames 430b1 and 430b2. The change may, for example, be a small change corresponding to a person moving slightly against a largely stationary background. A large change may correspond to an entire scene panning from one type of landscape to another, or even an entirely new scene with no common visual elements compared to the previous scene. The change in position between coordinates is referred to herein as a step 480 and may be of any magnitude in any direction of the multidimensional space. As described, the step size and direction will depend on the change in the visual elements of the corresponding video frames.
To generate a new video frame using the generative model 445, the coordinates 475b-2 of the last used video frame 430b2 generated from the received video data are used as a starting point, and a step is applied to find new coordinates 475c-3. The new coordinates 475c-3 correspond to a modification of the latent vector of the previous video frame 430b2, and the modified latent vector is decoded by the decoder 450b to generate a new video frame 430c3. Similarly, further steps may be applied to find subsequent new coordinates 475c-4, 475c-5, and 475c-6, which are decoded to generate video frames 430c4, 430c5, 430c6; each video frame has changed video content compared to the previous frame according to the change in position of its corresponding coordinates in the latent space. Depending on the dimension or dimensions affected, a large step size may result in significant changes in one or more aspects of the video content.
The size and/or direction of the step 480 used may depend on the application; for example, a video game may use a random walk algorithm, which may also be influenced by markers in the game indicating a changing scene or event. An augmented reality (AR) application may use a systematic walk corresponding to continuing to move in the same direction, for example in a factory visit where information about the machines is superimposed on the display. A suitable algorithm for determining the step size may be determined experimentally; two illustrative step policies are sketched below. In some embodiments, the size and/or direction of the step 480 may depend on the rate of change of the sequence of video frames 430a1, 430a2, and/or on a prediction of future video frames made before the reduction in generating video frames using the received video data.
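The two policies mentioned above could be sketched as follows; both are hypothetical placeholders for application-specific algorithms determined experimentally.

```python
import torch


def random_walk_step(z, step_size=0.05):
    """Small random step in latent space, e.g. for game content with
    unpredictable motion."""
    return z + step_size * torch.randn_like(z)


def systematic_walk_step(z, direction, step_size=0.05):
    """Keep moving in the same latent direction, e.g. for an AR factory visit
    where the viewpoint continues along a known trajectory."""
    return z + step_size * direction / direction.norm()
```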
When video data is again received and can be used to generate video frame 430a6, the generative model 445 may continue to generate a new video frame 430c6 while a new video frame 430b6 is generated from the newly received video data. A device using both generation methods may then switch from the VAE-generated video frame 430c5 to the video frame 430b6 generated from the received video data, continue using the VAE-generated video frame 430c6 until the connection carrying the video data is deemed stable, or use a combination of the VAE-generated video frame 430c6 and the video-data-generated video frame 430b6.
The VAE-generated video frame 430c6 and the video-data-generated video frame 430b6 may be blended over time, for example by initially weighting the VAE frame 430c6 more heavily and then increasing the weight of the video-data-generated frame 430b6. As can be seen in the latent space 470, the coordinates 475c-6 of the video frame 430c6 generated by the VAE may differ from the coordinates 475b-6 of the video frame 430b6 generated from the newly restored video data stream. In this case, an abrupt switch between the two may result in a significant content change, which may be unpleasant for the viewer/user, and it may therefore be preferable to blend the images while slowly moving completely to the video frames generated using the received video data. Various algorithms for blending frames may be used, such as Ross T. Whitaker, "A Level-Set Approach to Image Blending", IEEE Transactions on Image Processing, vol. 9, no. 11, p. 1849, November 2000.
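A minimal sketch of such time-varying blending is given below; a simple cross-fade is used for illustration, and a level-set or other blending algorithm could be substituted. The ramp length is an assumed parameter.

```python
def blend_frames(vae_frame, data_frame, t, ramp_frames=10):
    """Cross-fade from the VAE-generated frame to the frame generated from the
    restored video data over `ramp_frames` display intervals (t = 0, 1, 2, ...)."""
    w = min(t / ramp_frames, 1.0)          # weight of the data-generated frame
    return (1.0 - w) * vae_frame + w * data_frame
```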
Fig. 5 is a flow diagram of a method of processing video data according to an embodiment. Method 500 may be performed by a device such as a VR/AR headset, smartphone, or other display device including, for example, the devices described with respect to fig. 1-4. Portions of the method may be implemented on a server remote from the display device, as described in more detail below.
At 505, the method 500 receives video data, such as an MPEG2/4 compressed video frame representation. The video data may be received over a connection such as a wireless 3GPP or WiFi channel. The received video data may be used to play a video game, in which user actions (e.g., moving a headset display to change the view within the game) interact with and change the received video data. The video data may also depend on the actions of other users playing the game. In another application, the received video data may be used for AR, for example to display information about machines within a factory that the user is viewing.
At 510, the method generates a video frame using the received data. The video frames may be generated using known circuitry and/or software including, for example, an MPEG decoder. The generated video frames may be displayed to a user of a headset or other display device.
At 515, method 500 determines whether there is a reduction in the generation of video frames that may result from degradation of the connection used to transmit the video data. For example, shadowing or interference of a wireless signal used to carry video data may result in some of the video data not being received, which may result in a reduction in the frame rate or resolution of video frame generation, or a complete interruption of video frame generation.
The reduction in video frame generation may be determined based on detecting a predetermined change in a connection metric (e.g., packet delay variation, or jitter). Alternatively or additionally, the reduction in video frame generation may be determined based on detecting predetermined changes in video quality assessment metrics, such as inter-frame delay (e.g., the inter-frame time intervals between displayed video frames), video bit rate, video frame rate, and other metrics, such as those described in International Telecommunication Union (ITU) Recommendation P.1204, "Video quality assessment of streaming services over reliable transport for resolution up to 4K".
Prediction algorithms based on these or other metrics, including, for example, machine-learning-based models, may also or alternatively be used. The reduction in video frame generation may be determined by one or more of these metrics falling outside a threshold and/or by a corresponding output from a prediction model. In one example, this may correspond to a quality score of less than 3 for one of the outputs specified in section 7.3 of ITU-T P.1204 (01/2020).
If it is determined that the generation of video frames has not been reduced (515 N), the method returns to 505; otherwise (515 Y), the method 500 proceeds to 520. At 520, a previous video frame is encoded into a latent vector using the encoder portion of the generative model (e.g., a VAE). The previous video frame may be, for example, the last video frame generated from the received video data. While video frames may be encoded in response to determining a reduction in video frame generation (e.g., a detected or predicted interruption of the connection), video frames generated from the video data may already have been encoded into the latent space prior to this determination or event, e.g., to continue training the VAE by comparing decoded latent vectors with the video generated from the video data.
At 525, the latent vector corresponding to the last generated video frame, or to another video frame generated from the received video data and selected in response to the determination at 515, is modified. The modification corresponds to the coordinates in the latent space being changed or moved by a step. The size and direction of the step in the latent space may be determined based on the application for which the video frames are being used, and may include, for example, random walks of random size and/or direction, or systematic walks of fixed size and/or direction. The modified latent vector corresponds to a change in the visual components of the last used video frame.
At 530, the method 500 decodes the modified latent vector using the decoder of the generative model to generate a new video frame, i.e., a video frame generated by the generative model (rather than from the received video data), which may be referred to herein as a composite video frame. Additional video frames may be generated by further modifying the latent vector and decoding the further modified latent vectors.
At the same time, if there is sufficient bandwidth and/or connection stability, the received video data may be used to continue generating video frames, albeit at a lower frame rate. Some of these video frames generated from the video data may be displayed to the user along with the composite video frames. Video frames generated from the video data may be interspersed with the synthetically generated video frames, or they may be blended. In another arrangement, any video frames that can still be generated from the video data may be encoded and used to update the latent vector so that it tracks the expected trajectory of the video frames, as in the sketch below. The updated latent vector may then be modified by continuing to apply steps in the latent space, thereby generating new composite video frames. However, in some cases, the connection may not be sufficient to provide any or sufficient video data to generate video frames, and in such cases the generative model continues to generate new composite video frames by modifying the latent vector.
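A minimal sketch of this re-anchoring arrangement is given below, assuming the VAE sketched earlier; the step size and names are illustrative.

```python
import torch


@torch.no_grad()
def next_composite_frame(vae, z, real_frame=None, step_size=0.05):
    """If a frame could still be generated from the (degraded) video data,
    re-anchor the latent vector to it so the walk tracks the true trajectory;
    otherwise keep stepping from the previous latent vector."""
    if real_frame is not None:
        z, _ = vae.encode(real_frame.unsqueeze(0))
    z = z + step_size * torch.randn_like(z)
    return vae.decode(z).squeeze(0), z
```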
At 535, the method 500 determines whether there is an increase in video frame generation. This may be due to the re-establishment of the connection carrying the video data, or to an increase in the rate at which video frames are generated using the received video data above a threshold. If it is not determined that there has been an increase in the generation of video frames using the received video data (535 N), the method 500 proceeds to 525 to again modify the latent vector. However, if it has been determined that there is an increase in the generation of video frames using the received video data (535 Y), the method 500 proceeds to 505 to again receive video data and generate video frames using the received video data.
The method may alternatively proceed to 540, where the video frames generated using the generative model and the video frames generated using the received video data are blended. This may avoid large discontinuities in the displayed video frames such that the initially displayed image is based primarily on the composite video frame and then slowly moves towards the video frame generated using the video data.
The method 500 may be implemented in a single device (e.g., a VR headset or smartphone) that performs the method and displays video frames on its onboard display screen or sends the video frames to a separate display (e.g., a VR headset connected via Bluetooth™ or WiFi™).
In some embodiments, the device W_1_1 may cooperate with other devices W_1_2, ..., W_1_N and/or servers M_1, M_2, M_All. Fig. 6 shows a flow diagram of signaling and events according to an embodiment. The device W_1_1 is a device, such as a VR headset, that receives video data corresponding to a multiplayer game. The other devices W_1_2 and W_1_N belong to other players of the game, which is streamed from server M_1 to each of the devices W_1_1, W_1_2, ..., W_1_N. Although the players may all be in the same gaming environment, they may be in different locations and/or facing different directions and will therefore receive respective video data from the server M_1. The devices W_1_1, W_1_2, ..., W_1_N form a federation with the server M_1, which can use federated learning to improve the performance of the generative models used inside and outside the federation.
Other federations playing the same game, but in different groups of players and devices, may have video data streamed from a different server M_2. The different server M_2 may be implemented on the same or different hardware as the first server M_1. A master server M_All may extend the federated learning approach by training a number of federated generative models across the same or similar games. Federated learning is a machine learning technique for training a model across multiple decentralized devices and/or servers without sharing local data. For example, the weights used in the VAEs in the various devices in a federation may be shared with a server (e.g., M_1, M_2) that aggregates these weights and shares the aggregated weights with the devices according to known methods to update their VAEs, providing improved learning compared to relying only on the video data each device receives itself. Similarly, aggregated weights from multiple federations may be shared with the master server M_All, which further aggregates the weights and redistributes them to the servers M_1, M_2 and then to the devices W_1_1, W_1_2, ..., W_1_N in each federation. An illustrative sketch of such weight aggregation follows.
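The sketch below shows a simple federated average of VAE weights (PyTorch); the function and variable names are illustrative, and other known aggregation methods could be used instead.

```python
import torch


def federated_average(state_dicts):
    """Aggregate VAE weights reported by the devices in a federation (or by the
    federation servers at the master server) by element-wise averaging."""
    avg = {}
    for key in state_dicts[0]:
        # Cast to float so that integer buffers can also be averaged in this sketch.
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg


# e.g. on server M_1: new_weights = federated_average([w_1_1, w_1_2, w_1_n])
# each device then loads new_weights into its local VAE via load_state_dict().
```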
In this way, the VAE used by each device to generate composite video frames is continuously trained using video data from many other devices, thereby improving the generation of composite frames even for portions of the game that the device has not yet experienced (while other devices may have, and their experience is used to improve composite video generation for those portions of the game). Thus, even for complex games with many possible scenes, accurate composite video generation can be achieved quickly by training the VAE encoders and decoders using video data received from a potentially large number of devices.
Referring to Fig. 6, at 605, each device W_1_1, W_1_2, ..., W_1_N in the federation trains its respective VAE or other generative model. This may occur when a video frame generated from received video data is available, which can then be fed through the VAE: the video frame is encoded into the latent space by the encoder portion of the VAE, and the resulting latent vector is decoded by the decoder portion of the VAE to generate a video frame output that is compared with the input video frame, with feedback given by known mechanisms to adjust the encoder and decoder weights.
The resulting VAE models, or their weights, are periodically forwarded from each device to the server M_1 at 610. At 615, the server aggregates the VAE models or weights according to known methods. A similar process may occur in other federations, where devices (not shown) forward their models or model weights to their server M_2, which aggregates the models or weights. At 620, each federation server M_1, M_2 sends the aggregated VAE models or weights to the master server M_All, which aggregates these models or weights at 625. The master server M_All then forwards the aggregated weights or VAE models to the federation servers M_1, M_2 at 630. The federation servers then distribute the aggregated weights or VAE models to the devices in their federations at 632.
Referring now to the device W_1_1, the device receives the updated weights or models and uses them for further processing of video data. This may include further training the VAE and periodically repeating the above process, thereby continuously updating one or many federated VAEs. At 635, the device determines whether there is a detected or predicted freeze or stop of the video. This is similar to the embodiments described with respect to Fig. 5 and is based on inter-frame delay or other parameters.
In response to this condition, the device obtains a latent representation of the last video frame at 645, for example by passing the stored video frame generated using the received video data through the encoder of the VAE. At 650, a next video frame is generated by modifying the latent representation or vector and decoding the modified latent vector. At 655, the transition between the video frames generated using the received video data and the composite video frames generated by the VAE is smoothed or blended. At 665, the (blended) video frames are displayed to the user of the device, e.g., on a smartphone screen or VR headset.
At 670, the device determines that the freeze or stop condition no longer applies, for example because the original content stream is being received again. At 675, the video frames generated using the re-established video data stream and the video frames generated by the VAE are blended or smoothed. At 680, the smoothed video frames are then displayed.
Fig. 7 shows a flow diagram of signaling and events according to another embodiment, in which the composite video frames are instead generated by the federation server M_1. As in the previous embodiment, at 705, each device W_1_1, W_1_2, ..., W_1_N in the federation trains its respective VAE or other generative model. The resulting VAE models, or their weights, are periodically forwarded from each device to the server M_1 at 710. At 715, the server aggregates the VAE models or weights according to known methods. A similar process may occur in other federations, where devices (not shown) forward their models or model weights to their server M_2, which aggregates the models or weights.
At 720, each federation server M_1, M_2 sends the aggregated VAE models or weights to the master server M_All, which aggregates these models or weights at 725. The master server M_All then forwards the aggregated weights or VAE models to the federation servers M_1, M_2 at 730. However, instead of distributing the aggregated weights or VAE models to the devices in its federation, the server M_1 then retains the updated VAE.
At 735, the device W_1_1 determines a video freeze condition and, at 740, forwards its last generated video frame to the server M_1. The server then encodes the last video frame received from the device into a latent vector using the encoder of the updated VAE at 745. At 750, the latent vector is modified, as previously described, and at 755 the modified latent vector is decoded by the decoder of the updated VAE to generate the next video frame. The next video frame, and subsequent next video frames, are forwarded to the device at 760.
At 765, the (blended) video frames are displayed to a user of the device, for example on a smartphone screen or VR headset. At 770, the device determines that the freeze or stop condition is no longer applicable. At 775, the video frames generated using the reconstructed video data stream and the synthetic video frames forwarded by the federated server are blended or smoothed. At 780, the smoothed video frames are then displayed.
The embodiment of fig. 7 places a lower computational load on the device than the embodiment of fig. 6, and may therefore be advantageous for devices with more limited processing power. The embodiment of fig. 6 places a lower computational load on the server and reduces network traffic. The embodiment of fig. 6 may also be advantageous where the video application requires low delay and/or the network link cost is high. In addition, the embodiment of fig. 6 is more privacy-preserving, since the device does not send the original video frames to the server, and is therefore less exposed to intrusive attacks.
Fig. 8 is a schematic diagram of a device that may be used to process video data according to an embodiment. The device 800 comprises a processor 810 and a memory 820, the memory 820 containing computer program instructions 825 which, when executed by the processor, cause the processor to perform a method according to an embodiment. Example instructions that may be executed by the processor 810 are shown. The device may include a generative model 830 (e.g., a VAE) for generating synthetic video frames and a prediction model 835 for detecting or predicting video freeze events.
At 840, the processor 810 may generate a video frame using the received video data, e.g., as previously described. At 845, the processor may encode the video frame into a latent vector using the encoder of the VAE 830. This may be in response to a video freeze event predicted by the prediction model 835. At 850, the processor 810 may modify the latent vector, as previously described. At 855, the processor may decode the modified latent vector using a decoder portion of the generative model 830 to generate a new video frame. The new or synthetic video frame may be used in place of a video frame that would normally be generated from the received video data but is no longer available due to the video freeze event.
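Tying these steps together, a device-side loop might look like the sketch below, reusing the illustrative helpers from the earlier sketches (freeze_expected, synthesize_next_frames); the stream and display interfaces are assumptions made for the example, not part of the described device.

```python
def playback_loop(stream, display, encoder, decoder, step):
    """Illustrative device-side loop: show received frames, fall back to synthetic ones."""
    last_frame = None
    while True:
        frame = stream.next_frame()                      # 840: frame from received video data
        if frame is not None and not freeze_expected(stream.recent_intervals()):
            display.show(frame)
            last_frame = frame
        elif last_frame is not None:
            # 845-855: encode the last good frame, step through latent space, decode
            for synthetic in synthesize_next_frames(last_frame, encoder, decoder, step):
                display.show(synthetic)
                if stream.recovered():
                    break
```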
Embodiments may provide a number of advantages. For example, by avoiding video freezes and instead displaying video frames generated on the fly, the quality of experience (QoE) of users of real-time video streaming is improved in applications such as multiplayer video games and AR. Embodiments are also energy- and bandwidth-friendly, as they do not overload the transmission link with repeated packet request messages (and the retransmissions these cause), but can temporarily create their own content. These embodiments can accommodate long stop events caused, for example, by significant connection degradation or interruptions. Some embodiments may use federated learning to accelerate and improve the training of the generative models so that video frames are generated with greater accuracy. Perceptual information at the device may be continuously collected and processed so that the correct content is mapped to the correct time.
Although embodiments are described with respect to processing video data, many other applications are possible, including, for example, audio data, or a combination of video, audio, and other streaming data.
Some or all of the described server functions may be instantiated in a cloud environment such as Docker, Kubernetes, or Spark. Alternatively, they may be implemented in dedicated hardware.
Modifications and other variations to the described embodiments will be apparent to those skilled in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the embodiments are not to be limited to the specific examples disclosed and that modifications and other variations are intended to be included within the scope of the present disclosure. Although specific terms may be employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims (32)

1. A method of processing video data, the method comprising:
generating a video frame using the received video data (510);
in response to determining a reduction in video frames generated using the received video data (515), encoding the video frame into a latent vector using an encoder portion of a generative model (520);
modifying the latent vector (525);
decoding the modified latent vector using a decoder portion of the generative model to generate a new video frame (530).
2. The method of claim 1, wherein determining a reduction in video frames generated from the received video data comprises one or more of: predicting or detecting a predetermined change in a video quality assessment metric; and/or predicting or detecting a predetermined change in a performance metric of a connection used to receive the video data.
3. The method of claim 2, wherein detecting or predicting a predetermined change in a video quality assessment metric comprises: detecting or predicting, using an inter-frame delay prediction model (243), that an inter-frame time interval between video frames displayed on a display is above a threshold.
4. The method of claim 2, wherein the performance metric is one or more of: packet delay; packet variation; bandwidth; received power.
5. The method according to any of the preceding claims, wherein the generative model (245) is a variational auto-encoder, VAE.
6. The method of any preceding claim, wherein the generative model is one or more of: a pre-trained generative model; a generative model trained using a sequence of video frames generated using the received video data; a generative model having model weights that are updated using received weight data.
7. The method of any of the preceding claims, wherein modifying the latent vector comprises: for one or more new video frames (430c3, 430c4, 430c5, 430c6), moving a position (475b-2, 475c-3, 475c-4, 475c-5, 475c-6) corresponding to the latent vector (345) by a step (480) in the latent space (470).
8. The method of claim 7, wherein the size and/or direction of the step (480) depends on one or more of: a rate of change of a sequence of video frames (430a1, 430a2) prior to the reduction in video frames generated using the received video data; a prediction of future video frames; an application using the video data.
9. The method according to any of the preceding claims, comprising: in response to determining an increase in video frames generated using the received video data, switching from a new video frame (430c6) generated using the decoder portion to a video frame (430b6) generated using the received video data.
10. The method of claim 9, wherein the switching comprises mixing new video frames (430c6) generated using the decoder portion and video frames (430b6) generated using the received video data, wherein, in the mixing, the weight of the video frames generated using the received video data increases over time.
11. The method of any preceding claim, comprising displaying the video frame using one or more of the following applications: video on demand; real-time video; gaming; virtual reality; and augmented reality.
12. A method of processing video data, the method comprising:
receiving a video frame from a first device (640) and encoding the video frame into a latent vector using an encoder portion of a first generative model (645);
modifying the latent vector (650);
decoding the modified latent vector using a decoder portion of the first generative model to generate a new video frame (655);
forwarding the new video frame (660) to the first device.
13. The method of claim 12, comprising:
receiving second generative models (610) from a plurality of devices, the second generative models having respective model weights and an encoder portion and a decoder portion; and
aggregating the model weights to generate the first generative model (615).
14. The method of claim 13, wherein a second generative model is received from the first device and used to generate the first generative model.
15. The method of any of claims 12 to 14, comprising:
forwarding the first generative model to a server (620) and receiving an updated first generative model from the server (630);
encoding (645) the video frame and decoding (655) the modified latent vector using the updated first generative model.
16. A device for processing video data, the device (800) comprising a processor (810) and a memory (820), the memory containing instructions (825) executable by the processor whereby the device is operable to:
generating a video frame using the received video data (840);
in response to determining a reduction in video frames generated using the received video data (515), encoding the video frame into a latent vector using an encoder portion of a generative model (845);
modifying the latent vector (850);
decoding the modified latent vector using a decoder portion of the generative model to generate a new video frame (855).
17. The apparatus of claim 16, operable to determine a reduction in video frames generated from the received video data by: predicting or detecting a predetermined change in a video quality assessment metric; and/or predicting or detecting a predetermined change in a performance metric of a connection used to receive the video.
18. The apparatus of claim 17, operable to detect or predict a change in a video quality assessment metric by: detecting or predicting, using an inter-frame delay prediction model (243), that an inter-frame time interval between frames displayed on a display is above a threshold.
19. The apparatus of claim 17, wherein the performance metric is one or more of: packet delay; packet variation; bandwidth; received power.
20. The apparatus according to any of claims 16 to 19, wherein the generative model (830) is a variational auto-encoder, VAE.
21. The apparatus of any of claims 16 to 20, operable to: receiving a pre-trained generative model; training the generative model using a sequence of video frames; and/or receiving weight data to update the generative model.
22. Apparatus according to any of claims 16 to 21, operable to modify the latent vector by: for one or more new video frames (430c3, 430c4, 430c5, 430c6), moving a position (475b-2, 475c-3, 475c-4, 475c-5, 475c-6) corresponding to the latent vector (345) by a step (480) in the latent space (470).
23. The apparatus as recited in claim 22, wherein the size and/or direction of the step (480) depends on one or more of: a rate of change of a sequence of video frames (430a1, 430a2) prior to the reduction in video frames generated using the received video data; a prediction of future video frames; an application using the video data.
24. The apparatus of any of claims 16 to 23, operable to switch from a new video frame (430c6) generated using the decoder portion to a video frame (430b6) generated using the received video data in response to determining an increase in video frames generated using the received video data.
25. The apparatus of claim 24, operable to mix a new video frame (430c6) generated using the decoder portion with a video frame (430b6) generated using the received video data, wherein, in the mixing, the weight of the video frame generated using the received video data increases over time.
26. Apparatus according to any of claims 16 to 25, operable to display the video frame using one or more of the following applications: video on demand; real-time video; gaming; virtual reality; and augmented reality.
27. A device for processing video data, the device (125) comprising a processor (127) and a memory (128), the memory containing instructions (129) executable by the processor, whereby the device is operable to:
receiving a video frame (640) from a first device and encoding the video frame into a latent vector (645) using an encoder portion (140a) of a first generative model;
modifying the latent vector (650);
decoding the modified latent vector using a decoder portion (140b) of the first generative model (135) to generate a new video frame (655);
forwarding the new video frame to the first device (660).
28. The apparatus of claim 27, operable to:
receiving second generative models (610) from a plurality of devices, the second generative models having respective model weights and an encoder portion and a decoder portion; and
aggregating the model weights to generate the first generative model (615).
29. The apparatus of claim 28, operable to receive a second generative model from the first device and generate the first generative model using the second generative model from the first device.
30. The apparatus of any of claims 27 to 29, operable to:
forwarding the first generative model to a server (620) and receiving an updated first generative model from the server (630);
encoding (645) the video frame and decoding (655) the modified latent vector using the updated first generative model.
31. A computer program comprising instructions which, when executed on a processor, cause the processor to carry out the method of any one of claims 1 to 15.
32. A computer program product comprising a non-transitory computer readable medium having the computer program of claim 31 stored thereon.
CN202080103572.6A 2020-08-26 2020-08-26 Generating and processing video data Pending CN115989530A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/073838 WO2022042831A1 (en) 2020-08-26 2020-08-26 Generating and processing video data

Publications (1)

Publication Number Publication Date
CN115989530A true CN115989530A (en) 2023-04-18

Family

ID=72243140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080103572.6A Pending CN115989530A (en) 2020-08-26 2020-08-26 Generating and processing video data

Country Status (4)

Country Link
US (1) US20230319321A1 (en)
EP (1) EP4205405A1 (en)
CN (1) CN115989530A (en)
WO (1) WO2022042831A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11961287B2 (en) * 2020-10-02 2024-04-16 Servicenow Canada Inc. Method and system for meaningful counterfactual explanations
US20220215232A1 (en) * 2021-01-05 2022-07-07 Nvidia Corporation View generation using one or more neural networks
JP2022174948A (en) * 2021-05-12 2022-11-25 横河電機株式会社 Apparatus, monitoring system, method, and program
CN116931864B (en) * 2023-09-18 2024-02-09 广东保伦电子股份有限公司 Screen sharing method and intelligent interaction panel

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2513090B (en) * 2013-01-28 2019-12-11 Microsoft Technology Licensing Llc Conditional concealment of lost video data
CN107454429B (en) * 2017-08-21 2019-12-24 武汉兴图新科电子股份有限公司 Video error concealment method based on motion vector extrapolation and image gradient weighting

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108141587A (en) * 2015-10-09 2018-06-08 微软技术许可有限责任公司 For reducing the video stand-by period recipient side modification
CN110612538A (en) * 2017-06-09 2019-12-24 渊慧科技有限公司 Generating discrete potential representations of input data items
US20190358541A1 (en) * 2018-05-24 2019-11-28 Microsoft Technology Licensing, Llc Dead reckoning and latency improvement in 3d game streaming scenario
CN110662019A (en) * 2018-06-28 2020-01-07 统一专利有限责任两合公司 Method and system for assessing the quality of video transmission over a network
US20200244969A1 (en) * 2019-01-25 2020-07-30 At&T Intellectual Property I, L.P. Video compression with generative models
CN110889509A (en) * 2019-11-11 2020-03-17 安徽超清科技股份有限公司 Joint learning method and device based on gradient momentum acceleration

Also Published As

Publication number Publication date
EP4205405A1 (en) 2023-07-05
WO2022042831A1 (en) 2022-03-03
US20230319321A1 (en) 2023-10-05

Similar Documents

Publication Publication Date Title
US20230319321A1 (en) Generating and processing video data
Nguyen et al. An optimal tile-based approach for viewport-adaptive 360-degree video streaming
Bao et al. Motion-prediction-based multicast for 360-degree video transmissions
US11032590B2 (en) Methods, devices, and systems for providing panoramic video content to a mobile device from an edge server
US20180227606A1 (en) System and Method for Automatically Selecting Encoding/Decoding for Streaming Media
CN111107440A (en) Video transmission control method and device, equipment and storage medium
US20220030284A1 (en) Methods, systems, and devices for identifying viewed action of a live event and adjusting a group of resources to augment presentation of the action of the live event
US20160358074A1 (en) Methods and Systems for Counting People
Lee et al. Deep neural network–based enhancement for image and video streaming systems: A survey and future directions
US10609440B1 (en) Timing data anomaly detection and correction
US20180302630A1 (en) Method And Apparatus For Improving Efficiency Of Content Delivery Based On Consumption Data
US11159823B2 (en) Multi-viewport transcoding for volumetric video streaming
US11575894B2 (en) Viewport-based transcoding for immersive visual streams
US20200404241A1 (en) Processing system for streaming volumetric video to a client device
TW201306601A (en) Frame encoding selection based on frame similarities and visual quality and interests
US11470360B2 (en) Adaptive field of view prediction
US11641498B2 (en) Method, systems and devices for providing adjusted video content according to viewing distance
WO2023024802A1 (en) Data transmission method and apparatus, device, storage medium, and program
KR20220031120A (en) Prediction-based dropped frame processing logic in video playback
Nguyen et al. A client-based adaptation framework for 360-degree video streaming
Ren et al. Adaptive computation offloading for mobile augmented reality
US20220408097A1 (en) Adaptively encoding video frames using content and network analysis
CN115868161A (en) Reinforcement learning based rate control
Han et al. Qoe oriented adaptive streaming method for 360 virtual reality videos
CN114747225B (en) Method, system and medium for selecting a format of a streaming media content item

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination