CN113934890A - Method and system for automatically generating scene video by characters - Google Patents

Method and system for automatically generating scene video by characters

Info

Publication number
CN113934890A
CN113934890A (application number CN202111538104.4A)
Authority
CN
China
Prior art keywords
image
vector
generated
dynamic
video
Prior art date
Legal status
Granted
Application number
CN202111538104.4A
Other languages
Chinese (zh)
Other versions
CN113934890B (en)
Inventor
马诗洁
王俊彦
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202111538104.4A priority Critical patent/CN113934890B/en
Publication of CN113934890A publication Critical patent/CN113934890A/en
Application granted granted Critical
Publication of CN113934890B publication Critical patent/CN113934890B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7844 - Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 - Adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 - Quantisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to the field of video production, in particular to a method and a system for automatically generating scene videos from text. The system comprises: a composition logic generation module for generating a composition template image according to the input text description; an image content generation module that takes the composition template image generated by the composition logic generation module as input and outputs a rendered live-action image; and an image dynamic module for converting the live-action image output by the image content generation module into continuous multi-frame images to generate a dynamic video. Based on a pre-trained natural language model and computer vision technology, the method automatically generates short videos from a given language input without third-party manual intervention, which greatly improves the efficiency of short-video production; at the same time, the generated short videos are realistic and diverse, ensuring both the quality of the generated video and the novelty of the video material.

Description

Method and system for automatically generating scene video by characters
Technical Field
The invention relates to the field of video production, in particular to a method and a system for automatically generating scene videos by characters.
Background
With the development of the internet, short video has emerged and rapidly become part of people's lives as a new way of recording content and presenting media.
In the field of video production, the traditional production process is cumbersome: specific material must be searched for or shot for each video. At the same time, to avoid repetitive video material and keep videos novel, designers must continuously create and design new content.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides the following technical solutions:
a method for automatically generating scene videos by characters comprises the following steps:
Step one: compressing and quantizing a composition template image through a vector quantization variational autoencoder (VQ-VAE) to generate image coding vectors and image tokens;
Step two: coding the input language description through a pre-trained neural network language model to obtain word vectors and word tokens;
Step three: flattening the image coding vectors from step one, concatenating them with the word vectors from step two, and feeding the result into a GPT (Generative Pre-trained Transformer) model for autoregressive training, thereby establishing and modeling a direct relation between language and composition template images, so that inputting a sentence of language description generates a corresponding composition template image;
Step four: generating a live-action picture from the composition template image generated in step three based on a style migration GAN network;
Step five: generating a subsequent series of image frames from the live-action picture generated in step four based on an image dynamic GAN network, and generating a video.
Further, step one specifically includes: the composition template image x is sent into a trained vector quantization variational autoencoder (VQ-VAE) and converted into a sequence in a discrete latent space, yielding h × w image coding vectors e of dimension d and the corresponding image tokens.
Further, the vector quantization variational autoencoder VQ-VAE is mainly divided into three modules: an Encoder, a codebook and a Decoder. The Encoder module encodes the input composition template image and the Decoder decodes it; the two modules share the codebook module. Specifically, the Encoder encodes the image x into z_e(x); each vector in z_e(x) is quantized according to its Euclidean distance to the codebook vectors, i.e. the nearest vector is found in the codebook by nearest-neighbour lookup, so that z_e(x) is converted into the closest discrete code e. The output is thus h × w image coding vectors e of dimension d and the corresponding tokens, and e is sent into the Decoder module, which decodes it to generate the composition template image x̂.
Further, the vector quantization variational autoencoder VQ-VAE is trained as follows: semantic segmentation images are adopted as the training data set, and the VQ-VAE training loss is reduced by back-propagation with the Adam stochastic gradient algorithm to obtain the optimal parameters of the model.
Further, step two specifically includes: the input language description is coded by a pre-trained neural network language model to generate k word vectors t and k word tokens.
further, the third step specifically includes the following steps:
(3.1) mixing
Figure 988534DEST_PATH_IMAGE002
Flattening the image code vectors to generate g image code vectors, wherein
Figure 412431DEST_PATH_IMAGE009
G is a fixed value, and position embedding is added to g image coding vectors;
(3.2) splicing the k word vectors and the g image coding vectors to obtain the text
Figure 278756DEST_PATH_IMAGE007
And image
Figure 335573DEST_PATH_IMAGE003
Also performing a splicing operation to generate
Figure 696279DEST_PATH_IMAGE010
An embedded representation vector
Figure 459835DEST_PATH_IMAGE011
(3.4) dividing the f vectors
Figure 762641DEST_PATH_IMAGE011
And sending the text word vector to a GPT model for autoregressive training, and establishing a relation between the text word vector and the image coding vector, wherein the training of the GPT model specifically comprises the following steps: the training set is f code vectors
Figure 244438DEST_PATH_IMAGE011
And the corresponding image and word token thereof, sending the f encoding vectors into a GPT model, predicting the next token to appear by the GPT model according to the previously input vector, and reducing the softmax classification loss function through a random gradient descent back propagation algorithm;
during prediction, word vectors input by languages are input into a GPT model, composition image compression coding tokens are predicted step by step, the generated tokens are sent into codebook to find corresponding vectors, and the formed compression coding vector diagram is sent into a Decoder module to generate a composition template image
Figure 907369DEST_PATH_IMAGE012
Further, step four specifically includes: the composition template image x' generated in step three and random noise are sent into the style migration GAN network. The style migration GAN network comprises a generator and a discriminator: the generator uses the noise injected into the network to control the stochastic attributes of the live-action picture, while the discriminator distinguishes whether a live-action picture is real or generated by the generator. During network training the generator and the discriminator are trained jointly; during prediction, the composition template image x' generated in step three is input into the generator, which outputs the generated live-action picture y.
Further, step five specifically includes the following steps:
(5.1) training an image dynamic GAN network, which is mainly divided into two modules: the generator controls the video characteristics through latent vectors and, combined with the noise injected into the network, controls the stochastic attributes of the video; the discriminator distinguishes whether a video is real or generated by the generator. The latent vectors are divided into two groups: latent vectors encoding colour and image layout, and latent vectors encoding the overall brightness of the video. The noise injected into the network is likewise divided into two groups: noise encoding the details and shapes of static entities in the video, and noise encoding the details and shapes of dynamic entities in the video;
(5.2) training an Encoder model, which takes a live-action picture as input and maps it to the latent space of the GAN network; the live-action picture y output in step four is sent into the trained Encoder model to obtain the latent code corresponding to the live-action picture;
(5.3) marking the static and dynamic regions in the image with the composition template map x' corresponding to the live-action picture y, optimizing the latent code output by the Encoder model according to the reconstruction error, and generating the corresponding input noise from the marked static/dynamic region information, namely the noise input n_d generated to control the dynamic region and the noise input n_s generated to control the static region;
(5.4) applying m dynamic transformations to the noise input n_d that controls dynamic-region generation and, together with the previously generated latent code and the static-region noise n_s, inputting them into the generator of the pre-trained image dynamic GAN model to output m image frames; the original live-action picture and the generated m frames form a dynamic scene video of m + 1 frames.
Further, the dynamic transformation in step (5.4) specifically means transforming the noise input n_d that controls dynamic-region generation so as to simulate the motion of the dynamic regions.
A system for automatically generating a scene video from text, comprising:
the composition logic generation module is used for generating a composition template image of a composition according to the input text description;
the image content generation module inputs the composition template image generated by the composition logic generation module and outputs the composition template image as a rendered live-action image;
and the image dynamic module is used for converting the live-action image output by the image content generation module into a continuous multi-frame image to generate a dynamic video.
The invention has the advantages that:
the method is based on the natural language pre-training model and the computer vision technology, the short video is automatically generated through the given language input without manual intervention of a third party, the short video generation efficiency is greatly improved, and meanwhile, the generated short video has authenticity and diversity, and the quality of the generated video and the novelty of video materials are ensured.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is an overall architecture diagram of the composition logic generation module;
FIG. 4 is an overall architecture diagram of an image content generation module;
FIG. 5 is an overall architecture diagram of the image dynamic module;
FIG. 6 shows the result of an example of the present invention;
FIG. 7 is a hardware structure diagram of a device with data processing capability on which the apparatus for automatically generating scene videos from text according to the present invention resides.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly understood, the present invention is further described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, a system for automatically generating a scene video with text according to the present invention includes:
the composition logic generation module is used for generating a composition template picture of a composition according to the input text description;
the image content generation module inputs the composition template image generated by the composition logic generation module and outputs the composition template image as a rendered live-action image;
and the image dynamic module is used for converting the live-action image output by the image content generation module into a continuous multi-frame image to generate a dynamic video.
As shown in fig. 2, a method for automatically generating a scene video with text according to the present invention includes:
step 101, realizing token serialization of the composition template image through a vector quantization variation automatic encoder.
The invention adopts the vector quantization variational autoencoder VQ-VAE to compress and quantize the composition template; the VQ-VAE uses a conventional algorithm, specifically as follows:
the VQ-VAE mainly comprises three parts, an Encoder module, a codebook module and a Decoder module.
The input image x passes through the Encoder, whose output z_e(x) consists of h × w coding vectors of dimension d. The codebook contains K d-dimensional vectors, denoted C = {c_1, …, c_K}. For each vector of z_e(x), the vector in C with the shortest Euclidean distance is found and substituted, yielding the output e; e is fed into the Decoder, which outputs the image x̂. Here d = 256, h = 32, w = 32, K = 1024.
During training, the input of the network is x, of size C × H × W, where C = 1, H = 256, W = 256. The output of the network is the composition template image x̂, also of size C × H × W.
The Encoder and Decoder are built from conventional convolution layers, BatchNorm layers and residual networks. The Encoder contains n resnet_blocks and down-samples m times spatially; the Decoder applies convolution and up-sampling operations so that the input x and the output x̂ have the same size.
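For illustration, a minimal PyTorch sketch of the nearest-neighbour quantization described above is given below; tensor names and shapes are assumptions chosen to match d = 256, h = w = 32, K = 1024 stated in this embodiment, not code taken from the patent:

```python
import torch

def vector_quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """Quantize encoder output z_e of shape (B, d, h, w) against a
    codebook of shape (K, d) by nearest-neighbour lookup."""
    B, d, h, w = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, d)          # (B*h*w, d)
    # squared Euclidean distance to every codebook vector
    dist = (flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ codebook.t()
            + codebook.pow(2).sum(1))                       # (B*h*w, K)
    tokens = dist.argmin(dim=1)                             # image tokens
    e = codebook[tokens].view(B, h, w, d).permute(0, 3, 1, 2)
    # straight-through estimator so gradients flow back to the Encoder
    e = z_e + (e - z_e).detach()
    return e, tokens.view(B, h, w)
```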
The loss function of the VQ-VAE is as follows:

L = log p(x | z_q(x)) + || sg[z_e(x)] - e ||^2 + β || z_e(x) - sg[e] ||^2

where sg denotes the stop-gradient operator (stopping back-propagation) and β is a hyper-parameter. In the present invention, n = 4, m = 4, β = 0.25.
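As an illustrative sketch of the three loss terms, assuming a mean-squared-error reconstruction term in place of the log-likelihood and the β = 0.25 given above (x_hat, z_e and e are the reconstruction, Encoder output and quantized code from the sketch above):

```python
import torch.nn.functional as F

def vqvae_loss(x, x_hat, z_e, e, beta=0.25):
    """Reconstruction + codebook + commitment terms of the VQ-VAE loss."""
    recon = F.mse_loss(x_hat, x)                 # stands in for -log p(x | z_q(x))
    codebook_loss = F.mse_loss(e, z_e.detach())  # || sg[z_e(x)] - e ||^2
    commitment = F.mse_loss(z_e, e.detach())     # || z_e(x) - sg[e] ||^2
    return recon + codebook_loss + beta * commitment
```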
training: with the Adam optimizer, the initial learning rate was 0.001.
Prediction: for a VQ-VAE trained as above, any composition template image x is input; the Encoder module outputs z_e(x), for each vector of z_e(x) the nearest codebook vector is found and substituted, and the result is h × w image coding vectors e of dimension d together with the corresponding image tokens.
Step 102, the text description is input into the pre-trained neural network language model Bert, which outputs text tokens.
As shown in FIG. 3, the input language description is fed into Bert, generating k word vectors t and k word tokens. If fewer than k word vectors are output, a pad operation is applied so that the number of word vectors is exactly k, where k = 256. The word-vector matrix t has size k × d, where k = 256 and d = 256.
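One possible realisation of this text-encoding step, using the HuggingFace transformers library, is sketched below; the checkpoint name is an assumption, and bert-base models output 768-dimensional vectors, so an additional projection would be needed to reach the d = 256 used in this embodiment:

```python
import torch
from transformers import BertTokenizer, BertModel

K = 256  # number of word vectors / word tokens, as stated above

def encode_text(description: str, ckpt: str = "bert-base-chinese"):
    """Encode a language description into k word vectors and k word tokens."""
    tokenizer = BertTokenizer.from_pretrained(ckpt)
    bert = BertModel.from_pretrained(ckpt)
    enc = tokenizer(description, padding="max_length", max_length=K,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    word_vectors = out.last_hidden_state   # (1, K, 768) word vectors
    word_tokens = enc["input_ids"]         # (1, K) word tokens (padded to K)
    return word_vectors, word_tokens
```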
Step 103, text and image are jointly modeled with the GPT model.
The image coding vectors are flattened and then fed, together with the text coding vectors, into the GPT model for autoregressive training.
In order to establish the mapping between text information and image information, the invention feeds the output of the pre-trained language model Bert, together with the codes quantized by the VQ-VAE, into a GPT model for relational modeling. The basic structure of the GPT model is as follows:
the GPT model consists of a self-orientation block stacked by m layers, wherein the self-orientation block consists of structural units of Multi-Head orientation, Feed Forward and Add & Norm, and m =8 in the invention.
The input of the GPT is obtained by adding the Position Embedding and the Token Embedding, giving the embedded representation u of size f × d, where f = 1280 and d = 256.
The objective function for training the GPT model is as follows:

L(U) = Σ_i log P(u_i | u_{i-c}, …, u_{i-1}; Θ)

where u_{i-c}, …, u_{i-1} are the word vectors in the left context window of size c.
The calculation process of the GPT model is as follows:

h_0 = U W_e + W_p
h_l = transformer_block(h_{l-1}), l = 1, …, n
P(u) = softmax(h_n W_e^T)

where U denotes the input context vectors; when the prediction probability of each token is calculated, only the information within the left window is considered. n denotes the number of self-attention block layers, W_e denotes the word-vector (token embedding) matrix, and W_p denotes the position embedding.
Training: during training, an Adam optimizer is adopted, and the initial learning rate is 0.0003.
Prediction: after text and image have been jointly modeled, a passage of text description is input into the pre-trained neural network language model Bert, the output of Bert is fed into the GPT, and the tokens of the image are generated step by step. The corresponding vectors are looked up from the image tokens in the codebook of the VQ-VAE trained in step 101, and the resulting code e is fed into the Decoder module to output the reconstructed composition template image x'.
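A sketch of this step-by-step prediction loop, assuming the model sketched above and h × w = 1024 image tokens; the multinomial sampling strategy is an assumption, since the patent only states that tokens are predicted one by one:

```python
import torch

@torch.no_grad()
def generate_image_tokens(gpt, text_tokens, n_image_tokens=1024):
    """Autoregressively predict the h*w = 1024 image tokens from the text tokens."""
    seq = text_tokens                                  # (1, k) conditioning prefix
    for _ in range(n_image_tokens):
        logits = gpt(seq)[:, -1]                       # distribution over the next token
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        seq = torch.cat([seq, next_tok], dim=1)
    # codebook lookup + the VQ-VAE Decoder then reconstruct the template image x'
    return seq[:, text_tokens.shape[1]:]               # image tokens only
```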
Step 104, as shown in fig. 4, inputting the composition template image into the style migration GAN network to synthesize the live-action image.
The generation of the live-action picture from the composition template adopts the classical pixel2pixel style migration GAN; pixel2pixel is a cGAN and can complete the conversion from composition template image to live-action image. Pixel2pixel is largely divided into two blocks, a generator and a discriminator. The pixel2pixel generator adopts a U-Net structure: the symmetric skip connections of the U-Net directly copy low-level information onto high-level feature maps, concatenating the i-th layer with the (n-i)-th layer, where n is the total number of network layers. The discriminator combines an L1 loss with a patchGAN discriminator: the L1 loss learns low-frequency information as an aid, and the patchGAN judges at a time whether a single N × N image block is real, averaging the results of all blocks as the judgment for the whole image. In the invention, n = 9 and N = 70.
During training, the generator and the discriminator are trained alternately.
The objective function optimized by pixel2pixel is as follows:

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G)

where

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 - D(x, G(x, z)))],
L_L1(G) = E_{x,y,z}[ || y - G(x, z) ||_1 ],

z is noise, and the network learns the mapping from the template image x to the live-action image y. G denotes the generator and D the discriminator. During training, G and D are trained alternately.
Training: both the generator and the discriminator use the Adam optimizer, with an initial learning rate of 0.0001.
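A minimal sketch of one alternating pixel2pixel training step follows; the conditional discriminator signature D(x, y) and the L1 weight lam are illustrative assumptions, not values or interfaces taken from the patent:

```python
import torch
import torch.nn.functional as F

def pixel2pixel_step(G, D, x, y, g_opt, d_opt, lam=100.0):
    """One alternating step: x = composition template image, y = real live-action image."""
    # --- discriminator step: real pairs vs. generated pairs ---
    d_opt.zero_grad()
    fake = G(x).detach()
    d_real = D(x, y)                  # patchGAN output: one logit per N x N patch
    d_fake = D(x, fake)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_loss.backward(); d_opt.step()
    # --- generator step: fool D + L1 reconstruction term ---
    g_opt.zero_grad()
    fake = G(x)
    g_adv = F.binary_cross_entropy_with_logits(D(x, fake), torch.ones_like(d_fake))
    g_loss = g_adv + lam * F.l1_loss(fake, y)
    g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```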
During prediction, the composition template image x' reconstructed in step 103 is input into the generator to generate the live-action picture y.
Step 105, as shown in fig. 5, the live-action image generates a series of video frames through the image dynamic GAN network.
As shown in fig. 6, according to the input text description, continuous multiframe images are generated by the method of the invention, and dynamic videos are generated.
This embodiment employs a deep-landscape network.
The deep-landscape generator controls the image characteristics through latent vectors and, combined with the noise injected into the network, controls the stochastic attributes of the image. The task of the discriminator is to distinguish whether a picture is real or generated by the generator.
A series of images is generated with the trained deep-landscape network, and the latent code corresponding to each image is stored at the same time. When training the Encoder, the Encoder can be a ResNet network: a picture generated by deep-landscape is input, and the latent code vector of matching size is output.
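A sketch of such a ResNet-based Encoder is given below; the latent dimension and the choice of resnet50 are assumptions, since the patent only states that the Encoder can be a ResNet that maps a generated picture to its latent code:

```python
import torch.nn as nn
from torchvision.models import resnet50

class LatentEncoder(nn.Module):
    """Maps a generated picture to its latent code (latent_dim is an assumption)."""
    def __init__(self, latent_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, latent_dim)
        self.net = backbone

    def forward(self, img):          # img: (B, 3, H, W) generated picture
        return self.net(img)         # predicted latent code

# Training pairs come from the generator itself: (generated image, stored latent code),
# so the Encoder can be fit with a simple regression loss such as MSE.
```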
The live-action picture y generated in step 104 has its static and dynamic areas marked, for example: the blue sky and the sea belong to the static region, and the remaining part belongs to the dynamic region. The latent code and the input noise are then fine-tuned jointly with the picture's composition template map: the latent code output by the Encoder model is optimized according to the reconstruction error, and the corresponding input noise is generated from the marked static/dynamic region information, namely the noise input n_d generated to control the dynamic region and the noise input n_s generated to control the static region.
Training deep-landscape: the generator and the discriminator both adopt an Adam optimizer during training, and the initial learning rate is 0.0001.
Prediction: m dynamic transformations are applied to the noise input n_d that controls dynamic-region generation; together with the previously generated latent code and the static-region noise n_s, they are input into the generator of the trained deep-landscape network to generate m new images. The live-action picture y and the m subsequently generated live-action pictures form the final generated video; m = 200 in the invention.
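An illustrative sketch of this frame-generation loop follows; the generator call signature and the specific noise transformation (a spatial roll plus a small perturbation) are assumptions, since the patent only states that the dynamic-region noise n_d is transformed to simulate motion:

```python
import torch

@torch.no_grad()
def animate(generator, latent_code, static_noise, dynamic_noise, m=200, shift=0.05):
    """Generate m frames by transforming only the dynamic-region noise n_d,
    keeping the latent code and the static-region noise n_s fixed."""
    frames = []
    n_d = dynamic_noise
    for _ in range(m):
        # one simple "dynamic transformation": drift the dynamic noise field
        n_d = torch.roll(n_d, shifts=1, dims=-1) + shift * torch.randn_like(n_d)
        frame = generator(latent_code, noise=[static_noise, n_d])
        frames.append(frame)
    return torch.stack(frames, dim=1)   # (B, m, C, H, W) video tensor
```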
The dynamic background information in the scene image is transformed to generate a dynamic scene video. deep-landscape is mainly divided into two parts, a generator and a discriminator. The generator network comprises two parts: the first part is the Mapping network, which maps a latent variable z to an intermediate latent variable w; the second part is the Synthesis network, in which affine transformations of w control the style of the generated image, while random noise injected into the network enriches the details of the generated image.
The discriminator is a classification network consisting of n convolution modules, the convolution modules carry out convolution and downsampling operations, and n =9 in the invention.
The style of the generated image is controlled by adaptive instance normalization (AdaIN); the specific formula is:

AdaIN(x_i, y) = y_{s,i} (x_i - μ(x_i)) / σ(x_i) + y_{b,i}

where μ(x_i) and σ(x_i) respectively denote the mean and standard deviation of the input feature x_i, and y_{s,i} and y_{b,i} are the scaling and bias values generated by the affine transformation of w; through y_{s,i} and y_{b,i} the style is applied to the i-th spatial feature map.
The latent vectors are divided into a latent vector encoding the colours and overall layout of the scene image and a latent vector encoding the illumination-brightness changes in the generated video; the input noise of the generated video is divided into noise encoding the details and shapes of static objects and noise encoding the details and shapes of dynamic objects.
In summary, the invention builds on basic network models such as VQ-VAE, GPT, pixel2pixel and StyleGAN. VQ-VAE effectively exploits the latent space and can model important features that usually span multiple dimensions of the data space; its discretization property is used to compress and quantize the composition template image into serialized tokens. A GPT autoregressive model is adopted for the generative task: with the pre-trained Bert language model and the serialized tokens of the composition template image, the mapping from language to image is completed, and during prediction a series of tokens corresponding to the composition template image is generated from the language input; the pixel2pixel network then completes the image-to-image translation. On this basis, the invention provides a new way of acquiring material for video creation, reduces the cost of video creation and improves its efficiency.
Corresponding to the embodiment of the method for automatically generating the scene video by the characters, the invention also provides an embodiment of a device for automatically generating the scene video by the characters.
Referring to fig. 7, an apparatus for automatically generating a scene video with text according to an embodiment of the present invention includes one or more processors, and is configured to implement the method for automatically generating a scene video with text in the foregoing embodiment.
The embodiment of the apparatus for automatically generating scene video by text of the invention can be applied to any equipment with data processing capability, and the equipment with data processing capability can be equipment or apparatus such as a computer. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. From a hardware aspect, as shown in fig. 7, the present invention is a hardware structure diagram of any device with data processing capability where the apparatus for automatically generating scene videos by text is located, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 7, in an embodiment, any device with data processing capability where the apparatus is located may also include other hardware according to the actual function of the any device with data processing capability, which is not described again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides a computer readable storage medium, which stores a program, and when the program is executed by a processor, the method for automatically generating the scene video by the characters in the embodiment is realized.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the invention as described in the foregoing examples, or that certain features may be substituted in the same way. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.

Claims (10)

1. A method for automatically generating scene video by characters is characterized by comprising the following steps:
Step one: compressing and quantizing a composition template image through a vector quantization variational autoencoder (VQ-VAE) to generate image coding vectors and image tokens;
Step two: coding the input language description through a pre-trained neural network language model to obtain word vectors and word tokens;
Step three: flattening the image coding vectors from step one, concatenating them with the word vectors from step two, and feeding the result into a GPT (Generative Pre-trained Transformer) model for autoregressive training, thereby establishing and modeling a direct relation between language and composition template images, so that inputting a sentence of language description generates a corresponding composition template image;
Step four: generating a live-action picture from the composition template image generated in step three based on a style migration GAN network;
Step five: generating a subsequent series of image frames from the live-action picture generated in step four based on an image dynamic GAN network, and generating a video.
2. The method for automatically generating scene videos by characters as claimed in claim 1, wherein step one specifically comprises: the composition template image x is sent into a trained vector quantization variational autoencoder (VQ-VAE) and converted into a sequence in a discrete latent space, yielding h × w image coding vectors e of dimension d and the corresponding image tokens.
3. The method of claim 2, wherein the vector quantization variational autoencoder VQ-VAE is mainly divided into three modules, an Encoder, a codebook and a Decoder; the Encoder module encodes the input composition template image, the Decoder decodes it, and the two modules share the codebook module. Specifically, the Encoder encodes the image x into z_e(x); each vector in z_e(x) is quantized according to its Euclidean distance to the codebook vectors, i.e. the nearest vector is found in the codebook by nearest-neighbour lookup, so that z_e(x) is converted into the closest discrete code e; the output is thus h × w image coding vectors e of dimension d and the corresponding tokens, and e is sent into the Decoder module, which decodes it to generate the composition template image x̂.
4. The method of claim 3, wherein the vector quantization variational autoencoder VQ-VAE is trained as follows: semantic segmentation images are adopted as the training data set, and the VQ-VAE training loss is reduced by back-propagation with the Adam stochastic gradient algorithm to obtain the optimal parameters of the model.
5. The method for automatically generating scene videos by characters according to claim 3, wherein step two specifically comprises: coding the input language description through a pre-trained neural network language model to generate k word vectors t and k word tokens.
6. The method for automatically generating scene videos by characters according to claim 5, wherein step three specifically comprises the following steps:
(3.1) flattening the h × w image coding vectors to generate g image coding vectors, where g = h × w is a fixed value, and adding position embeddings to the g image coding vectors;
(3.2) concatenating the k word vectors and the g image coding vectors, i.e. splicing the text vectors t and the image coding vectors e, to generate f = k + g embedded representation vectors u;
(3.4) sending the f vectors u into a GPT model for autoregressive training to establish the relation between the text word vectors and the image coding vectors, wherein the GPT model is trained as follows: the training set comprises the f coding vectors and their corresponding image and word tokens; the f coding vectors are fed into the GPT model, the GPT model predicts the next token from the previously input vectors, and the softmax classification loss is reduced by a stochastic-gradient-descent back-propagation algorithm;
during prediction, the word vectors of the language input are fed into the GPT model, the compression-coding tokens of the composition image are predicted step by step, the generated tokens are looked up in the codebook to obtain the corresponding vectors, and the resulting compression-coding vector map is sent into the Decoder module to generate a composition template image x'.
7. The method for automatically generating scene videos by characters according to claim 6, wherein step four specifically comprises: sending the composition template image x' generated in step three and random noise into the style migration GAN network, wherein the style migration GAN network comprises a generator and a discriminator, the generator uses the noise injected into the network to control the stochastic attributes of the live-action picture, and the discriminator distinguishes whether a live-action picture is real or generated by the generator; during network training the generator and the discriminator are trained jointly, and during prediction the composition template image x' generated in step three is input into the generator, which outputs the generated live-action picture y.
8. The method for automatically generating scene videos by characters as claimed in claim 7, wherein step five specifically comprises the following steps:
(5.1) training an image dynamic GAN network, the network being mainly divided into two modules: the generator controls the video characteristics through latent vectors and, combined with the noise injected into the network, controls the stochastic attributes of the video; the discriminator distinguishes whether a video is real or generated by the generator; the latent vectors are divided into two groups: latent vectors encoding colour and image layout, and latent vectors encoding the overall brightness of the video; the noise injected into the network is also divided into two groups: noise encoding the details and shapes of static entities in the video, and noise encoding the details and shapes of dynamic entities in the video;
(5.2) training an Encoder model, which takes a live-action picture as input and maps it to the latent space of the GAN network; the live-action picture y output in step four is sent into the trained Encoder model to obtain the latent code corresponding to the live-action picture;
(5.3) marking the static and dynamic regions in the image with the composition template map x' corresponding to the live-action picture y, optimizing the latent code output by the Encoder model according to the reconstruction error, and generating the corresponding input noise from the marked static/dynamic region information, namely the noise input n_d generated to control the dynamic region and the noise input n_s generated to control the static region;
(5.4) applying m dynamic transformations to the noise input n_d that controls dynamic-region generation and, together with the previously generated latent code and the static-region noise n_s, inputting them into the generator of the pre-trained image dynamic GAN model to output m image frames; the original live-action picture and the generated m frames form a dynamic scene video of m + 1 frames.
9. The method of claim 8, wherein the dynamic transformation in step (5.4) specifically comprises: transforming the noise input n_d that controls dynamic-region generation so as to simulate the motion of the dynamic regions.
10. A system for automatically generating a scene video from text as claimed in any one of claims 1 to 9, comprising:
the composition logic generation module is used for generating a composition template image of a composition according to the input text description;
the image content generation module inputs the composition template image generated by the composition logic generation module and outputs the composition template image as a rendered live-action image;
and the image dynamic module is used for converting the live-action image output by the image content generation module into a continuous multi-frame image to generate a dynamic video.
CN202111538104.4A 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters Active CN113934890B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111538104.4A CN113934890B (en) 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111538104.4A CN113934890B (en) 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters

Publications (2)

Publication Number Publication Date
CN113934890A true CN113934890A (en) 2022-01-14
CN113934890B CN113934890B (en) 2022-04-15

Family

ID=79289156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111538104.4A Active CN113934890B (en) 2021-12-16 2021-12-16 Method and system for automatically generating scene video by characters

Country Status (1)

Country Link
CN (1) CN113934890B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610935A (en) * 2022-05-12 2022-06-10 之江实验室 Method and system for synthesizing semantic image of text control image style
CN115249062A (en) * 2022-09-22 2022-10-28 武汉大学 Network model, method and device for generating video by text
CN115511969A (en) * 2022-11-22 2022-12-23 阿里巴巴(中国)有限公司 Image processing and data rendering method, apparatus and medium
CN115880158A (en) * 2023-01-30 2023-03-31 西安邮电大学 Blind image super-resolution reconstruction method and system based on variational self-coding
WO2023154192A1 (en) * 2022-02-14 2023-08-17 Snap Inc. Video synthesis via multimodal conditioning
CN117496025A (en) * 2023-10-19 2024-02-02 四川大学 Multi-mode scene generation method based on relation and style perception

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 variational self-encoder and video generation method combining generation countermeasure network
US20200019863A1 (en) * 2018-07-12 2020-01-16 International Business Machines Corporation Generative Adversarial Network Based Modeling of Text for Natural Language Processing
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200019863A1 (en) * 2018-07-12 2020-01-16 International Business Machines Corporation Generative Adversarial Network Based Modeling of Text for Natural Language Processing
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 variational self-encoder and video generation method combining generation countermeasure network
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WILSON YAN et al.: "VideoGPT: Video Generation using VQ-VAE and Transformers", arXiv *
ZHUANG Xingwang et al.: "Text-to-Image Generation Model with Multi-dimensional Attention and Semantic Regeneration", Computer Technology and Development *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023154192A1 (en) * 2022-02-14 2023-08-17 Snap Inc. Video synthesis via multimodal conditioning
CN114610935A (en) * 2022-05-12 2022-06-10 之江实验室 Method and system for synthesizing semantic image of text control image style
CN115249062A (en) * 2022-09-22 2022-10-28 武汉大学 Network model, method and device for generating video by text
CN115249062B (en) * 2022-09-22 2023-02-03 武汉大学 Network model, method and device for generating video by text
CN115511969A (en) * 2022-11-22 2022-12-23 阿里巴巴(中国)有限公司 Image processing and data rendering method, apparatus and medium
CN115880158A (en) * 2023-01-30 2023-03-31 西安邮电大学 Blind image super-resolution reconstruction method and system based on variational self-coding
CN115880158B (en) * 2023-01-30 2023-10-27 西安邮电大学 Blind image super-resolution reconstruction method and system based on variation self-coding
CN117496025A (en) * 2023-10-19 2024-02-02 四川大学 Multi-mode scene generation method based on relation and style perception
CN117496025B (en) * 2023-10-19 2024-06-04 四川大学 Multi-mode scene generation method based on relation and style perception

Also Published As

Publication number Publication date
CN113934890B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN113934890B (en) Method and system for automatically generating scene video by characters
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
Wu et al. Nüwa: Visual synthesis pre-training for neural visual world creation
WO2024051445A1 (en) Image generation method and related device
CN113901894A (en) Video generation method, device, server and storage medium
CN114610935B (en) Method and system for synthesizing semantic image of text control image style
CN109996073B (en) Image compression method, system, readable storage medium and computer equipment
CN114390218B (en) Video generation method, device, computer equipment and storage medium
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN116205820A (en) Image enhancement method, target identification method, device and medium
CN113781324A (en) Old photo repairing method
CN115880762A (en) Scalable human face image coding method and system for human-computer mixed vision
WO2023068953A1 (en) Attention-based method for deep point cloud compression
US20230319223A1 (en) Method and system for deep learning based face swapping with multiple encoders
Lee et al. A Brief Survey of text driven image generation and maniulation
CN117499711A (en) Training method, device, equipment and storage medium of video generation model
CN117115713A (en) Dynamic image generation method, device, equipment and medium thereof
CN114283181B (en) Dynamic texture migration method and system based on sample
CN113780209B (en) Attention mechanism-based human face attribute editing method
US20230316587A1 (en) Method and system for latent-space facial feature editing in deep learning based face swapping
CN113781376B (en) High-definition face attribute editing method based on divide-and-congress
CN113411615B (en) Virtual reality-oriented latitude self-adaptive panoramic image coding method
Teng et al. Blind face restoration via multi-prior collaboration and adaptive feature fusion
Wang et al. Facial Landmarks and Generative Priors Guided Blind Face Restoration
CN118551074B (en) Cross-modal music generation method and device for video soundtrack

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant