WO2023070448A1 - Video processing method and apparatus, electronic device, and readable storage medium
- Publication number: WO2023070448A1
- Application number: PCT/CN2021/127079
- Authority: WIPO (PCT)
- Prior art keywords: data, output, video, input, dimensional
- Prior art date: 2021-10-28
Classifications
- G06T3/4046: Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
- G06T3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06T7/74: Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
- H04N21/2343: Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- G06T2207/10016: Video; image sequence
- G06T2207/20081: Training; learning
- G06T2207/20084: Artificial neural networks [ANN]
Definitions
- Embodiments of the present disclosure relate to the technical field of image processing, and in particular to a video processing method and apparatus, an electronic device, and a readable storage medium.
- Image processing technology can generate images that meet specific requirements through model training; for example, a model can be trained iteratively to generate images of higher resolution or different sizes from an original image.
- In a first aspect, an embodiment of the present disclosure provides a video processing method, including the following steps: acquiring input data, and inputting the input data into a video processing model to obtain output video data;
- the input data includes picture data and/or video data;
- the video processing model includes a plurality of generators arranged in sequence and corresponding to different image resolutions; each generator includes a transposed three-dimensional convolution unit and a plurality of first three-dimensional convolutional layers, the transposed three-dimensional convolution unit is used to generate first output data according to the input data and intermediate processing data of the generator, the output video data is obtained from the first output data, and the intermediate processing data is obtained by inputting the input data into the plurality of first three-dimensional convolutional layers.
- the generator further includes a second three-dimensional convolutional layer, the second three-dimensional convolutional layer is used to adjust the first output data to obtain second output data, and the stride of at least one dimension of the second three-dimensional convolutional layer is larger than the stride of the transposed three-dimensional convolution unit.
- the second three-dimensional convolutional layer has a first stride corresponding to the time dimension and a second stride corresponding to the output size dimension, and each transposed three-dimensional convolution unit has a third stride corresponding to the time dimension and a fourth stride corresponding to the output size dimension.
- the ratio of the first stride to the third stride is not equal to the ratio of the second stride to the fourth stride.
- the number of the first three-dimensional convolutional layers is four.
- each of the first three-dimensional convolutional layers includes a three-dimensional convolutional unit and a residual layer arranged in sequence.
- multiple first images correspond to the same second image, wherein the multiple first images are multiple different frames in the input data, and the second image is one frame in the output video data.
- the video processing model is obtained through model training with a generative adversarial network, and the generative adversarial network includes the generator and a discriminator;
- the generator is a model with a low-resolution image as input and a high-resolution video sequence as output;
- the discriminator is a model that takes an image as an input and outputs a discrimination result for the image
- the loss function for model training is determined from the adversarial loss between the generator and the discriminator, and the reconstruction loss between the generator's input and output.
- when the input data includes video data, before the input data is input into the video processing model to obtain output video data, the method further includes: dividing the input data into multiple video sequences at preset time intervals;
- extracting three-dimensional patches from the video sequences, wherein each pixel in a video sequence exists in at least one of the three-dimensional patches, and at least some pixels are located in multiple three-dimensional patches;
- using the three-dimensional patches as the input of the video processing model.
- In a second aspect, an embodiment of the present disclosure provides a video processing device, including:
- An input data acquisition module configured to acquire input data, wherein the input data includes picture data and/or video data;
- An input module configured to input the input data into a video processing model to obtain output video data
- the video processing model includes a plurality of generators arranged in sequence and corresponding to different image resolutions; each generator includes a transposed three-dimensional convolution unit and a plurality of first three-dimensional convolutional layers, the transposed three-dimensional convolution unit is used to generate first output data according to the input data and intermediate processing data of the generator, the output video data is obtained from the first output data, and the intermediate processing data is obtained by inputting the input data into the plurality of first three-dimensional convolutional layers.
- the generator further includes a second three-dimensional convolutional layer, the second three-dimensional convolutional layer is used to adjust the first output data to obtain second output data, and the stride of at least one dimension of the second three-dimensional convolutional layer is larger than the stride of the transposed three-dimensional convolution unit.
- the second three-dimensional convolutional layer has a first stride corresponding to the time dimension and a second stride corresponding to the output size dimension, and each transposed three-dimensional convolution unit has a third stride corresponding to the time dimension and a fourth stride corresponding to the output size dimension.
- the ratio of the first stride to the third stride is not equal to the ratio of the second stride to the fourth stride.
- the number of the first three-dimensional convolutional layers is four.
- each of the first three-dimensional convolutional layers includes a three-dimensional convolutional unit and a residual layer arranged in sequence.
- multiple first images correspond to the same second image, wherein the multiple first images are multiple different frames in the input data, and the second image is one frame in the output video data.
- the video processing model is obtained through model training with a generative adversarial network, and the generative adversarial network includes the generator and a discriminator;
- the generator is a model with a low-resolution image as input and a high-resolution video sequence as output;
- the discriminator is a model that takes an image as an input and outputs a discrimination result for the image
- the loss function for model training is determined from the adversarial loss between the generator and the discriminator, and the reconstruction loss between the generator's input and output.
- when the input data includes video data, the device further includes:
- a video sequence division module configured to divide the input data into multiple video sequences at preset time intervals;
- a three-dimensional patch extraction module configured to extract three-dimensional patches from the video sequences, wherein each pixel in a video sequence exists in at least one of the three-dimensional patches, and at least some pixels are located in multiple three-dimensional patches;
- an input data determination module configured to use the three-dimensional patches as the input of the video processing model.
- In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a memory, a processor, and a program stored in the memory and executable on the processor; the processor is configured to read the program in the memory to implement the steps of the method described in the aforementioned first aspect.
- In a fourth aspect, an embodiment of the present disclosure provides a readable storage medium for storing a program, wherein when the program is executed by a processor, the steps of the method described in the aforementioned first aspect are implemented.
- Fig. 1 is a schematic flowchart of a video processing method provided by an embodiment of the present disclosure;
- Fig. 2 is a schematic diagram of a convolutional neural network model provided by an embodiment of the present disclosure;
- Fig. 3 is a schematic diagram of training a generative adversarial network provided by an embodiment of the present disclosure;
- Fig. 4 is a schematic structural diagram of a generator in an embodiment of the present disclosure;
- Fig. 5 is another schematic diagram of training a generative adversarial network provided by an embodiment of the present disclosure;
- Fig. 6 is a schematic structural diagram of another generator in an embodiment of the present disclosure;
- Fig. 7 is a schematic structural diagram of a video sequence in an embodiment of the present disclosure;
- Fig. 8 is a schematic structural diagram of a three-dimensional patch in an embodiment of the present disclosure;
- Fig. 9 is a schematic diagram of an application scenario in an embodiment of the present disclosure;
- Fig. 10 is a schematic structural diagram of a video processing device provided by an embodiment of the present disclosure;
- Fig. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
- The terms "first", "second" and the like in the embodiments of the present disclosure are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.
- The terms "comprising" and "having", as well as any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device comprising a sequence of steps or elements is not necessarily limited to the steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such a process, method, product, or apparatus.
- the embodiment of the present disclosure provides a video processing method.
- the video processing method includes the following steps:
- Step 101: Obtain input data.
- the input data in this embodiment includes picture data and/or video data.
- Step 102: Input the input data into a video processing model to obtain output video data.
- the video processing model is obtained through model training, and the video processing model may be a convolutional neural network model.
- a Convolutional Neural Network is a neural network structure that uses images as input/output and replaces scalar weights with filters (convolutions).
- Figure 2 exemplarily shows a 3-layer convolutional neural network.
- This convolutional neural network takes 4 input images on the left, has 3 units (output images) in the hidden layer in the middle, and 2 units in the output layer, producing 2 output images.
- each weight box corresponds to a filter (for example, a 3x3x3 or 5x5x5 kernel), where the superscript of each parameter indicates the input layer number, and the subscripts are the labels of the input and output units in turn.
- the bias b is a scalar added to the output of the convolution.
- the result of summing several convolutions and the bias is then passed through an activation box, which typically corresponds to a rectified linear unit (ReLU), a sigmoid function, or a hyperbolic tangent. The filters and biases are fixed during system operation; they are obtained through a training process on a set of input/output example images and adjusted to meet optimization criteria that depend on the application.
- a convolutional neural network with 3 layers is called a shallow convolutional neural network
- a convolutional neural network with more than 5 layers is usually called a deep convolutional neural network.
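As an illustration of this structure, the following PyTorch sketch builds the small network of Figure 2 described above, treating the 4 input images as channels, the 3 hidden units as intermediate feature maps, and the 2 output units as output images; the kernel size, padding, and activation placement are assumptions for the example, not values fixed by the patent.

```python
import torch
import torch.nn as nn

# Minimal sketch of the 3-layer CNN of Figure 2: 4 input images (channels),
# 3 hidden units, 2 output images. Each Conv2d holds the filters (weight
# boxes) and scalar biases b; ReLU is the activation box.
shallow_cnn = nn.Sequential(
    nn.Conv2d(in_channels=4, out_channels=3, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, padding=1),
)

x = torch.randn(1, 4, 64, 64)  # 4 input images stacked as channels
y = shallow_cnn(x)             # -> torch.Size([1, 2, 64, 64]): 2 output images
```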
- the video processing model is obtained through model training using a Generative Adversarial Network (GAN).
- in the training process of a GAN, the generator produces output data from input data; the generator's output is labeled fake (Fake), while real data that satisfies the training target is labeled real (Real). The discriminator is used to distinguish the generator's output from the real data, and the parameters of the generator are adjusted according to the discrimination result. The generator and the discriminator are trained alternately according to an established loss function until the loss function converges or a set number of iterations is reached; at that point the training of the GAN is complete, and the trained generator is used as the trained model.
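The alternation described above can be sketched as follows; the generator, discriminator, data loader, and loss callables here are placeholders for illustration, not APIs defined by the patent.

```python
import torch

def train_gan(generator, discriminator, data_loader, adv_loss, rec_loss,
              num_iters=10_000, lr=5e-4):
    """Alternate discriminator and generator updates until the iteration
    budget is spent (convergence checks omitted for brevity)."""
    g_opt = torch.optim.Adam(generator.parameters(), lr=lr)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=lr)
    for _, (noise, real) in zip(range(num_iters), data_loader):
        # Discriminator step: real samples are labeled Real,
        # generator outputs are labeled Fake.
        fake = generator(noise).detach()
        d_loss = (adv_loss(discriminator(real), target_is_real=True)
                  + adv_loss(discriminator(fake), target_is_real=False))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # Generator step: fool the discriminator and reconstruct the target.
        fake = generator(noise)
        g_loss = (adv_loss(discriminator(fake), target_is_real=True)
                  + rec_loss(fake, real))
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
```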
- the generative adversarial network includes a generator and a discriminator;
- the generator is a model that takes a low-resolution image as input and outputs a high-resolution video sequence;
- the discriminator is a model that takes an image as input and outputs a discrimination result for that image;
- the loss function for model training includes the adversarial loss Ladv between the generator and the discriminator, and the reconstruction loss Lrec between the generator's input and output.
- the generator is used to remove noise from an image to which a noise map zn has been added, so as to obtain an output result; the output result is compared with the real image to construct the loss function for model training.
- the constructed loss function is given by formula (1), in which the coefficient is a preset weight, Gn denotes the nth generator, Dn denotes the nth discriminator, and n ranges from 0 to N;
- the first line of the reconstruction loss Lrec applies when n is not equal to N, and the second line applies when n is equal to N;
- xn is the corresponding real result, and z is the noise map.
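Formula (1) itself did not survive extraction here. A SinGAN-style objective consistent with the definitions above would read as follows; this reconstruction is an assumption based on the surrounding text, not the patent's verbatim formula, with lambda standing for the preset coefficient and the up-arrow denoting the upsampled reconstruction from the previous (coarser) scale:

```latex
\min_{G_n}\max_{D_n}\;
\mathcal{L}_{\mathrm{adv}}(G_n, D_n) + \lambda\,\mathcal{L}_{\mathrm{rec}}(G_n),
\qquad
\mathcal{L}_{\mathrm{rec}} =
\begin{cases}
\bigl\lVert G_n\bigl(0,\ \tilde{x}_{n+1}\!\uparrow\bigr) - x_n \bigr\rVert^2, & n \neq N,\\[4pt]
\bigl\lVert G_N(z) - x_N \bigr\rVert^2, & n = N.
\end{cases}
\tag{1}
```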
- the video processing model includes multiple generators that are arranged in sequence and correspond to different image resolutions, as shown in Figures 3 and 5.
- at each stage, the output is scaled up, higher-resolution noise is added, and a new generator is learned to create higher-resolution images;
- when the resolution of the output equals the resolution of the real image corresponding to the training target, model training is complete.
- the noise inputs corresponding to different resolutions are independent and identically distributed samples, so each pixel value is independent of other pixel values.
- the resolution of the noise input can be changed to generate images of arbitrarily different resolutions.
- each generator G includes a transposed three-dimensional convolution unit 602 and a plurality of first three-dimensional convolutional layers 601; the transposed three-dimensional convolution unit 602 is used to generate first output data according to the input data and intermediate processing data of the generator G, the output video data is obtained from the first output data, and the intermediate processing data is obtained by inputting the input data into the multiple first three-dimensional convolutional layers 601.
- each first three-dimensional convolutional layer 601 includes a three-dimensional convolution unit 6011 and a residual layer 6012 arranged in sequence.
- the existing pyramid-structured generative adversarial network (SinGAN) is only suitable for adjusting two-dimensional images.
- the transposed three-dimensional convolution unit 602 maintains a one-to-many mapping, which increases the dimensionality of the data that can be processed, so that three-dimensional data can be handled; that is, an image can be output as a video file.
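A minimal sketch of one such generator follows; the patent fixes the overall structure (first 3D convolutional layers, each a 3D convolution unit plus a residual layer, feeding a transposed 3D convolution unit), while the channel counts, kernel sizes, and strides here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualConv3d(nn.Module):
    """One 'first 3D convolutional layer': a 3D convolution unit (6011)
    followed by a residual connection (6012)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.act(self.conv(x))

class Generator(nn.Module):
    def __init__(self, channels: int = 3, width: int = 32):
        super().__init__()
        self.head = nn.Conv3d(channels, width, kernel_size=3, padding=1)
        # Four first 3D convolutional layers (601), as in the embodiment.
        self.body = nn.Sequential(*[ResidualConv3d(width) for _ in range(4)])
        # Transposed 3D convolution unit (602): a one-to-many mapping that
        # upsamples T, H, W by its strides (kernel 4, stride 2, padding 1
        # doubles each dimension).
        self.up = nn.ConvTranspose3d(width, channels, kernel_size=4,
                                     stride=(2, 2, 2), padding=1)

    def forward(self, x):                 # x: (batch, C, T, H, W)
        feat = self.head(x)               # features of the input data
        mid = self.body(feat)             # intermediate processing data
        return self.up(mid + feat)        # first output data
```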
- the generator G further includes a second three-dimensional convolutional layer 603, which is used to adjust the first output data to obtain second output data; the stride of at least one dimension of the second three-dimensional convolutional layer 603 is larger than the stride of the transposed three-dimensional convolution unit 602.
- the stride of each dimension of the second 3D convolutional layer 603 is greater than the stride of the transposed 3D convolution unit 602.
- by controlling the stride of the second 3D convolutional layer 603 to be larger than the stride of the transposed 3D convolution unit 602, the second 3D convolutional layer 603 can be used to perform an upsampling operation, thereby realizing enlargement of the image size or compression of the image in time.
- adjusting the first output data specifically refers to adjusting its duration or resolution, for example adjusting the length or width of the first output data, or compressing the duration of the first video; the adjusted first output data is used as the second output data.
- the second three-dimensional convolutional layer 603 is added so that, by setting the strides of its dimensions, adjustment of the time or space dimension is realized.
- the second three-dimensional convolutional layer 603 has a first stride corresponding to the time dimension and a second stride corresponding to the output size dimension, and the transposed three-dimensional convolution unit 602 has a third stride corresponding to the time dimension and a fourth stride corresponding to the output size dimension.
- for illustration, let the first stride of the second three-dimensional convolutional layer 603 corresponding to the time dimension T be A, its second strides corresponding to the output size dimensions H and W be B and C, the third stride of the transposed three-dimensional convolution unit 602 corresponding to the time dimension T be X, and its fourth strides corresponding to the output size dimensions H and W be Y and Z.
- the time dimension T corresponds to the duration of the video data, and the output size dimensions H and W correspond to the width and height of the video data, representing its resolution.
- for example, the output result of the transposed three-dimensional convolution unit 602 can be scaled by a factor of 1.5, that is, by a ratio of 3/2.
- the ratio of the first stride to the third stride may be equal or unequal to the ratio of the second stride to the fourth stride; that is, the ratio of A to X may or may not equal the ratio of B to Y. In this way, the time dimension and the space dimension can be scaled by different factors, as the sketch below illustrates.
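The net effect of pairing the two strides can be checked numerically. In this sketch (all strides and sizes are illustrative assumptions), the transposed 3D convolution upsamples each dimension by its stride and the following strided 3D convolution downsamples by its own, so the net scale per dimension is the ratio of the two: time shrinks by 2/3 while height and width grow by 3/2.

```python
import torch
import torch.nn as nn

# Transposed 3D convolution: upsamples T by 2 and H, W by 3.
up = nn.ConvTranspose3d(8, 8, kernel_size=(2, 3, 3), stride=(2, 3, 3))
# Second 3D convolutional layer: strides (3, 2, 2) downsample T by 3, H and W by 2.
down = nn.Conv3d(8, 8, kernel_size=(3, 2, 2), stride=(3, 2, 2))

x = torch.randn(1, 8, 12, 16, 16)  # (batch, channels, T, H, W)
y = down(up(x))
print(y.shape)  # torch.Size([1, 8, 8, 24, 24]): T x 2/3, H and W x 3/2
```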
- textures that are high-frequency in space may be stationary, meaning they have very low frequency in the temporal dimension; such content can be compressed along the time dimension, and because its temporal frequency is low, no useful information is lost.
- in this way, adaptability is improved to meet display requirements of different sizes.
- by adjusting the spatial dimension of the input data, the duration of the output video data can be controlled while avoiding information loss, so that specific content can meet the playback duration requirements of different scenarios, improving the processing effect on video data and serving a wider range of needs.
- when the time dimension and the space dimension are adjusted in different proportions, more scenarios can be handled, fully meeting requirements on both video duration and video size.
- when the input data includes video data, before the input data is input into the video processing model to obtain output video data, the method further includes: dividing the input data into multiple video sequences at preset time intervals;
- extracting three-dimensional patches from the video sequences, wherein each pixel in a video sequence exists in at least one of the three-dimensional patches, and at least some pixels are located in multiple three-dimensional patches;
- using the three-dimensional patches as the input of the video processing model.
- the input data is converted into three-dimensional patches (3D-Patches).
- Figure 7 shows a schematic structural diagram of a video sequence, which includes a time dimension T, corresponding to the duration of the video sequence, and two spatial dimensions H and W, corresponding to the width and height of the video sequence respectively.
- the 3D patches in the video sequence are extracted. Since each pixel in the video sequence exists in at least one 3D patch, the set of all 3D patches contains all the information in the corresponding video sequence. By adjusting the size of the three-dimensional patches, the processing requirements of the device's processor can be met, the data dimensionality reduced, and the processor's capability fully utilized.
- the 3D patch, after being processed by the first 3D convolutional layers and the transposed 3D convolution unit, can be understood as the above-mentioned first output data.
- based on the relative positional relationships between the 3D patches, they can be synthesized into continuous video data, and the resulting video data can be understood as the above-mentioned output video data.
- a second three-dimensional convolution layer can also be set to process the first output data to obtain the second output data, and generate output video data according to the second output data.
- at least some pixels are located in multiple 3D patches; that is, there is a certain overlap between the 3D patches, which reduces blocking artifacts. Since the time and size dimensions of the 3D patches are adjustable, dividing the video data into three-dimensional patches supports processing video data of any duration and size.
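A sketch of overlapping 3D patch extraction using `torch.Tensor.unfold`, assuming a (channels, T, H, W) video tensor and illustrative patch and stride sizes; strides smaller than the patch size produce the overlap described above.

```python
import torch

def extract_3d_patches(video: torch.Tensor,
                       patch=(8, 32, 32), stride=(4, 16, 16)) -> torch.Tensor:
    """Slide a (pT, pH, pW) window over a (C, T, H, W) video; overlapping
    windows ensure every pixel lands in at least one patch and boundary
    pixels land in several, reducing blocking artifacts."""
    c = video.shape[0]
    patches = (video
               .unfold(1, patch[0], stride[0])   # slide over T
               .unfold(2, patch[1], stride[1])   # slide over H
               .unfold(3, patch[2], stride[2]))  # slide over W
    # (C, nT, nH, nW, pT, pH, pW) -> (N, C, pT, pH, pW)
    return patches.reshape(c, -1, *patch).transpose(0, 1)

video = torch.randn(3, 16, 64, 64)
print(extract_3d_patches(video).shape)  # torch.Size([27, 3, 8, 32, 32])
```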
- multiple first images correspond to the same second image, wherein the multiple first images are multiple different frames in the input data, and the second image is one frame in the output video data.
- the technical solution of this embodiment can also be used to adjust the size or duration of video data. As shown in Figure 9, in one embodiment the technical solution is used to compress a video sequence: for video data originally 10 seconds (sec) long, the video can be compressed to 5 seconds by adjusting the time-scale parameter. Two cars originally displayed in two different frames are then displayed in the same frame.
- the contents of multiple frames of the input data are displayed in a single frame of the output video data, so the duration of the video is compressed without losing the content to be displayed.
- the embodiment of the present disclosure also provides a video processing device.
- the video processing device 1000 includes:
- the input data acquisition module 1001 is configured to acquire input data, wherein the input data includes picture data and/or video data;
- An input module 1002 configured to input the input data into a video processing model to obtain output video data
- the video processing model includes a plurality of generators arranged in sequence and corresponding to different image resolutions; each generator includes a transposed three-dimensional convolution unit and a plurality of first three-dimensional convolutional layers, the transposed three-dimensional convolution unit is used to generate first output data according to the input data and intermediate processing data of the generator, the output video data is obtained from the first output data, and the intermediate processing data is obtained by inputting the input data into the plurality of first three-dimensional convolutional layers.
- the generator further includes a second three-dimensional convolutional layer, the second three-dimensional convolutional layer is used to adjust the first output data to obtain second output data, and the stride of at least one dimension of the second three-dimensional convolutional layer is larger than the stride of the transposed three-dimensional convolution unit.
- the second three-dimensional convolutional layer has a first stride corresponding to the time dimension and a second stride corresponding to the output size dimension, and each transposed three-dimensional convolution unit has a third stride corresponding to the time dimension and a fourth stride corresponding to the output size dimension.
- the ratio of the first stride to the third stride is not equal to the ratio of the second stride to the fourth stride.
- the number of the first three-dimensional convolutional layers is four.
- each of the first three-dimensional convolutional layers includes a three-dimensional convolutional unit and a residual layer arranged in sequence.
- multiple first images correspond to the same second image, wherein the multiple first images are multiple different frames in the input data, and the second image is one frame in the output video data.
- the video processing model is obtained through model training with a generative adversarial network, and the generative adversarial network includes the generator and a discriminator;
- the generator is a model with a low-resolution image as input and a high-resolution video sequence as output;
- the discriminator is a model that takes an image as an input and outputs a discrimination result for the image
- the loss function for model training is determined from the adversarial loss between the generator and the discriminator, and the reconstruction loss between the generator's input and output.
- when the input data includes video data, the device further includes:
- a video sequence division module configured to divide the input data into multiple video sequences at preset time intervals;
- a three-dimensional patch extraction module configured to extract three-dimensional patches from the video sequences, wherein each pixel in a video sequence exists in at least one of the three-dimensional patches, and at least some pixels are located in multiple three-dimensional patches;
- an input data determination module configured to use the three-dimensional patches as the input of the video processing model.
- the video processing apparatus 1000 of this embodiment can implement each step of the above-mentioned video processing method embodiment, and can achieve basically the same technical effect, which will not be repeated here.
- the embodiment of the present disclosure also provides an electronic device.
- the electronic device may include a processor 1101 , a memory 1102 and a program 11021 stored in the memory 1102 and executable on the processor 1101 .
- When the program 11021 is executed by the processor 1101, any step of the method embodiment corresponding to Fig. 1 can be implemented with the same beneficial effects, which will not be repeated here.
- When the electronic device is a network-side device, the program 11021, when executed by the processor 1101, can likewise implement any step of the method embodiment corresponding to Fig. 1 with the same beneficial effects; details are not repeated here.
- An embodiment of the present disclosure further provides a readable storage medium on which a computer program is stored; when the computer program is executed by a processor, any step of the above method embodiment corresponding to Fig. 1 can be implemented with the same technical effect, which is not repeated here to avoid repetition.
- the storage medium is, for example, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disk, and the like.
Claims (20)
- 1. A video processing method, comprising the following steps: acquiring input data, wherein the input data includes picture data and/or video data; and inputting the input data into a video processing model to obtain output video data; wherein the resolution and/or duration of the output video data are not equal to those of the input data; the video processing model includes a plurality of generators arranged in sequence and corresponding to different image resolutions; each generator includes a transposed three-dimensional convolution unit and a plurality of first three-dimensional convolutional layers; the transposed three-dimensional convolution unit is configured to generate first output data according to the input data and intermediate processing data of the generator; the output video data is obtained from the first output data; and the intermediate processing data is obtained by inputting the input data into the plurality of first three-dimensional convolutional layers.
- 2. The method according to claim 1, wherein the generator further includes a second three-dimensional convolutional layer configured to adjust the first output data to obtain second output data, and the stride of at least one dimension of the second three-dimensional convolutional layer is greater than the stride of the transposed three-dimensional convolution unit.
- 3. The method according to claim 2, wherein the second three-dimensional convolutional layer has a first stride corresponding to the time dimension and a second stride corresponding to the output size dimension, and each transposed three-dimensional convolution unit has a third stride corresponding to the time dimension and a fourth stride corresponding to the output size dimension.
- 4. The method according to claim 3, wherein the ratio of the first stride to the third stride is not equal to the ratio of the second stride to the fourth stride.
- 5. The method according to claim 1, wherein the number of the first three-dimensional convolutional layers is four.
- 6. The method according to claim 5, wherein each first three-dimensional convolutional layer includes a three-dimensional convolution unit and a residual layer arranged in sequence.
- 7. The method according to claim 1, wherein, in a case where the input data includes video data and the duration of the output video data is not equal to that of the input data, multiple first images correspond to the same second image, the multiple first images being multiple different frames in the input data, and the second image being one frame in the output video data.
- 8. The method according to claim 1, wherein the video processing model is obtained through model training with a generative adversarial network, the generative adversarial network including the generator and a discriminator; wherein the generator is a model that takes a low-resolution image as input and outputs a high-resolution video sequence; the discriminator is a model that takes an image as input and outputs a discrimination result for the image; and the loss function for model training is determined from the adversarial loss between the generator and the discriminator and the reconstruction loss between the generator's input and output.
- 9. The method according to any one of claims 1 to 8, wherein, in a case where the input data includes video data, before inputting the input data into the video processing model to obtain output video data, the method further includes: dividing the input data into multiple video sequences at preset time intervals; extracting three-dimensional patches from the video sequences, wherein each pixel in a video sequence exists in at least one of the three-dimensional patches, and at least some pixels are located in multiple three-dimensional patches; and using the three-dimensional patches as the input of the video processing model.
- 10. A video processing device, comprising: an input data acquisition module configured to acquire input data, wherein the input data includes picture data and/or video data; and an input module configured to input the input data into a video processing model to obtain output video data; wherein the video processing model includes a plurality of generators arranged in sequence and corresponding to different image resolutions; each generator includes a transposed three-dimensional convolution unit and a plurality of first three-dimensional convolutional layers; the transposed three-dimensional convolution unit is configured to generate first output data according to the input data and intermediate processing data of the generator; the output video data is obtained from the first output data; and the intermediate processing data is obtained by inputting the input data into the plurality of first three-dimensional convolutional layers.
- 11. The device according to claim 10, wherein the generator further includes a second three-dimensional convolutional layer configured to adjust the first output data to obtain second output data, and the stride of at least one dimension of the second three-dimensional convolutional layer is greater than the stride of the transposed three-dimensional convolution unit.
- 12. The device according to claim 11, wherein the second three-dimensional convolutional layer has a first stride corresponding to the time dimension and a second stride corresponding to the output size dimension, and each transposed three-dimensional convolution unit has a third stride corresponding to the time dimension and a fourth stride corresponding to the output size dimension.
- 13. The device according to claim 12, wherein the ratio of the first stride to the third stride is not equal to the ratio of the second stride to the fourth stride.
- 14. The device according to claim 10, wherein the number of the first three-dimensional convolutional layers is four.
- 15. The device according to claim 14, wherein each first three-dimensional convolutional layer includes a three-dimensional convolution unit and a residual layer arranged in sequence.
- 16. The device according to claim 10, wherein, in a case where the input data includes video data and the duration of the output video data is not equal to that of the input data, multiple first images correspond to the same second image, the multiple first images being multiple different frames in the input data, and the second image being one frame in the output video data.
- 17. The device according to claim 10, wherein the video processing model is obtained through model training with a generative adversarial network, the generative adversarial network including the generator and a discriminator; wherein the generator is a model that takes a low-resolution image as input and outputs a high-resolution video sequence; the discriminator is a model that takes an image as input and outputs a discrimination result for the image; and the loss function for model training is determined from the adversarial loss between the generator and the discriminator and the reconstruction loss between the generator's input and output.
- 18. The device according to any one of claims 10 to 17, wherein, in a case where the input data includes video data, the device further includes: a video sequence division module configured to divide the input data into multiple video sequences at preset time intervals; a three-dimensional patch extraction module configured to extract three-dimensional patches from the video sequences, wherein each pixel in a video sequence exists in at least one of the three-dimensional patches, and at least some pixels are located in multiple three-dimensional patches; and an input data determination module configured to use the three-dimensional patches as the input of the video processing model.
- 19. An electronic device, comprising: a memory, a processor, and a program stored in the memory and executable on the processor; wherein the processor is configured to read the program in the memory to implement the steps of the video processing method according to any one of claims 1 to 9.
- 20. A readable storage medium storing a program, wherein, when the program is executed by a processor, the steps of the video processing method according to any one of claims 1 to 9 are implemented.
Priority Applications (3)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US 17/927,148 (US20240135488A1) | 2021-10-28 | 2021-10-27 | Video processing method and device, electronic apparatus, and readable storage medium |
| PCT/CN2021/127079 (WO2023070448A1) | 2021-10-28 | 2021-10-28 | Video processing method and apparatus, electronic device, and readable storage medium |
| CN 202180003120.5A (CN116368511A) | 2021-10-28 | 2021-10-28 | Video processing method and apparatus, electronic device, and readable storage medium |
Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/127079 (WO2023070448A1) | 2021-10-28 | 2021-10-28 | Video processing method and apparatus, electronic device, and readable storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2023070448A1 | 2023-05-04 |
Family ID: 86160367
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/127079 (WO2023070448A1) | Video processing method and apparatus, electronic device, and readable storage medium | 2021-10-28 | 2021-10-28 |
Country Status (3)

| Country | Link |
|---|---|
| US | US20240135488A1 |
| CN | CN116368511A |
| WO | WO2023070448A1 |
Citations (6)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109118430A | 2018-08-24 | 2019-01-01 | 深圳市商汤科技有限公司 | Super-resolution image reconstruction method and apparatus, electronic device, and storage medium |
| CN111340711A | 2020-05-21 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Super-resolution reconstruction method, apparatus, device, and storage medium |
| CN111429355A | 2020-03-30 | 2020-07-17 | 新疆大学 | Image super-resolution reconstruction method based on a generative adversarial network |
| CN111739635A | 2020-06-10 | 2020-10-02 | 四川大学华西医院 | Diagnostic auxiliary model and image processing method for acute ischemic stroke |
| WO2020234449A1 | 2019-05-23 | 2020-11-26 | Deepmind Technologies Limited | Generative adversarial networks with temporal and spatial discriminators for efficient video generation |
| US20210209459A1 | 2017-05-08 | 2021-07-08 | Boe Technology Group Co., Ltd. | Processing method and system for convolutional neural network, and storage medium |
Family Cites Families (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020104498A1 | 2018-11-20 | 2020-05-28 | Deepmind Technologies Limited | Neural network systems for decomposing video data into layered representations |
| US11430138B2 | 2020-03-05 | 2022-08-30 | Huawei Technologies Co., Ltd. | Systems and methods for multi-frame video frame interpolation |
| US20230344962A1 | 2021-03-31 | 2023-10-26 | Meta Platforms, Inc. | Video frame interpolation using three-dimensional space-time convolution |
- 2021-10-27: US application US 17/927,148 filed (published as US20240135488A1), status: active, pending
- 2021-10-28: PCT application PCT/CN2021/127079 filed (published as WO2023070448A1), status: active, application filing
- 2021-10-28: CN application CN 202180003120.5A filed (published as CN116368511A), status: active, pending
Also Published As

| Publication number | Publication date |
|---|---|
| CN116368511A | 2023-06-30 |
| US20240135488A1 | 2024-04-25 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | WIPO information: entry into national phase | Ref document number: 17927148; Country of ref document: US |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21961814; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08.08.2024) |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 21961814; Country of ref document: EP; Kind code of ref document: A1 |