CN112750094A - Video processing method and system - Google Patents

Video processing method and system

Info

Publication number
CN112750094A
Authority
CN
China
Prior art keywords
frame
fusion
feature
time sequence
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011611610.7A
Other languages
Chinese (zh)
Other versions
CN112750094B (en)
Inventor
赵洋
马彦博
曹力
贾伟
李琳
刘晓平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202011611610.7A
Publication of CN112750094A
Application granted
Publication of CN112750094B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00: Geometric image transformations in the plane of the image
    • G06T 3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053: Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/70: Denoising; Smoothing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Television Systems (AREA)

Abstract

The invention relates to a video processing method and system. The method comprises: obtaining initial interlaced input frames; vertically interpolating the initial interlaced input frames to full frame resolution and performing temporal alignment and fusion to obtain a time sequence fusion feature frame; and removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image. Compared with the prior art, the method can jointly perform de-interlacing, compression post-processing and super-resolution on low-quality video, and thus recover a reconstructed video of high visual quality.

Description

Video processing method and system
Technical Field
The present invention relates to the field of video and image processing, and in particular, to a video processing method and system.
Background
Interlaced scanning was widely used in early television broadcast systems (e.g., NTSC, PAL and SECAM). The odd-row and even-row pixels of an interlaced frame come from two different capture instants and are called the odd field and the even field; this scheme trades off video frame rate against bandwidth well. Because the two fields are captured at different times, they have a certain displacement between them and cannot be perfectly aligned spatially. When the contents of the two fields are interleaved into one frame, noticeable jagged (combing) artifacts appear in the video, and the artifacts become more severe under large motion. Besides interlacing artifacts, old videos often contain other complex noise and have poor definition. Traditional de-interlacing methods are single-purpose and of limited effect and cannot remove these complex degradations in a unified way, while current deep-learning methods operate on a single frame and cannot handle the severe artifacts caused by large motion well enough to recover a clean picture.
Therefore, how to design a processing method and system that can recover higher visual quality from video images containing complex interlacing artifacts remains a problem to be solved in the art.
Disclosure of Invention
The invention aims to provide a video processing method and system for video images containing complex interlacing artifacts. A dual-stream network structure performs high-frequency enhancement and deep information reuse on the fusion features containing temporal information, further removing compression noise and recovering details. Finally, the image features are super-resolved by an upsampling method to obtain the final high-resolution output image.
In order to achieve the purpose, the invention provides the following scheme:
a video processing method, comprising the steps of:
obtaining an initial interlaced input frame;
vertically interpolating the initial interlaced input frames to full frame resolution, and performing time sequence alignment and fusion to obtain a time sequence fusion feature frame;
and removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image.
Optionally, the vertically interpolating the initial interlaced input frame to the complete frame resolution, and performing time sequence alignment fusion to obtain a time sequence fusion feature frame specifically includes the following steps:
splitting the initial interlaced input frame into an odd field and an even field;
acquiring depth features corresponding to the odd field and the even field by adopting a feature extraction network based on a deep neural network;
performing vertical interpolation on the depth features by adopting a vertical up-sampling network based on a deep neural network to obtain a vertically interpolated reconstructed frame;
and performing time sequence alignment fusion on the vertical interpolation reconstruction frame to obtain the time sequence fusion characteristic frame.
Optionally, the obtaining of the depth features corresponding to the odd field and the even field by using the feature extraction network based on the deep neural network specifically includes the following steps:
acquiring images corresponding to the odd field and the even field;
converting the image into a base (1×) number of feature maps through a first convolution layer, and passing the 1× feature maps, via residual connections, to a plurality of first residual blocks connected to the first convolution layer;
the plurality of first residual blocks perform feature extraction and reconstruction on the 1× feature maps to obtain the depth features corresponding to the odd field and the even field;
outputting the depth feature through a second convolutional layer.
Optionally, the vertically interpolating the depth features by using a vertical up-sampling network based on a deep neural network to obtain a vertically interpolated reconstructed frame specifically includes the following steps:
inputting said depth features into a convolutional neural network comprising a third convolutional layer and a vertical pixel scrambling block;
increasing the dimension of the 1-fold number of feature maps to 2-fold number of feature maps by the third convolutional layer, and transmitting the 2-fold number of feature maps to the vertical pixel scrambling block;
and performing up-sampling on the feature maps with the quantity being 2 times of that of the feature maps in the vertical direction through the vertical pixel scrambling block to obtain the vertically interpolated reconstructed frame.
Optionally, performing time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain the time sequence fusion feature frame specifically includes the following steps:
concatenating each adjacent frame with its corresponding target frame among the vertically interpolated reconstructed frames, wherein the adjacent frames are the frames located symmetrically around each frame of the vertically interpolated reconstructed frames;
obtaining the deformable convolution offset required by the adjacent frame through offset network learning;
according to the deformable convolution offset, sequentially aligning the adjacent frames through a deformable convolution layer to obtain aligned frame characteristics;
and performing fusion operation on the aligned frame features through a fourth convolution layer to obtain the time sequence fusion feature frame.
Optionally, the obtaining of the deformable convolution offset required by the adjacent frame through offset network learning specifically includes the following steps:
inputting said vertically interpolated reconstructed frame to an offset learning unit comprising a fifth convolution layer, a U-Net structure;
reducing the feature maps of the adjacent frames from 2 times to 1 times by the fifth convolution layer;
and according to the adjacent frames after the dimension reduction, obtaining the deformable convolution offset required by the adjacent frames through an offset learning unit of the U-Net structure.
Optionally, the step of removing compression noise from the time-series fusion feature frame to obtain a reconstructed output image specifically includes the following steps:
inputting the time sequence fusion feature frame into a convolutional neural network comprising a plurality of multi-scale blocks to obtain multi-layer output results, and connecting the multi-layer outputs to the output layer via residual connections for accumulation to obtain an enhanced processing feature frame;
meanwhile, inputting the time sequence fusion feature frame into a convolutional neural network comprising a plurality of second residual blocks to obtain an information reuse feature frame;
accumulating the enhanced processing feature frame and the information reuse feature frame to obtain an accumulated image feature frame;
and performing super-resolution reconstruction on the accumulated image feature frame to obtain the reconstructed output image.
Optionally, the entire neural network is trained end to end as a whole. The training samples use a synthesized frame sequence subjected to interlacing degradation as the training set, and the loss function L of the whole training process is:
L = √(‖Ôt - Ot‖² + ε²)
where Ôt is the output image, Ot is the reference image, t is the timestamp and ε is a constant.
The present invention also provides a video processing system, comprising:
an acquisition module for acquiring an initial interlaced input frame;
a multi-field fusion alignment de-interlacing module for vertically interpolating the initial interlaced input frame to a complete frame resolution, and performing time sequence alignment fusion to obtain a time sequence fusion feature frame;
and the de-interlacing feature optimization module is used for removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image.
Optionally, the multi-field fusion alignment de-interlacing module includes:
a field splitting unit for splitting the initial interlaced input frame into an odd field and an even field;
the characteristic extraction unit is connected with the field splitting unit and used for acquiring depth characteristics corresponding to the odd field and the even field;
the vertical up-sampling unit is connected with the feature extraction unit and is used for performing vertical interpolation on the depth features to obtain a vertical interpolation reconstruction frame;
and the multi-frame alignment fusion unit is connected with the vertical up-sampling unit and is used for carrying out time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain the time sequence fusion characteristic frame.
The de-interlacing feature optimization module comprises:
the multi-scale enhancement unit is used for carrying out high-frequency enhancement processing on the time sequence fusion feature frame to obtain an enhanced processing feature frame;
the depth residual error unit is used for reusing depth information of the time sequence fusion characteristic frame to obtain an information reuse characteristic frame;
and the accumulation unit is used for accumulating the enhanced processing characteristic frame and the information reuse characteristic frame to obtain an accumulated image characteristic.
And the up-sampling unit is used for performing super-resolution reconstruction on the accumulated image characteristics to obtain the reconstructed output image.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
Unlike traditional single-purpose de-interlacing methods, the method can uniformly realize de-interlacing, compression removal, frame interpolation and super-resolution for old videos. Through vertical interpolation of multiple fields and efficient temporal alignment, the limited field information is jointly exploited to effectively remove interlacing jaggies; at the same time, full use of temporally redundant information further eliminates complex artifacts such as compression blocking and recovers as many high-frequency details as possible, thereby restoring a high-resolution image with high visual quality.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a flowchart of a video processing method according to embodiment 1 of the present invention.
Fig. 2 is a flowchart of a video processing system according to embodiment 2 of the present invention.
Fig. 3 is a block diagram of a video processing system according to embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a video processing method and system. An input frame is first split into an odd field and an even field, which are fed into a network for feature extraction and then merged by vertical pixel interpolation, yielding several reconstructed frames at full resolution. The features of the adjacent frames are temporally aligned with the features of the intermediate frame, and a convolutional network then fuses the multiple features and reduces their dimensionality to obtain fusion features carrying temporal information. A dual-stream network structure then performs high-frequency enhancement and deep information reuse on these fusion features, further removing compression noise and recovering details. Finally, the image features are super-resolved by an upsampling method to obtain the final high-resolution output image. Compared with the prior art, the method can jointly perform de-interlacing, compression post-processing and super-resolution on low-quality video, and thus recover a reconstructed video of high visual quality.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1:
referring to fig. 1, a video processing method according to the present invention includes the following steps:
S1: obtaining an initial interlaced input frame;
S2: vertically interpolating the initial interlaced input frames to full frame resolution, and performing time sequence alignment and fusion to obtain a time sequence fusion feature frame; wherein S2 specifically comprises the following steps:
S21: splitting the initial interlaced input frame into an odd field and an even field;
S22: acquiring depth features corresponding to the odd field and the even field by adopting a feature extraction network based on a deep neural network; specifically, S22 further comprises the following steps:
S221: acquiring the images corresponding to the odd field and the even field;
S222: converting the image into a base (1×) number of feature maps through a first convolution layer, and passing the 1× feature maps, via residual connections, to a plurality of first residual blocks connected to the first convolution layer;
S223: the plurality of first residual blocks perform feature extraction and reconstruction on the 1× feature maps to obtain the depth features corresponding to the odd field and the even field;
S224: outputting the depth features through a second convolution layer.
S23: performing vertical interpolation on the depth features by adopting a vertical up-sampling network based on a deep neural network to obtain a vertically interpolated reconstructed frame; specifically, S23 further comprises the following steps:
S231: inputting the depth features into a convolutional neural network comprising a third convolution layer and a vertical pixel scrambling block;
S232: increasing the 1× feature maps to a 2× number of feature maps through the third convolution layer, and transmitting the 2× feature maps to the vertical pixel scrambling block;
S233: up-sampling the 2× feature maps in the vertical direction through the vertical pixel scrambling block to obtain the vertically interpolated reconstructed frame.
S24: performing time sequence alignment and fusion on the vertically interpolated reconstructed frames to obtain the time sequence fusion feature frame. Specifically, S24 further comprises the following steps:
S241: concatenating each adjacent frame with its corresponding target frame among the vertically interpolated reconstructed frames, wherein the adjacent frames are the frames located symmetrically around each frame of the vertically interpolated reconstructed frames;
S242: obtaining the deformable convolution offsets required by the adjacent frames through offset network learning; specifically, S242 further comprises the following steps:
S2421: inputting the vertically interpolated reconstructed frames into an offset learning unit comprising a fifth convolution layer and a U-Net structure;
S2422: reducing the feature maps of the adjacent frames from a 2× number to a 1× number through the fifth convolution layer;
S2423: obtaining, from the dimension-reduced adjacent frames, the deformable convolution offsets required by the adjacent frames through the offset learning unit of the U-Net structure.
S243: according to the deformable convolution offsets, aligning the adjacent frames in turn through a deformable convolution layer to obtain aligned frame features;
S244: fusing the aligned frame features through a fourth convolution layer to obtain the time sequence fusion feature frame.
S3: removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image. Specifically, S3 further comprises the following steps:
S31: inputting the time sequence fusion feature frame into a convolutional neural network comprising a plurality of multi-scale blocks to obtain multi-layer output results, and connecting the multi-layer outputs to the output layer via residual connections for accumulation to obtain an enhanced processing feature frame;
S32: meanwhile, inputting the time sequence fusion feature frame into a convolutional neural network comprising a plurality of second residual blocks to obtain an information reuse feature frame;
S33: accumulating the enhanced processing feature frame and the information reuse feature frame to obtain an accumulated image feature frame;
S34: performing super-resolution reconstruction on the accumulated image feature frame to obtain the reconstructed output image.
As a possible implementation, the entire neural network is trained end to end as a whole. The training samples use a synthesized frame sequence subjected to interlacing degradation as the training set, and the loss function L of the whole training process is:
L = √(‖Ôt - Ot‖² + ε²)
where Ôt is the output image, Ot is the reference image, t is the timestamp, and ε = 1 × 10⁻³.
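For illustration, a minimal sketch of this loss in PyTorch follows, assuming per-pixel Charbonnier penalties averaged over the whole tensor (the reduction and the function name charbonnier_loss are our choices, not stated in the patent):

import torch

def charbonnier_loss(output, reference, eps=1e-3):
    # L = sqrt((output - reference)^2 + eps^2), averaged over all pixels
    diff = output - reference
    return torch.sqrt(diff * diff + eps * eps).mean()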
Through the above steps, the method can uniformly realize de-interlacing, compression removal, frame interpolation and super-resolution for old videos. Through vertical interpolation of multiple fields and efficient temporal alignment, the limited field information is jointly exploited to effectively remove interlacing jaggies; at the same time, full use of temporally redundant information further eliminates complex artifacts such as compression blocking and recovers as many high-frequency details as possible, thereby restoring a high-resolution image with high visual quality.
Example 2:
referring to fig. 2 and fig. 3, the present invention further provides a video processing system, including:
an acquisition module for acquiring an initial interlaced input frame;
a multi-field fusion alignment de-interlacing module 1, configured to vertically interpolate the initial interlaced input frame to a complete frame resolution, and perform time sequence alignment fusion to obtain a time sequence fusion feature frame;
specifically, the multi-field fusion alignment de-interlacing module 1 includes:
a field splitting unit 3 for splitting the initial interlaced input frame into an odd field and an even field;
the feature extraction unit 4 is connected with the field splitting unit 3 and is used for acquiring depth features corresponding to the odd field and the even field; the feature extraction unit 4 is a feature extraction network based on a deep neural network, and the feature extraction network sequentially comprises:
an input layer for converting the images corresponding to the odd field and the even field into 1 times of feature maps by a first convolution layer;
the first residual blocks are connected with the input layer through residual errors and used for carrying out feature extraction and reconstruction on the feature maps with the quantity being 1 time that of the feature maps to obtain depth features corresponding to the odd fields and the even fields;
an output layer for outputting the depth feature through a second convolutional layer.
The vertical up-sampling unit 5 is connected with the feature extraction unit 4 and is used for performing vertical interpolation on the depth features to obtain a vertical interpolation reconstruction frame; the vertical upsampling unit 5 is a vertical upsampling network based on a deep neural network, and the vertical upsampling network sequentially comprises:
an input layer for increasing the 1-fold number of feature maps to 2-fold number of feature maps by a third convolution layer;
and the vertical pixel scrambling block is used for performing up-sampling on the feature maps with the quantity being 2 times in the vertical direction to obtain the vertically interpolated reconstructed frame.
And the multi-frame alignment fusion unit 6 is connected with the vertical upsampling unit 5 and is used for performing time sequence alignment and fusion on the vertically interpolated reconstructed frames to obtain the time sequence fusion feature frame. The unit specifically comprises:
a concatenation unit, configured to concatenate an adjacent frame and a frame of the vertically interpolated reconstructed frames corresponding to the adjacent frame, where the adjacent frame is a frame symmetric to each frame of the vertically interpolated reconstructed frames;
the deformable convolution offset unit is used for obtaining the deformable convolution offset required by the adjacent frame through offset network learning; specifically, the offset network is based on a convolutional neural network, and the offset network sequentially includes:
an input layer for reducing the feature maps of the adjacent frames from 2 times to 1 times by a fifth convolution layer;
the offset learning unit of the U-Net structure is used for obtaining the deformable convolution offset required by the adjacent frames according to the adjacent frames after the dimension reduction;
the offset learning unit of the U-Net structure comprises: 3 downsample blocks and 3 upsample blocks, wherein each downsample block comprises, in order, a sixth convolutional layer, a first LRelu activation function, a seventh convolutional layer, and a second LRelu activation line number; each up-sampling block comprises an eighth convolution layer, a third LRelu activating function, a ninth convolution layer, a fourth LRelu function and a bilinear interpolation operation in sequence; each of the downsample blocks and the corresponding upsample block are connected by a residual.
The alignment unit is used for sequentially aligning the adjacent frames through a deformable convolution layer according to the deformed convolution offset to obtain aligned frame characteristics;
and the fusion unit is used for carrying out fusion operation on the aligned frame features through a fourth convolution layer to obtain the time sequence fusion feature frame.
And the de-interlacing feature optimization module 2 is used for removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image.
Specifically, the de-interlacing feature optimization module includes:
the multi-scale enhancement unit 7 is used for performing high-frequency enhancement processing on the time sequence fusion feature frame to obtain an enhanced processing feature frame; specifically, the multi-scale enhancement unit is a multi-scale enhancement network based on a convolutional neural network, the multi-scale enhancement network comprises a plurality of multi-scale blocks, each multi-scale block is formed by stacking a tenth convolutional layer, an eleventh convolutional layer and a twelfth convolutional layer, and the outputs of the tenth convolutional layer, the eleventh convolutional layer and the twelfth convolutional layer are connected to the tail through residual errors for accumulation.
A depth residual error unit 8, configured to perform depth information reuse on the time sequence fusion feature frame to obtain an information reuse feature frame; specifically, the depth residual unit is a depth residual network based on a convolutional neural network, the depth residual network includes a plurality of second residual blocks, and each second residual block includes a thirteenth convolutional layer, a Relu activation function, and a fourteenth convolutional layer; the depth residual network further comprises a residual concatenation of outputs from the inputs to a plurality of the second residual blocks.
And an accumulation unit 9, configured to accumulate the enhancement processing feature frame and the information reuse feature frame to obtain an accumulated image feature.
And the up-sampling unit 10 is used for performing super-resolution reconstruction on the accumulated image characteristics to obtain the reconstructed output image. Specifically, the upsampling unit is an upsampling network based on a convolutional neural network, and the upsampling network sequentially includes:
an input layer for increasing the 1× feature maps to a 2× number of feature maps through a fifteenth convolution layer;
a sub-pixel convolution layer for up-sampling the 2× feature maps by a factor of 2 to obtain a 1× number of up-sampled feature maps;
an output layer for outputting the up-sampled 1-fold number of feature maps through a sixteenth convolution layer.
As a possible implementation, the system further comprises a network training unit for training the entire network end to end as a whole. The training samples use synthesized frame sequences of high-quality, high-resolution video subjected to interlacing degradation as the training set, where degradation means turning an ordinarily clear video frame into a blurred interlaced frame so as to simulate the appearance of old video. The loss function L of the whole training process is:
L = √(‖Ôt - Ot‖² + ε²)
where Ôt is the output image, Ot is the reference image, t is the timestamp, and ε = 1 × 10⁻³.
In a specific implementation, the acquisition module acquires N consecutive interlaced frames, where N is an odd number greater than or equal to 3; for convenience of description, N = 3 in the following example. For 3 consecutive interlaced frames I[t-1, t, t+1], the intermediate frame I[t] is taken as the reference frame and the remaining two symmetric frames as the adjacent frames, all with resolution H × W, where H is the height and W is the width. The odd rows of each frame form its odd field and the even rows form its even field; field separation thus yields six temporally adjacent fields of resolution H/2 × W. For these fields, the corresponding depth features are acquired using the feature extraction unit 4.
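A small sketch of this field-splitting step, assuming PyTorch tensors of shape (N, C, H, W); the function name split_fields and the 0-based row indexing convention are illustrative assumptions:

import torch

def split_fields(frames):
    # frames: tensor of shape (N, C, H, W), ordered t-1, t, t+1 for N = 3
    # returns a list of 2N fields, each of shape (C, H/2, W)
    fields = []
    for frame in frames:
        odd_field = frame[:, 0::2, :]   # rows 1, 3, 5, ... in 1-based numbering
        even_field = frame[:, 1::2, :]  # rows 2, 4, 6, ...
        fields.extend([odd_field, even_field])
    return fields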
It should be noted that the feature extraction unit 4 in the present invention is a deep neural network. As one example, the feature extraction unit 4 is a feature extraction network based on a convolutional neural network whose structure comprises, in order:
an input layer, specifically a 3 × 3 convolution layer, which converts the input image into a 64-channel feature map;
5 residual blocks, where each residual block consists of a 3 × 3 convolution layer, a ReLU activation layer and a 3 × 3 convolution layer, with 64 feature channels, used for feature extraction and reconstruction;
a residual connection from after the input layer to after the residual blocks;
an output layer, specifically a 3 × 3 convolution layer.
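Under the channel counts listed above, this feature extraction network could look roughly like the following PyTorch sketch (padding, bias and weight-initialization choices are assumptions, not specified in the text):

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureExtractor(nn.Module):
    def __init__(self, in_channels=3, channels=64, num_blocks=5):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, 3, padding=1)   # input layer
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(channels, channels, 3, padding=1)      # output layer

    def forward(self, field):
        shallow = self.head(field)
        deep = self.blocks(shallow) + shallow   # residual join from after the input layer
        return self.tail(deep)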
After the depth features are acquired, the features are vertically interpolated using a vertical upsampling unit to restore the original resolution and reduce the interleaved comb artifacts.
The vertical upsampling unit 5 in the present invention is a vertical upsampling network based on a deep neural network. Specifically, the invention provides one example of a vertical field interpolation network based on a convolutional neural network, and the structure of the vertical field interpolation network sequentially comprises the following components:
an input layer, specifically a convolution layer with a size of 3 × 3, which increases the feature maps of 64 layers in the previous stage to 128 layers;
a vertical pixel scrambling block for up-sampling the image in the vertical direction. It should be noted that the pixel shuffle (PS) module (here called a pixel scrambling block) is widely used in super-resolution networks; our vertical variant enlarges the features by a factor of 2 in the vertical direction only.
Six consecutive reconstructed feature frames Ft[1, 2, ..., 6] with resolution H × W are then obtained.
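A hedged sketch of this vertical up-sampling stage follows; the channel-ordering convention inside the vertical shuffle is an assumption, as the text only states that the 128 maps are rearranged to double the height:

import torch
import torch.nn as nn

class VerticalPixelShuffle(nn.Module):
    # Rearranges (N, 2*C, H, W) -> (N, C, 2*H, W), i.e. x2 upsampling in height only.
    def forward(self, x):
        n, c2, h, w = x.shape
        c = c2 // 2
        x = x.view(n, c, 2, h, w)        # split channels into (C, scale=2)
        x = x.permute(0, 1, 3, 2, 4)     # (N, C, H, 2, W)
        return x.reshape(n, c, h * 2, w)

class VerticalUpsampler(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * 2, 3, padding=1)   # 64 -> 128 maps
        self.shuffle = VerticalPixelShuffle()

    def forward(self, feat):                     # feat: (N, 64, H/2, W) field features
        return self.shuffle(self.expand(feat))   # -> (N, 64, H, W) reconstructed frame features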
The frames reconstructed in the above steps have had their interlacing artifacts preliminarily removed, but because only the information of a single field is used, a large amount of temporal and spatial feature information is missing and many high-frequency details cannot be recovered. We therefore need to aggregate more temporal information to enrich the features.
Because occlusions and large-scale motion in a video frame sequence seriously degrade model performance, the reconstructed frames must be aligned in time order to make full use of their temporal information.
The multi-frame alignment fusion unit 6 of the present invention is shown in fig. 3. For the consecutive reconstructed frames Ft[1, 2, ..., 6] obtained in the previous step, the goal is to obtain fused features F3* and F4* carrying temporal information. To this end, all adjacent frames of a target frame must first be aligned with it. Because of the various complex artifacts, traditional optical-flow alignment cannot learn an accurate flow to achieve alignment, so implicit alignment based on deformable convolution is adopted, and an efficient offset-learning structure is designed to learn more adaptive offsets. Taking F3 as an example, each adjacent frame is first concatenated with F3 and the deformable convolution offset ΔPj required for alignment is learned by the offset network; denoting the offset network by R,
ΔPj = R([F3, Fj]), j = 1, 2, 4, 5
The offset network R in the invention is based on a convolutional neural network, and its structure comprises, in order:
an input layer, specifically a 3 × 3 convolution layer, which reduces the concatenated adjacent-frame features from 128 channels to 64 channels;
an offset learning unit of U-Net structure comprising 3 downsampling blocks and 3 upsampling blocks, where each downsampling block consists, in order, of a 3 × 3 convolution layer with an expansion rate of 2, an LReLU activation function, a 3 × 3 convolution layer with an expansion rate of 1 and an LReLU activation function, and each upsampling block consists, in order, of a 3 × 3 convolution layer with an expansion rate of 1, an LReLU activation function and a bilinear interpolation operation with a scaling factor of 2; each downsampling block is connected to its corresponding upsampling block by a residual connection.
After obtaining the deformable convolution offsets ΔPj, the adjacent frames are aligned in turn to obtain the aligned frame features Aj:
Aj = D(Fj, ΔPj), j = 1, 2, 4, 5
where D is the alignment operation, implemented by a deformable convolution layer.
The aligned frame features Aj are fused to obtain the final output fusion feature F3* of this step.
In the invention, the fusion operation is a 1 × 1 convolution layer that reduces the concatenated groups of features to 64 layers.
The fusion feature F4* is obtained in exactly the same way as F3*, with adjacent frame indices j = 2, 3, 5, 6.
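A corresponding sketch of the alignment operation D and the 1 × 1 fusion, assuming torchvision's deform_conv2d and reusing the OffsetNet sketch above; the 3 × 3 kernel size of D and the inclusion of the target frame itself in the fusion concatenation are our assumptions:

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class AlignAndFuse(nn.Module):
    def __init__(self, channels=64, num_neighbors=4):
        super().__init__()
        self.offset_net = OffsetNet(channels)
        # weights of the deformable (alignment) convolution D
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)
        self.fuse = nn.Conv2d(channels * (num_neighbors + 1), channels, 1)   # 1x1 fusion to 64 layers

    def forward(self, target, neighbors):
        # target: (N, 64, H, W); neighbors: list of four (N, 64, H, W) reconstructed frames
        aligned = []
        for f_j in neighbors:
            offset = self.offset_net(torch.cat([target, f_j], dim=1))        # dP_j = R([F3, Fj])
            aligned.append(deform_conv2d(f_j, offset, self.weight, padding=1))  # A_j = D(Fj, dP_j)
        return self.fuse(torch.cat([target] + aligned, dim=1))               # fusion feature, e.g. F3*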
The fusion features obtained in this way contain rich spatio-temporal information and largely eliminate the interlacing artifacts, but the image features still suffer from complex degradations such as blur and compression blocking. To further learn depth features and restore details, the fusion features output in the previous stage are fed into the de-interlacing feature optimization module for further feature optimization and up-sampling to increase the resolution.
The de-interlacing feature optimization module 2 of the present invention further comprises a multi-scale enhancement unit 7, a depth residual unit 8 and an upsampling unit 10, as shown in fig. 3. For the input fusion feature F, an upsampling unit attached to the tail of the dual-stream network formed by the multi-scale enhancement unit and the depth residual unit produces the final reconstructed image Ôt. Denoting the multi-scale enhancement unit as M, the depth residual unit as S and the upsampling unit as U:
Ôt = U(M(F) + S(F))
the multi-scale enhancement unit 7 in the invention is a method based on a convolutional neural network, and the specific structure of the embodiment is as follows:
3 multi-scale blocks, each multi-scale block being stacked of one 3 x 3 convolutional layer with a span of 1, one 3 x 3 convolutional layer with a span of 2, one 3 x 3 convolutional layer with a span of 1 and one 3 x 3 convolutional layer with a span of 2, the output of each layer being connected by a residual to a tail accumulation.
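One possible reading of a multi-scale block in PyTorch, interpreting "span" as dilation rate so that all feature maps keep the same spatial size; the intermediate ReLU activations are an assumption:

import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, channels=64, dilations=(1, 2, 1, 2)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = x
        acc = x
        for conv in self.convs:
            out = self.act(conv(out))
            acc = acc + out          # each layer's output accumulated at the tail via a residual
        return acc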
The depth residual unit 8 in the present invention is likewise based on a convolutional neural network, and its structure is as follows:
6 residual blocks, each residual block consisting of a 3 × 3 convolution layer, a ReLU activation function and a 3 × 3 convolution layer;
a residual connection from the input to the output of the residual blocks.
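A short sketch of this stream, reusing the ResBlock class from the feature-extraction sketch above; the channel count and block count follow the text:

import torch.nn as nn

class DepthResidualStream(nn.Module):
    def __init__(self, channels=64, num_blocks=6):
        super().__init__()
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(num_blocks)])

    def forward(self, fused_feat):
        return fused_feat + self.blocks(fused_feat)   # long skip from input to the last block's output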
The multi-scale enhancement unit 7 aims to strengthen the representation of high-frequency details through multi-scale stacked receptive fields, while the depth residual unit 8 aims to further extract spatio-temporal features and avoid information loss. The outputs of the two units are accumulated and fed into the up-sampling unit 10 for super-resolution reconstruction.
The upsampling unit 10 in the present invention may be a conventional upsampling algorithm, such as interpolation or the like; or an up-sampling network based on a deep neural network.
Specifically, the present invention provides one example of an upsampling network based on a convolutional neural network, and the structure of the upsampling network sequentially includes:
an input layer, specifically a convolution layer with a size of 3 × 3, which increases the feature maps of 64 layers in the previous stage to 128 layers;
a sub-pixel convolution layer with a magnification factor of 2 for up-sampling the image by a factor of 2;
an output layer, specifically a 3 × 3 convolution layer, which reconstructs the 64-channel feature map into the output result Ôt.
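An illustrative sketch of this up-sampling head. The stated channel counts (128 maps in, 64 maps after a ×2 shuffle) do not match a bare PixelShuffle, which divides channels by 4 for a ×2 factor, so the sketch follows the usual sub-pixel convolution recipe and inserts a convolution to 256 maps before the shuffle; that extra convolution is our assumption:

import torch
import torch.nn as nn

class UpsampleHead(nn.Module):
    def __init__(self, channels=64, out_channels=3, scale=2):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * 2, 3, padding=1)            # 64 -> 128 maps
        self.subpixel = nn.Sequential(
            nn.Conv2d(channels * 2, channels * scale * scale, 3, padding=1),     # 128 -> 256 maps
            nn.PixelShuffle(scale),                                              # 256 -> 64 maps, x2 resolution
        )
        self.out = nn.Conv2d(channels, out_channels, 3, padding=1)               # 64 maps -> output image

    def forward(self, accumulated_feat):
        return self.out(self.subpixel(self.expand(accumulated_feat)))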
It should be noted that the vertical up-sampling unit 5 in the present invention may be trained separately in advance or integrated into the entire network for end-to-end training. The training samples for the entire network use synthesized frame sequences of high-quality, high-resolution video subjected to interlacing degradation as the training set. The main constraint is that the final reconstructed high-resolution result be consistent with the original unprocessed sample image; the training loss is:
L = √(‖Ôt - Ot‖² + ε²)
where Ôt is the output image, Ot is the reference image, t is the timestamp, and ε is empirically set to 1 × 10⁻³. This is the Charbonnier loss, which is commonly used in image enhancement and reconstruction.
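A minimal end-to-end training sketch under this setup, pairing synthetically degraded interlaced inputs with their clean high-resolution references and optimising the Charbonnier loss defined earlier; the data loader, optimizer settings and the full_network wrapper are placeholders, not part of the patent:

import torch

def train(full_network, loader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(full_network.parameters(), lr=lr)
    for _ in range(epochs):
        for interlaced_frames, reference in loader:   # degraded input, clean high-resolution target
            output = full_network(interlaced_frames)
            loss = charbonnier_loss(output, reference, eps=1e-3)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()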
Unlike traditional single-purpose de-interlacing methods, the method can uniformly realize de-interlacing, compression removal, frame interpolation and super-resolution for old videos. Through vertical interpolation of multiple fields and efficient temporal alignment, the limited field information is jointly exploited to effectively remove interlacing jaggies; at the same time, full use of temporally redundant information further eliminates complex artifacts such as compression blocking and recovers as many high-frequency details as possible, thereby restoring a high-resolution image with high visual quality.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A video processing method, comprising the steps of:
obtaining an initial interlaced input frame;
vertically interpolating the initial interlaced input frames to full frame resolution, and performing time sequence alignment and fusion to obtain a time sequence fusion feature frame;
and removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image.
2. The video processing method of claim 1, wherein vertically interpolating the initial interleaved input frames to full frame resolution and performing time alignment fusion to obtain a time sequence fusion feature frame comprises the steps of:
splitting the initial interlaced input frame into an odd field and an even field;
acquiring depth features corresponding to the odd field and the even field by adopting a feature extraction network based on a deep neural network;
performing vertical interpolation on the depth features by adopting a vertical up-sampling network based on a deep neural network to obtain a vertically interpolated reconstructed frame;
and performing time sequence alignment fusion on the vertical interpolation reconstruction frame to obtain the time sequence fusion characteristic frame.
3. The video processing method according to claim 2, wherein the obtaining of the depth features corresponding to the odd field and the even field by using the feature extraction network based on the deep neural network specifically comprises the following steps:
acquiring images corresponding to the odd field and the even field;
converting the image into 1 times of feature maps through a first convolution layer, and transmitting the 1 times of feature maps to a plurality of first residual blocks connected with the first convolution layer through residual errors;
the first residual blocks perform feature extraction and reconstruction on the feature maps with the number being 1 time that of the first residual blocks to obtain depth features corresponding to the odd fields and the even fields;
outputting the depth feature through a second convolutional layer.
4. The video processing method according to claim 3, wherein vertically interpolating the depth features using a vertical up-sampling network based on a deep neural network to obtain a vertically interpolated reconstructed frame specifically comprises the following steps:
inputting said depth features into a convolutional neural network comprising a third convolutional layer and a vertical pixel scrambling block;
increasing the dimension of the 1-fold number of feature maps to 2-fold number of feature maps by the third convolutional layer, and transmitting the 2-fold number of feature maps to the vertical pixel scrambling block;
and performing up-sampling on the feature maps with the quantity being 2 times of that of the feature maps in the vertical direction through the vertical pixel scrambling block to obtain the vertically interpolated reconstructed frame.
5. The video processing method according to claim 4, wherein performing time-series alignment fusion on the vertically interpolated reconstructed frame to obtain the time-series fusion feature frame specifically comprises the following steps:
concatenating each adjacent frame with its corresponding target frame among the vertically interpolated reconstructed frames, wherein the adjacent frames are the frames located symmetrically around each frame of the vertically interpolated reconstructed frames;
obtaining the deformable convolution offset required by the adjacent frame through offset network learning;
according to the deformable convolution offset, sequentially aligning the adjacent frames through a deformable convolution layer to obtain aligned frame characteristics;
and performing fusion operation on the aligned frame features through a fourth convolution layer to obtain the time sequence fusion feature frame.
6. The video processing method according to claim 5, wherein said obtaining the deformable convolution offset required by the adjacent frame through offset network learning specifically comprises the following steps:
inputting said vertically interpolated reconstructed frame to an offset learning unit comprising a fifth convolution layer, a U-Net structure;
reducing the feature maps of the adjacent frames from 2 times to 1 times by the fifth convolution layer;
and according to the adjacent frames after the dimension reduction, obtaining the deformable convolution offset required by the adjacent frames through an offset learning unit of the U-Net structure.
7. The video processing method according to claim 6, wherein the step of performing compression noise removal processing on the time-series fusion feature frame to obtain a reconstructed output image specifically comprises the steps of:
inputting the time sequence fusion characteristic frame into a convolutional neural network comprising a plurality of multi-scale blocks to obtain a multi-layer output result, and connecting the multi-layer output result to an output layer from a residual error for accumulation to obtain an enhanced processing characteristic frame;
meanwhile, inputting the time sequence fusion feature frame into a convolutional neural network comprising a plurality of second residual blocks to obtain an information reuse feature frame;
accumulating the enhanced processing characteristic frame and the information reuse characteristic frame to obtain an accumulated image characteristic frame;
and performing super-resolution reconstruction on the accumulated image characteristic frame to obtain the reconstructed output image.
8. The video processing method according to any one of claims 1-7, further comprising performing network training on the whole neural network, the network training being end-to-end overall training, wherein the training samples use a synthesized frame sequence subjected to interlacing degradation as the training set, and the loss function L of the whole training process is:
L = √(‖Ôt - Ot‖² + ε²)
where Ôt is the output image, Ot is the reference image, t is the timestamp and ε is a constant.
9. A video processing system, comprising:
an acquisition module for acquiring an initial interlaced input frame;
a multi-field fusion alignment de-interlacing module for vertically interpolating the initial interlaced input frame to a complete frame resolution, and performing time sequence alignment fusion to obtain a time sequence fusion feature frame;
and the de-interlacing feature optimization module is used for removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image.
10. The video processing system of claim 9, wherein the multi-field fusion alignment de-interlacing module comprises:
a field splitting unit for splitting the initial interlaced input frame into an odd field and an even field;
the characteristic extraction unit is connected with the field splitting unit and used for acquiring depth characteristics corresponding to the odd field and the even field;
the vertical up-sampling unit is connected with the feature extraction unit and is used for performing vertical interpolation on the depth features to obtain a vertical interpolation reconstruction frame;
the multi-frame alignment fusion unit is connected with the vertical up-sampling unit and is used for carrying out time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain the time sequence fusion characteristic frame;
the de-interlacing feature optimization module comprises:
the multi-scale enhancement unit is used for carrying out high-frequency enhancement processing on the time sequence fusion feature frame to obtain an enhanced processing feature frame;
the depth residual error unit is used for reusing depth information of the time sequence fusion characteristic frame to obtain an information reuse characteristic frame;
the accumulation unit is used for accumulating the enhanced processing characteristic frame and the information reuse characteristic frame to obtain an accumulated image characteristic;
and the up-sampling unit is used for performing super-resolution reconstruction on the accumulated image characteristics to obtain the reconstructed output image.
CN202011611610.7A 2020-12-30 2020-12-30 Video processing method and system Active CN112750094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011611610.7A CN112750094B (en) 2020-12-30 2020-12-30 Video processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011611610.7A CN112750094B (en) 2020-12-30 2020-12-30 Video processing method and system

Publications (2)

Publication Number Publication Date
CN112750094A true CN112750094A (en) 2021-05-04
CN112750094B CN112750094B (en) 2022-12-09

Family

ID=75649732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011611610.7A Active CN112750094B (en) 2020-12-30 2020-12-30 Video processing method and system

Country Status (1)

Country Link
CN (1) CN112750094B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114092339A (en) * 2022-01-24 2022-02-25 南京理工大学 Space-time video super-resolution reconstruction method based on cross-frame self-attention transformation network
CN115150201A (en) * 2022-09-02 2022-10-04 南通市艺龙科技有限公司 Remote encryption transmission method for cloud computing data
CN115348432A (en) * 2022-08-15 2022-11-15 上海壁仞智能科技有限公司 Data processing method and device, image processing method, electronic device and medium
WO2023274405A1 (en) * 2021-07-01 2023-01-05 Beijing Bytedance Network Technology Co., Ltd. Super resolution position and network structure
CN115994857A (en) * 2023-01-09 2023-04-21 深圳大学 Video super-resolution method, device, equipment and storage medium
WO2024082933A1 (en) * 2022-10-21 2024-04-25 抖音视界有限公司 Video processing method and apparatus, and electronic device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477684A (en) * 2008-12-11 2009-07-08 西安交通大学 Process for reconstructing human face image super-resolution by position image block
CN101621652A (en) * 2009-07-21 2010-01-06 上海华平信息技术股份有限公司 Method for transmitting interlaced picture in high quality and changing the interlaced picture into non-interlaced picture in picture transmission system
CN102204250A (en) * 2008-08-29 2011-09-28 Gvbb控股股份有限公司 Encoding method, encoding device, and encoding program for encoding interlaced image
CN104519363A (en) * 2013-09-26 2015-04-15 汤姆逊许可公司 Video encoding/decoding methods, corresponding computer programs and video encoding/decoding devices
WO2019009488A1 (en) * 2017-07-06 2019-01-10 삼성전자 주식회사 Method and device for encoding or decoding image
CN111524068A (en) * 2020-04-14 2020-08-11 长安大学 Variable-length input super-resolution video reconstruction method based on deep learning
CN111583112A (en) * 2020-04-29 2020-08-25 华南理工大学 Method, system, device and storage medium for video super-resolution
CN112102163A (en) * 2020-08-07 2020-12-18 南京航空航天大学 Continuous multi-frame image super-resolution reconstruction method based on multi-scale motion compensation framework and recursive learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102204250A (en) * 2008-08-29 2011-09-28 Gvbb控股股份有限公司 Encoding method, encoding device, and encoding program for encoding interlaced image
CN101477684A (en) * 2008-12-11 2009-07-08 西安交通大学 Process for reconstructing human face image super-resolution by position image block
CN101621652A (en) * 2009-07-21 2010-01-06 上海华平信息技术股份有限公司 Method for transmitting interlaced picture in high quality and changing the interlaced picture into non-interlaced picture in picture transmission system
CN104519363A (en) * 2013-09-26 2015-04-15 汤姆逊许可公司 Video encoding/decoding methods, corresponding computer programs and video encoding/decoding devices
WO2019009488A1 (en) * 2017-07-06 2019-01-10 삼성전자 주식회사 Method and device for encoding or decoding image
US20200120340A1 (en) * 2017-07-06 2020-04-16 Samsung Electronics Co., Ltd. Method and device for encoding or decoding image
CN111524068A (en) * 2020-04-14 2020-08-11 长安大学 Variable-length input super-resolution video reconstruction method based on deep learning
CN111583112A (en) * 2020-04-29 2020-08-25 华南理工大学 Method, system, device and storage medium for video super-resolution
CN112102163A (en) * 2020-08-07 2020-12-18 南京航空航天大学 Continuous multi-frame image super-resolution reconstruction method based on multi-scale motion compensation framework and recursive learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
G.F. Bertsch et al.: "Odd-even mass differences from self-consistent mean field theory", Physical Review *
景文博 et al.: "Research on de-interlacing algorithms for ghosting images", Journal of Changchun University of Science and Technology (Natural Science Edition) *
毕雨来: "Video resolution enhancement technology and its GPU implementation", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023274405A1 (en) * 2021-07-01 2023-01-05 Beijing Bytedance Network Technology Co., Ltd. Super resolution position and network structure
CN114092339A (en) * 2022-01-24 2022-02-25 南京理工大学 Space-time video super-resolution reconstruction method based on cross-frame self-attention transformation network
CN114092339B (en) * 2022-01-24 2022-05-20 南京理工大学 Space-time video super-resolution reconstruction method based on cross-frame self-attention transformation network
CN115348432A (en) * 2022-08-15 2022-11-15 上海壁仞智能科技有限公司 Data processing method and device, image processing method, electronic device and medium
CN115348432B (en) * 2022-08-15 2024-05-07 上海壁仞科技股份有限公司 Data processing method and device, image processing method, electronic equipment and medium
CN115150201A (en) * 2022-09-02 2022-10-04 南通市艺龙科技有限公司 Remote encryption transmission method for cloud computing data
CN115150201B (en) * 2022-09-02 2022-11-08 南通市艺龙科技有限公司 Remote encryption transmission method for cloud computing data
WO2024082933A1 (en) * 2022-10-21 2024-04-25 抖音视界有限公司 Video processing method and apparatus, and electronic device and storage medium
CN115994857A (en) * 2023-01-09 2023-04-21 深圳大学 Video super-resolution method, device, equipment and storage medium
CN115994857B (en) * 2023-01-09 2023-10-13 深圳大学 Video super-resolution method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112750094B (en) 2022-12-09

Similar Documents

Publication Publication Date Title
CN112750094B (en) Video processing method and system
CN102714726B (en) Edge enhancement for temporal scaling with metadata
CN1210954C (en) Equipment and method for covering interpolation fault in alternate-line scanning to line-by-line scanning converter
DE69120139T2 (en) Device and method for the adaptive compression of successive blocks of a digital video signal
US6118488A (en) Method and apparatus for adaptive edge-based scan line interpolation using 1-D pixel array motion detection
EP2101506B1 (en) Image processing apparatus and method for format conversion
CN111885280B (en) Hybrid convolutional neural network video coding loop filtering method
CN102685475B (en) Interlace compressed display method and system for reducing video frame rate
CN113055674B (en) Compressed video quality enhancement method based on two-stage multi-frame cooperation
CN110062232A (en) A kind of video-frequency compression method and system based on super-resolution
CN113066022B (en) Video bit enhancement method based on efficient space-time information fusion
CN113850718A (en) Video synchronization space-time super-resolution method based on inter-frame feature alignment
CN115689917A (en) Efficient space-time super-resolution video compression restoration method based on deep learning
Kwon et al. A motion-adaptive de-interlacing method
US6804303B2 (en) Apparatus and method for increasing definition of digital television
Zhao et al. Multiframe joint enhancement for early interlaced videos
KR101979584B1 (en) Method and Apparatus for Deinterlacing
EP1590964B1 (en) Video coding
CN116418990A (en) Method for enhancing compressed video quality based on neural network
CN115065796A (en) Method and device for generating video intermediate frame
Yeh et al. VDNet: video deinterlacing network based on coarse adaptive module and deformable recurrent residual network
US9049448B2 (en) Bidimensional bit-rate reduction processing
KR100827214B1 (en) Motion compensated upconversion for video scan rate conversion
KR101037023B1 (en) High resolution interpolation method and apparatus using high frequency synthesis based on clustering
CN115861078B (en) Video enhancement method and system based on bidirectional space-time recursion propagation neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant