CN114663285B - Old movie super-resolution system based on convolutional neural network - Google Patents

Old movie super-resolution system based on convolutional neural network

Info

Publication number
CN114663285B
Authority
CN
China
Prior art keywords
features
feature
image
primary
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210339390.XA
Other languages
Chinese (zh)
Other versions
CN114663285A (en)
Inventor
闫子飞
赵延超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202210339390.XA priority Critical patent/CN114663285B/en
Publication of CN114663285A publication Critical patent/CN114663285A/en
Application granted granted Critical
Publication of CN114663285B publication Critical patent/CN114663285B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS › G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image › G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting › G06T 3/4053 Scaling based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T 5/00 Image enhancement or restoration › G06T 5/73 Deblurring; Sharpening
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/20 Special algorithmic details › G06T 2207/20081 Training; Learning
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/04 Architecture, e.g. interconnection topology › G06N 3/045 Combinations of networks
    • G06N 3/04 Architecture › G06N 3/048 Activation functions
    • G06N 3/02 Neural networks › G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

An old film super-resolution system based on a convolutional neural network, belonging to the fields of digital image processing and deep learning. The invention addresses the problem that existing old-film restoration can perform only super-resolution processing or only blotch removal, so the restoration effect is poor. The system comprises: a smoothing module, which extracts features from the adjacent previous k frames, the current frame and the adjacent next k frames to obtain the mid-/low-frequency global degradation features of all input frames; a feature extraction module, which extracts features to obtain the extracted feature of each input frame; a PCD module, which performs an alignment operation on each image group; a temporal attention module, which obtains the fused high-frequency feature of the current frame; and a reconstruction module, which obtains the reconstructed feature of the current frame. Finally, a high-definition restored image of the current frame is obtained through an addition unit. The invention alleviates the blotch problem of old films while achieving super-resolution.

Description

Old movie super-resolution system based on convolutional neural network
Technical Field
The invention relates to an old film super-resolution system based on a convolutional neural network, and belongs to the fields of digital image processing and deep learning.
Background
As an early art form, old films record the appearance of early society and are a precious cultural heritage and indispensable historical material. To present their value to today's public, they need high-definition restoration that meets viewers' expectations of image quality. Image Super-Resolution (SR) is a subtask of image restoration: it processes an input image or image sequence of Low Resolution (LR) into an image or image sequence of High Resolution (HR), so that the processed images have full, rich, fine and clear detail.
SRCNN, proposed in 2015, applied convolutional neural networks (Convolutional Neural Network, CNN) to image super-resolution for the first time and achieved quite good performance with only three convolutional layers. The improved FSRCNN adds a deconvolution layer at the end of the network to enlarge the image size, so that the LR image can be input directly without first being enlarged by bicubic interpolation, improving both the running speed and the quality of the generated image. VDSR adopts a residual structure and extends the network depth to 20 layers, bringing a better effect. ESPCN introduces a sub-pixel convolution layer to replace the up-sampling operation, greatly reducing computational overhead; this structure is widely used in subsequent algorithms. Compared with single-image super-resolution (SISR), Video Super-Resolution (VSR) can use information from multiple frames and is theoretically capable of better performance, but it depends on accurate alignment between video frames. TDAN was the first to adopt deformable convolution (Deformable Convolution) for alignment in the super-resolution task, avoiding the two-stage process used by optical-flow-based methods. The EDVR algorithm of 2019 draws on the deformable-convolution alignment of TDAN and, on that basis, proposes the PCD (Pyramid, Cascading and Deformable) alignment module, which improves alignment precision in a coarse-to-fine manner and can successfully handle severe and complex motion.
However, film stock is very fragile: repeated projection causes physical abrasion, and mildew and chemical corrosion easily develop after years of storage; contaminants attached to the film surface, such as dust, hair and plant fibers, also leave noise, stains and the like. These problems appear in the picture as blotches (Blotch) of arbitrary size, color, shape and position, and they are discontinuous in time, i.e. the same blotch does not appear on two consecutive frames. If super-resolution is applied directly to an old film, the blotches on the picture interfere with inter-frame alignment and are amplified in the output image, degrading the image quality.
For the blotch problem, early restoration methods were mostly simple filters, such as median filtering, multistage median filtering, LUM filtering and topological median filtering. In 1998, Kokaram proposed a special two-level median filter, ML3Dex, which not only brings the pixels in the 8-neighborhood of each position into the calculation but also uses information from the previous and subsequent frames. In 2008, Jain and Seung used a CNN for image denoising, with better denoising effect than the traditional wavelet-transform and Markov Random Field (MRF) methods. Later, deeper image-denoising methods such as DnCNN became popular, but most of them only target simple Gaussian noise; few deep-learning methods address the old-film restoration task, and the achievements of deep learning have not been fully exploited. In 2019, Iizuka et al. proposed a method for colorizing old films whose preprocessing module consists of 3D convolution layers and whose ability to restore film blotches exceeds that of DnCNN. These algorithms show the great potential of CNNs in image denoising and old-film blotch removal.
However, the prior art generally accomplishes only a single task, addressing either the super-resolution problem or the blotch problem, and lacks an integrated solution for high-definition restoration of old films.
Disclosure of Invention
To address the problem that existing old-film restoration can perform only super-resolution processing or only blotch removal, so the restoration effect is poor, the invention provides an old film super-resolution system based on a convolutional neural network.
The invention relates to an old film super-resolution system based on a convolutional neural network, which comprises,
a smoothing module, which extracts features from the adjacent previous k frames x_{t-i}, the current frame x_t and the adjacent next k frames x_{t+i} to obtain the mid-/low-frequency global degradation features of all input frames, where k is a positive integer and i = 1, 2, 3, …, k;
a feature extraction module, which extracts features of each input frame separately to obtain the extracted feature of each input frame; the extracted features of all input frames are paired to form 2k+1 image groups, each group consisting of the extracted feature corresponding to the current frame x_t combined in turn with the extracted feature corresponding to one of the input frames (including x_t itself);
a PCD module, which, in each image group, takes the extracted feature corresponding to the current frame x_t as the reference feature and aligns the other extracted feature to the reference feature, obtaining the aligned feature of each image group;
a temporal attention module, which equalizes all the aligned features to obtain the fused high-frequency feature M_t of the current frame x_t;
a reconstruction module, which performs image super-resolution reconstruction on the sum of the mid-/low-frequency global degradation features and the fused high-frequency feature M_t to obtain the reconstructed feature of the current frame x_t;
an addition unit, which adds the reconstructed feature to the up-sampling result of the current frame x_t to obtain the high-definition restored image ŷ_t of the current frame x_t.
Further, according to the old film super-resolution system based on the convolutional neural network,
the process of obtaining the aligned features of each set of images by the PCD module includes:
respectively performing first-level downsampling on the reference feature and the other extracted feature to obtain a first-level reference feature and a first-level extracted feature, and respectively performing second-level downsampling on the first-level reference feature and the first-level extracted feature to obtain a second-level reference feature and a second-level extracted feature;
connecting the secondary reference feature with the secondary extracted feature to obtain a secondary offset;
the secondary offset and the secondary extracted features are subjected to deformable convolution treatment to obtain primary aligned features;
the primary offset is obtained by combining the results of the connection of the primary reference feature and the primary extracted feature with the secondary offset;
after the primary offset and the primary extracted features are subjected to deformable convolution treatment, combining the primary aligned features to obtain secondary aligned features;
combining the primary offset with the result of connecting the reference feature and the other extracted feature to obtain the original-scale offset;
performing deformable convolution on the original-scale offset and the other extracted feature, and combining the result with the secondary aligned feature to obtain the tertiary aligned feature;
connecting the reference feature with the tertiary aligned feature to obtain the final offset;
and performing deformable convolution on the final offset and the tertiary aligned feature to obtain the aligned feature of the input frame corresponding to the other extracted feature.
Still further, according to the old movie super-resolution system based on convolutional neural network of the present invention,
the time attention module obtains the high-frequency characteristic M after fusion t The process of (1) comprises:
for the current frame image x t Respectively carrying out primary equalization treatment on the aligned features of the other frame images to obtain primary fused high-frequency features of the other frame images;
all primary fused high-frequency characteristics and the current frame image x t The aligned features of the (a) are connected to obtain a connected high-frequency feature, and then the connected high-frequency feature is convolved and activated to obtain a fused high-frequency feature M t
Still further, according to the old movie super-resolution system based on convolutional neural network of the present invention,
the primary equalization process includes:
image x of current frame t The aligned features of the (a) and any other frame images are connected to obtain connected features, and the connected features are subjected to convolution operation and ReLU function activation to obtain first-level mixed features; performing convolution operation and sigmoid function activation on the first-level mixed features to obtain second-level mixed features; and after the secondary mixed feature is activated by global average pooling operation and sigmoid function, obtaining time attention, and multiplying the time attention with the aligned feature points of any other frame image to obtain the primary fused high-frequency feature.
Still further, according to the old movie super-resolution system based on convolutional neural network of the present invention,
the calculation method of the secondary mixing characteristics comprises the following steps:
z t-i =σ(W 2 δ(W 1 [F t ,F t-i ])):
z in t-i For the second order hybrid feature corresponding to any other frame image, σ (·) represents the sigmoid function, δ (·) represents the ReLU function, W 1 Weights of convolution layers operating on connected features, W 2 Weights of convolutional layers for operating on first-order hybrid features, F t For the current frame image x t Is the aligned feature of F t-i Is an aligned feature of any one of the adjacent previous k-frame images.
Still further, according to the old movie super-resolution system based on convolutional neural network of the present invention,
the time attention calculating method comprises the following steps:
s t-i =σ(AvgPooling(z t-i )),
s in t-i For the temporal attention corresponding to any of the other frame images, avgPooling represents global average pooling.
Still further, according to the old movie super-resolution system based on convolutional neural network of the present invention,
the high frequency characteristics after primary fusion are represented as M t-i
M t-i =s t-i F t-i
Still further, according to the old movie super-resolution system based on convolutional neural network of the present invention,
The fused high-frequency feature M_t is:
M_t = φ(W_3 [{M_{t-i}}, F_t]),
where φ(·) denotes the Leaky ReLU function, W_3 is the weight of the 1×1 convolution layer operating on the connected high-frequency features, and {M_{t-i}} denotes the set of all primary fused high-frequency features.
Still further, according to the old movie super-resolution system based on convolutional neural network of the present invention,
the feature extraction module comprises 5 residual blocks; the reconstruction module comprises 10 residual blocks; the number of channels per residual block is 64.
Still further, according to the old movie super-resolution system based on convolutional neural network of the present invention,
the smoothing module employs a 3 x 3 13 3D convolutional layer encoder and decoder structure.
The invention has the following beneficial effects: it provides an end-to-end convolutional neural network architecture that alleviates the blotch problem of old films while achieving super-resolution.
Experiments prove that, after old-film restoration, the system has clear advantages in picture sharpness, naturalness and agreement with human perception. The temporal attention branch focuses on high-frequency information such as image edges and textures, making object contours clear and visible; the smoothing module outputs mid-/low-frequency features, in which the image is blurred. After the two are added, the high- and low-frequency information is fused, which facilitates super-resolution reconstruction by the reconstruction module. As the results in Tables 1 and 2 show, the method not only makes the image clearer but also effectively reduces the blotch problem, and the generated images are closest to the ground truth.
Experiments also prove that the method is robust across old films of various styles, produces sharper edges, and has a definite removal effect on real blotches.
Drawings
FIG. 1 is a network overall structure diagram of an old film super-resolution system based on a convolutional neural network;
FIG. 2 is a detailed flow diagram of the feature extraction module and the PCD module processing an image; an intermediate frame and an adjacent frame are input, their features are extracted, and multi-level operations are then performed on these features to align them. In the figure, dashed lines denote downsampling or upsampling of a feature or offset, and DConv denotes deformable convolution;
FIG. 3 is a data-processing flow diagram of the temporal attention module; Element-Wise Product denotes element-wise multiplication and Concat denotes the connection (concatenation) operation; the aligned features are input and the fused feature is output;
FIG. 4 is a graph of the frame-by-frame average brightness of an artificially simulated old film before and after restoration;
FIG. 5 is a flow chart of the old movie data simulation employed by the present invention;
FIG. 6 is an original video of a real old movie;
FIG. 7 is the result of repairing FIG. 6 using the system of the present invention;
FIG. 8 is a result of repairing FIG. 6 after removing the smoothing module in the system of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention is further described below with reference to the drawings and specific examples, which are not intended to be limiting.
The invention provides an old film super-resolution system based on a convolutional neural network, which is shown in the accompanying figures 1 to 3, and comprises,
a smoothing module, which extracts features from the adjacent previous k frames x_{t-i}, the current frame x_t and the adjacent next k frames x_{t+i} to obtain the mid-/low-frequency global degradation features of all input frames, where k is a positive integer and i = 1, 2, 3, …, k;
a feature extraction module, which extracts features of each input frame separately to obtain the extracted feature of each input frame; the extracted features of all input frames are paired to form 2k+1 image groups, each group consisting of the extracted feature corresponding to the current frame x_t combined in turn with the extracted feature corresponding to one of the input frames (including x_t itself);
a PCD module, which, in each image group, takes the extracted feature corresponding to the current frame x_t as the reference feature and aligns the other extracted feature to the reference feature, obtaining the aligned feature of each image group;
a temporal attention module, which equalizes all the aligned features to obtain the fused high-frequency feature M_t of the current frame x_t;
a reconstruction module, which performs image super-resolution reconstruction on the sum of the mid-/low-frequency global degradation features and the fused high-frequency feature M_t to obtain the reconstructed feature of the current frame x_t;
an addition unit, which adds the reconstructed feature to the up-sampling result of the current frame x_t to obtain the high-definition restored image ŷ_t of the current frame x_t.
A core problem in video super-resolution (VSR) is how to fully exploit the information of neighboring frames. For old films, blotches and flicker cause obvious differences in image quality between frames. This embodiment therefore proposes a temporal attention module that balances the amount of information across the multiple frames and performs feature fusion. Furthermore, although image super-resolution focuses mainly on restoring high-frequency information, the global degradation information of old-film video in both the temporal and spatial dimensions can assist restoration. To extract this mid-/low-frequency information, a smoothing module composed of 3D convolution layers is employed. For feature alignment, a PCD module is employed.
In this embodiment, the inputs are an intermediate frame, i.e. the current frame x_t, together with one or more adjacent frames before and after it, and the high-definition restored image ŷ_t corresponding to the current frame x_t is obtained by computation.
The flow before the reconstruction module is divided into two branches: the smoothing-module branch is responsible for extracting the mid-/low-frequency global degradation features, while the other branch, composed of the feature extraction module, the PCD module and the temporal attention module, is responsible for extracting the high-frequency features. Through the 3D convolution layers and through feature alignment and fusion, the two branches integrate the effective information of the multiple frames. The reconstruction module performs image super-resolution reconstruction on the sum of the results of the two branches. A minimal sketch of this two-branch data flow is given below.
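The following PyTorch sketch illustrates how the two branches could be composed; it is only a schematic under assumed module interfaces (the sub-module classes, channel counts and the ×4 scale factor are illustrative assumptions, not the patented implementation):

# A minimal sketch of the two-branch data flow. The sub-modules passed to the
# constructor (smoothing, extractor, pcd_align, temporal_attention, reconstructor)
# and the x4 scale are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OldFilmSR(nn.Module):
    def __init__(self, smoothing, extractor, pcd_align, temporal_attention,
                 reconstructor, scale=4):
        super().__init__()
        self.smoothing = smoothing            # 3D-conv encoder-decoder branch (mid-/low-frequency)
        self.extractor = extractor            # per-frame feature extraction (residual blocks)
        self.align = pcd_align                # PCD alignment towards the current frame
        self.attention = temporal_attention   # temporal attention fusion (high-frequency)
        self.reconstruct = reconstructor      # reconstruction and upsampling to an RGB residual
        self.scale = scale

    def forward(self, frames):
        # frames: list of 2k+1 tensors of shape (N, 3, H, W); the middle one is x_t
        t = len(frames) // 2
        low_freq = self.smoothing(torch.stack(frames, dim=2))     # (N, C, H, W)
        feats = [self.extractor(f) for f in frames]
        ref = feats[t]                                             # reference feature of x_t
        aligned = [self.align(ref, f) for f in feats]              # one aligned feature per group
        high_freq = self.attention(aligned[t], aligned[:t] + aligned[t + 1:])
        recon = self.reconstruct(low_freq + high_freq)             # (N, 3, scale*H, scale*W)
        upsampled = F.interpolate(frames[t], scale_factor=self.scale,
                                  mode='bicubic', align_corners=False)
        return recon + upsampled                                   # high-definition restored frame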
Further, as shown in FIG. 2, the process by which the PCD module obtains the aligned feature for each image group includes the following steps (a simplified code sketch follows this list):
respectively performing first-level downsampling on the reference feature and the other extracted feature to obtain a first-level reference feature and a first-level extracted feature, and respectively performing second-level downsampling on the first-level reference feature and the first-level extracted feature to obtain a second-level reference feature and a second-level extracted feature;
connecting the secondary reference feature with the secondary extracted feature to obtain a secondary offset;
the secondary offset and the secondary extracted features are subjected to deformable convolution treatment to obtain primary aligned features;
the primary offset is obtained by combining the results of the connection of the primary reference feature and the primary extracted feature with the secondary offset;
after the primary offset and the primary extracted features are subjected to deformable convolution treatment, combining the primary aligned features to obtain secondary aligned features;
combining the primary offset with the result of connecting the reference feature and the other extracted feature to obtain the original-scale offset;
performing deformable convolution on the original-scale offset and the other extracted feature, and combining the result with the secondary aligned feature to obtain the tertiary aligned feature;
connecting the reference feature with the tertiary aligned feature to obtain the final offset;
and performing deformable convolution on the final offset and the tertiary aligned feature to obtain the aligned feature of the input frame corresponding to the other extracted feature.
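The pyramid alignment can be sketched with PyTorch's deformable convolution as follows; the layer sizes, the three-level pyramid and the doubling of upsampled offsets are illustrative assumptions, and the separate cascading refinement of the PCD module is folded into the per-level combination:

# A simplified sketch of pyramid deformable alignment using torchvision's DeformConv2d.
# Channel counts, the number of levels and the offset rescaling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class PyramidAlign(nn.Module):
    def __init__(self, c=64, levels=3, k=3):
        super().__init__()
        self.levels = levels
        self.down = nn.ModuleList(
            [nn.Conv2d(c, c, 3, stride=2, padding=1) for _ in range(levels - 1)])
        self.offset_conv = nn.ModuleList(                    # offset predictor per level
            [nn.Conv2d(2 * c, 2 * k * k, 3, padding=1) for _ in range(levels)])
        self.dconv = nn.ModuleList(                          # deformable conv per level
            [DeformConv2d(c, c, k, padding=k // 2) for _ in range(levels)])

    def forward(self, ref, feat):
        # ref/feat: (N, C, H, W); H and W assumed divisible by 2**(levels-1)
        refs, feats = [ref], [feat]
        for d in self.down:                                  # build the feature pyramids
            refs.append(d(refs[-1]))
            feats.append(d(feats[-1]))
        offset, aligned = None, None
        for lv in reversed(range(self.levels)):              # coarse to fine
            off = self.offset_conv[lv](torch.cat([refs[lv], feats[lv]], dim=1))
            if offset is not None:                           # combine with the coarser offset
                off = off + 2.0 * F.interpolate(offset, scale_factor=2, mode='bilinear',
                                                align_corners=False)
            cur = self.dconv[lv](feats[lv], off)
            if aligned is not None:                          # combine with the coarser aligned feature
                cur = cur + F.interpolate(aligned, scale_factor=2, mode='bilinear',
                                          align_corners=False)
            offset, aligned = off, cur
        return aligned                                       # aligned feature at full resolution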
Still further, as shown in FIG. 3, the process by which the temporal attention module obtains the fused high-frequency feature M_t comprises:
performing primary equalization on the aligned feature of the current frame x_t together with the aligned feature of each of the other frames, respectively, to obtain the primary fused high-frequency feature of each of the other frames;
connecting all primary fused high-frequency features with the aligned feature of the current frame x_t to obtain the connected high-frequency feature, and then convolving and activating the connected high-frequency feature to obtain the fused high-frequency feature M_t.
In this embodiment, the aligned feature of the current frame x_t refers to the result obtained when, in an image group, both extracted features are those of the current frame x_t and the alignment operation is applied to them; the aligned feature of the input frame corresponding to the other extracted feature refers to the result obtained when, in an image group, one extracted feature corresponds to the current frame x_t and the other to a non-current frame, and the alignment operation yields the aligned feature corresponding to the extracted feature of the non-current frame.
As shown in connection with fig. 3, the primary equalization process includes:
image x of current frame t The aligned features of the (a) and any other frame images are connected to obtain connected features, and the connected features are subjected to convolution operation and ReLU function activation to obtain first-level mixed features; performing convolution operation and sigmoid function activation on the first-level mixed features to obtain second-level mixed features; and after the secondary mixed feature is activated by global average pooling operation and sigmoid function, obtaining time attention, and multiplying the time attention with the aligned feature points of any other frame image to obtain the primary fused high-frequency feature.
The secondary mixed feature is calculated as:
z_{t-i} = σ(W_2 δ(W_1 [F_t, F_{t-i}])),
where z_{t-i} is the secondary mixed feature corresponding to any other frame, σ(·) denotes the sigmoid function, δ(·) denotes the ReLU function, W_1 is the weight of the convolution layer operating on the connected feature, W_2 is the weight of the convolution layer operating on the primary mixed feature, F_t is the aligned feature of the current frame x_t, and F_{t-i} is the aligned feature of any one of the adjacent previous k frames.
In this embodiment, although z_{t-i} fuses the temporal information of the two frames, the convolution kernels also capture spatially correlated information. Global average pooling is therefore used to compress the spatial information.
The temporal attention is calculated as:
s_{t-i} = σ(AvgPooling(z_{t-i})),
where s_{t-i} is the temporal attention corresponding to any other frame and AvgPooling denotes global average pooling.
s_{t-i} is computed from F_t and F_{t-i}; its value reflects the relative difference between the two and determines how strongly the amount of information from different frames is adjusted. Intuitively, a frame whose brightness is lower because of degradation receives a larger temporal attention, which helps fully mine its dark information; conversely, frames with higher brightness are suppressed. The scaled features corresponding to all adjacent frames are connected with F_t for the next step of fusion:
the high frequency characteristics after primary fusion are represented as M t-i
M t-i =s t-i F t-i
The fused high-frequency feature M_t is:
M_t = φ(W_3 [{M_{t-i}}, F_t]),
where φ(·) denotes the Leaky ReLU function, W_3 is the weight of the 1×1 convolution layer operating on the connected high-frequency features, and {M_{t-i}} denotes the set of all primary fused high-frequency features.
Because of blotches, flicker and similar problems, the input image sequence exhibits not only motion of the scene or objects but also brightness fluctuation. After the PCD module, the features of different frames have been aligned in the spatial direction, but the feature values are still unevenly distributed over time. To integrate the multi-frame features effectively, this embodiment provides a temporal attention module that equalizes the features and improves the network's sensitivity to information. It uses the intermediate-frame feature F_t and the features of the neighboring frames to compute a temporal attention s_{t-i} ∈ R^C, which is used as a weight to emphasize or suppress F_{t-i}, where C is the number of channels. F_{t-i} is scaled by s_{t-i} and connected with F_t, and the fused feature M_t is obtained after feature fusion. A minimal code sketch of this module is given below.
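The following sketch implements the formulas above in PyTorch; the kernel sizes (3×3 for W_1 and W_2) and the 64-channel width are assumptions taken from the examples in this description:

# A sketch of the temporal attention fusion: z = sigmoid(W2 relu(W1 [F_t, F_{t-i}])),
# s = sigmoid(AvgPool(z)), M_{t-i} = s * F_{t-i}, M_t = lrelu(W3 [{M_{t-i}}, F_t]).
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    def __init__(self, channels=64, num_neighbors=2):
        super().__init__()
        self.w1 = nn.Conv2d(2 * channels, channels, 3, padding=1)   # acts on the connected feature
        self.w2 = nn.Conv2d(channels, channels, 3, padding=1)       # acts on the primary mixed feature
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()
        self.pool = nn.AdaptiveAvgPool2d(1)                         # global average pooling
        self.w3 = nn.Conv2d((num_neighbors + 1) * channels, channels, 1)  # 1x1 fusion convolution
        self.lrelu = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, f_t, neighbors):
        # f_t: (N, C, H, W) aligned current-frame feature; neighbors: list of (N, C, H, W)
        fused = []
        for f_n in neighbors:
            z = self.sigmoid(self.w2(self.relu(self.w1(torch.cat([f_t, f_n], dim=1)))))
            s = self.sigmoid(self.pool(z))      # temporal attention, one weight per channel
            fused.append(s * f_n)               # primary fused high-frequency feature
        return self.lrelu(self.w3(torch.cat(fused + [f_t], dim=1)))  # fused feature M_t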
As an example, the feature extraction module includes 5 residual blocks; the reconstruction module comprises 10 residual blocks; the number of channels per residual block is 64.
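A plain residual block consistent with these numbers might look as follows; the internal block design (two 3×3 convolutions with a ReLU and no normalization) is an assumption, only the block counts and channel width follow the text:

# A sketch of the residual blocks assumed above: 64 channels, 5 blocks for feature
# extraction and 10 for reconstruction.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)                 # identity skip connection

feature_extraction = nn.Sequential(*[ResidualBlock(64) for _ in range(5)])
reconstruction = nn.Sequential(*[ResidualBlock(64) for _ in range(10)])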
As an example, the smoothing module employs an "encoder-decoder" structure composed of 13 3D convolution layers with 3×3×3 kernels, as shown in FIG. 1.
Motion and blotches cause information differences between frames, but global information such as the degree of blur and texture characteristics still has spatio-temporal consistency; likewise, properties such as brightness have spatial consistency, and this global information can assist the restoration of old films. To fully extract and equalize this global information, an "encoder-decoder" structure of 3D convolution layers is used as the smoothing module. 3×3×3 3D convolutions are used for global video feature extraction, which greatly enlarges the receptive field of the network. The smoothing module extracts mid-/low-frequency features directly from the original multi-frame input, and through the accumulation of 13 convolution layers the obtained features have strong representational power. The extracted features are fused with the high-frequency feature M_t of interest by addition.
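A rough sketch of such a smoothing branch is given below; the channel count, stride pattern, activation and the way the current-frame feature is selected are assumptions, and only the 13-layer 3D-convolution "encoder-decoder" form follows the text:

# A rough sketch of a 13-layer 3D-convolution encoder-decoder smoothing module.
# Input: a clip of shape (N, 3, T, H, W); output: a mid-/low-frequency feature map
# for the middle frame. Channels, strides and the final upsampling are assumptions.
import torch
import torch.nn as nn

def conv3d(cin, cout, stride=(1, 1, 1)):
    return nn.Sequential(nn.Conv3d(cin, cout, 3, stride=stride, padding=1),
                         nn.LeakyReLU(0.1, inplace=True))

class SmoothingModule(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.encoder = nn.Sequential(                    # 7 layers, two spatial downsamplings
            conv3d(3, c), conv3d(c, c),
            conv3d(c, c, stride=(1, 2, 2)), conv3d(c, c), conv3d(c, c),
            conv3d(c, c, stride=(1, 2, 2)), conv3d(c, c),
        )
        self.decoder = nn.Sequential(                    # 6 more layers (13 in total)
            conv3d(c, c), conv3d(c, c), conv3d(c, c),
            conv3d(c, c), conv3d(c, c), conv3d(c, c),
        )
        self.up = nn.Upsample(scale_factor=(1, 4, 4), mode='trilinear', align_corners=False)

    def forward(self, clip):
        feat = self.decoder(self.encoder(clip))          # (N, C, T, H/4, W/4)
        feat = self.up(feat)                             # restore the spatial size
        return feat[:, :, feat.shape[2] // 2]            # (N, C, H, W), current-frame slice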
The old movie data simulation flow is described as follows:
As shown in FIG. 5, training the end-to-end, strongly supervised deep neural network proposed by the invention requires a large number of paired high-quality and low-quality films, but manually restored high-definition film resources are relatively scarce. The invention therefore adopts a new data simulation flow that uses film-noise image materials to automatically generate artificially simulated old-film data; the old-film dataset is produced by artificially degrading high-resolution video.
Operations in the degradation process fall into two categories, inter-frame consistent operations and inter-frame independent operations, according to whether the parameters of an operation stay the same across the N consecutive input frames. Intuitively, the blotches and noise in real old films are temporally discontinuous and should be modeled as inter-frame independent operations, whereas global degradations such as blur and JPEG compression are temporally consistent. The original images first undergo consistent preprocessing to obtain the ground-truth image sequence {y} of the data pair; the preprocessing includes flipping, rotation and random cropping, which increases the diversity of the data. The images are then processed by the independent operations and the consistent operations in turn to apply a simple degradation. In the experiments, the independent operation is adding Gaussian noise, and the consistent operations are adding Gaussian blur and downsampling. Finally, the low-quality image sequence {x} is obtained through a material-based degradation process. The material-based degradation may be applied twice to each frame, i.e. two noise-material images are applied, to increase the richness of the degradation. It was verified that images degraded by the flow of FIG. 5 are not only visually similar to old films but also varied in style.
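The simple-degradation part of this flow could be sketched as follows; the noise levels, the blur sigma range and the ×4 downsampling factor are illustrative assumptions:

# A sketch of the simple degradation: per-frame (inter-frame independent) Gaussian noise,
# then clip-wide (inter-frame consistent) Gaussian blur and downsampling.
import random
import numpy as np
import cv2

def simple_degrade(frames, scale=4):
    # frames: list of HxWx3 uint8 ground-truth frames (already flipped/rotated/cropped)
    sigma_blur = random.uniform(0.5, 2.0)                # consistent parameter, one draw per clip
    low_quality = []
    for f in frames:
        img = f.astype(np.float32)
        img += np.random.normal(0, random.uniform(2, 10), img.shape)   # independent: noise per frame
        img = cv2.GaussianBlur(img, (0, 0), sigma_blur)                 # consistent: blur
        img = cv2.resize(img, (f.shape[1] // scale, f.shape[0] // scale),
                         interpolation=cv2.INTER_CUBIC)                 # consistent: downsampling
        low_quality.append(np.clip(img, 0, 255).astype(np.uint8))
    return low_quality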
Experimental verification: for super-resolution and other image-restoration tasks, Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) are common evaluation metrics, but both are full-reference metrics and cannot be applied to the analysis of results on real old films; moreover, they do not always agree with subjective human perception. In 2021, Khrulkov et al. proposed Neural SbS (Neural Side-By-Side), a no-reference metric oriented to the super-resolution task; it takes a pair of images as input and gives a relative image-quality score in the range (0, 1). If the two input images are identical, the result is 0.5. The combined method "DeepRemaster + EDVR" is used here as the reference method. For analysing the blotch-removal effect, three image-quality metrics are used: BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator), NIQE (Natural Image Quality Evaluator) and LPIPS (Learned Perceptual Image Patch Similarity). BRISQUE is based on MSCN (Mean Subtracted Contrast Normalized) coefficients, which show different distributions in natural and distorted images and from which the type of distortion and the perceived quality can be predicted. NIQE improves on BRISQUE, except that it models only the sharp regions of the image and does not require training on a human-scored dataset. LPIPS uses a deep neural network to simulate human visual perception: the image under test and a reference image are fed into the same pre-trained network and the distance between their deep features is computed; a smaller distance indicates more similar images. BRISQUE and NIQE are no-reference metrics measuring how close an image is to a natural image, while LPIPS reflects subjective human perception.
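As a usage note, LPIPS as described can be computed with the publicly available lpips package; the package and backbone choice are assumptions, since the text does not name an implementation:

# A small usage sketch of the LPIPS metric using the open-source `lpips` package
# (AlexNet backbone assumed; inputs are NCHW tensors scaled to [-1, 1]).
import torch
import lpips

metric = lpips.LPIPS(net='alex')
img_test = torch.rand(1, 3, 256, 256) * 2 - 1     # image under test
img_ref = torch.rand(1, 3, 256, 256) * 2 - 1      # reference image
distance = metric(img_test, img_ref)              # smaller = perceptually more similar
print(distance.item())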
Because there are no other comparable comprehensive models for the old-film super-resolution task, ML3Dex and DeepRemaster were each combined with EDVR as comparison methods. The residual-block and channel-number settings of the EDVR pre-trained model are the same as those of the present system, so that the numbers of parameters are similar.
Tables 1 and 2 show the quantitative results, with the best scores in bold. Both tables show that the invention performs best on all metrics other than PSNR and SSIM. Although the restored video of the invention is not the closest to the original image at the pixel level, it has advantages in picture sharpness, naturalness and agreement with human perception.
Table 1 comparison of quantitative results on artificial simulation data
(Table 1 is reproduced as an image in the original publication; its values are not available here.)
Note: bold font marks the best value in each column.
Table 2 comparison of quantitative results on real movies
(Table 2 is reproduced as an image in the original publication; its values are not available here.)
Note: bold font marks the best value in each row. Film clip 1 is taken from "Factory Gate", clip 2 from "Atomic Bomb Explosion Effect", clip 3 from "Sanmaowan Ji", clip 4 from "Shanggan", clip 5 from "Water Casting Garden", clip 6 from "1933 Panning Girls", and clip 7 from "Train in Station".
FIG. 4 shows the frame-by-frame average brightness of a segment of artificially simulated old film before and after restoration. Clearly, the brightness of the original video varies drastically and the flicker problem is serious; after restoration with the complete network structure, the flicker problem is markedly improved, whereas removing the smoothing module reduces the improvement. The smoothing module therefore also plays a role in temporally equalizing the inter-frame luminance information.
As shown in FIGS. 6 to 8, comparing the restoration of a real old film by the complete network with that by the ablated network without the smoothing module shows that the image generated by the ablated network (FIG. 8) suffers from over-sharpening and produces unrealistic textures. This demonstrates that the smoothing-module branch supplements global information and improves the smoothness of the image to a certain extent.
To train the network model of the invention, the loss function consists of three parts: the pixel loss L_pixel, the Perceptual Loss L_perceptual, and the adversarial loss L_gan. L_pixel computes the pixel-level loss between the high-definition restored image ŷ_t and the ground-truth image y_t; the invention adopts the MSE loss:
L_pixel = E[ ||ŷ_t - y_t||² ],
where E[·] denotes averaging over all ground-truth samples in the batch. However, the pixel-level loss caused by the other degradation problems of old films (e.g. blotches and flicker) is more pronounced than that caused by resolution loss, so using only L_pixel easily leads to insufficient learning of the super-resolution process and to insufficiently sharp outputs; the perceptual loss and the adversarial loss are therefore introduced. The perceptual loss is computed with a pre-trained deep neural network and is essentially the distance between the two images in feature space, which helps obtain results that better match human perception. The VGG19-54 features before activation are selected for computing the perceptual loss, avoiding the overly sparse features and inconsistent brightness of the reconstructed image caused by post-activation features.
The adversarial loss embodies the core idea of Generative Adversarial Networks (GAN): mutual supervision between the Generator and the Discriminator. The proposed end-to-end convolutional network serves as the generator, and the discriminator network adopts the same VGG-style structure as ESRGAN. A relativistic discrimination mechanism is also adopted, in which the discriminator predicts the relative probability that a real image y_t is more realistic than a generated image ŷ_t; this helps the network learn sharper edges and finer textures. Following this relativistic formulation, the adversarial loss of the generator is defined as
L_gan = -E[log(1 - D_Ra(y_t, ŷ_t))] - E[log(D_Ra(ŷ_t, y_t))],
where D_Ra denotes the relativistic discriminator, combined from the outputs of the discriminator D:
D_Ra(x_1, x_2) = σ(D(x_1) - E[D(x_2)]).
Correspondingly, the loss L_D of the discriminator network is defined as
L_D = -E[log(D_Ra(y_t, ŷ_t))] - E[log(1 - D_Ra(ŷ_t, y_t))].
The complete loss of the generator network can be expressed as
L_G = α_1 L_pixel + α_2 L_perceptual + α_3 L_gan,
where α_1, α_2 and α_3 are the corresponding weights.
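A hedged PyTorch sketch of this three-part generator loss is shown below; the VGG feature cut-off, the relativistic adversarial term and the default weights follow the text, while everything else (normalization, backbone weights) is an assumption:

# A sketch of the three-part generator loss (pixel + perceptual + adversarial).
# Inputs sr/hr are assumed already normalized to the VGG input range.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        weights = torchvision.models.VGG19_Weights.IMAGENET1K_V1
        vgg = torchvision.models.vgg19(weights=weights).features[:35]  # up to conv5_4, pre-activation
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg.eval()

    def forward(self, sr, hr):
        return F.mse_loss(self.vgg(sr), self.vgg(hr))

class GeneratorLoss(nn.Module):
    def __init__(self, a1=1.0, a2=1.0, a3=0.05):
        super().__init__()
        self.a1, self.a2, self.a3 = a1, a2, a3
        self.perceptual = PerceptualLoss()
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, sr, hr, d_real, d_fake):
        # d_real / d_fake: raw discriminator logits for the real and generated images
        l_pixel = F.mse_loss(sr, hr)
        l_percep = self.perceptual(sr, hr)
        # relativistic adversarial term: the generated image should look "more real" than the real one
        l_gan = (self.bce(d_real - d_fake.mean(), torch.zeros_like(d_real)) +
                 self.bce(d_fake - d_real.mean(), torch.ones_like(d_fake))) / 2
        return self.a1 * l_pixel + self.a2 * l_percep + self.a3 * l_gan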
To train the network, a dataset is prepared according to the proposed data simulation flow. The material-based degradation is applied to the images after the simple degradation; the specific procedure is as follows. The noise-material images used are classified into black-background, white-background and transparent materials. A noise image S is selected at random and first preprocessed by random flipping, rotation and cropping, and all materials are uniformly converted to the black-background form, in which the pixel values represent the degree of offset and a pixel value of zero represents no offset. The preprocessed noise map is then multiplied by a random factor w ∈ [-1, 1] that sets the transparency of the material. Through this process an image S′ is obtained:
S′ = w · O(S),
where O(·) denotes the preprocessing. Because the cropping region is small and the noise-material images contain large blank areas, a checking mechanism is introduced to ensure that S′ contains effective offset information. First, S′ is binarized with threshold r to obtain a binary image b:
b = threshold(|S′| - μ, r),
where μ is the mean of |S′|. The image b is effectively a mask of the locations where the offset will be added; the sum of all its pixel values reflects the area of the offset region, and a sum of 0 indicates that the crop contains no effective offset information and must be redrawn. Finally, an S′ satisfying the condition is fused into the input frame additively, its positive or negative values making the corresponding region of the input frame brighter or darker.
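The following is only a rough sketch of this procedure under assumed parameters (threshold value, crop size, retry count); the exact material preparation used by the invention may differ:

# A sketch of the material-based blotch degradation: S' = w * O(S), a validity check
# via thresholding, and additive fusion into the frame.
import random
import numpy as np
import cv2

def apply_blotch_material(frame, material, r=0.05, crop=64, max_tries=20):
    # frame: HxWx3 float32 in [0, 1]; material: float32 in [0, 1], already converted
    # to black-background form (0 = no offset) and at least `crop` pixels on each side
    h, w = frame.shape[:2]
    for _ in range(max_tries):
        s = material.copy()
        if random.random() < 0.5:                           # preprocessing O(S): flip ...
            s = np.fliplr(s)
        s = np.rot90(s, k=random.randint(0, 3))             # ... rotation ...
        y0 = random.randint(0, s.shape[0] - crop)
        x0 = random.randint(0, s.shape[1] - crop)           # ... and random crop
        s = cv2.resize(np.ascontiguousarray(s[y0:y0 + crop, x0:x0 + crop]), (w, h))
        s_prime = random.uniform(-1.0, 1.0) * s             # S' = w * O(S), w sets the transparency
        b = (np.abs(s_prime) - np.abs(s_prime).mean()) > r  # binary mask of offset pixels
        if b.sum() > 0:                                      # the crop carries effective offsets
            return np.clip(frame + s_prime, 0.0, 1.0)        # additive fusion into the frame
    return frame                                             # no valid crop found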
One reference parameter setting is as follows. The complete training process is divided into two stages. Initially, α_2 = α_3 = 0, i.e. the network is trained with the MSE loss only. After about 600,000 iterations, the perceptual loss and the adversarial loss are added for fine-tuning, with α_1 = α_2 = 1 and α_3 = 0.05, and training continues for another 400,000 iterations. The batch size is 20 and the input image size is 32 × 32. ADAM optimizers are used for both the generator and the discriminator, with parameters β_1 = 0.9 and β_2 = 0.999. The initial learning rate of the generator network is 1 × 10^-5, and 1 × 10^-6 in the second stage; the learning rate of the discriminator is 1 × 10^-7.
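These hyper-parameters could be wired up as in the sketch below; the stand-in networks are placeholders, and only the batch size, patch size, stage lengths, learning rates, betas and loss weights follow the text:

# A sketch of the two-stage schedule and optimizer settings. The tiny stand-in
# networks are placeholders for the real generator and discriminator.
import torch
import torch.nn as nn

generator = nn.Conv2d(3, 3, 3, padding=1)         # stand-in for the full generator network
discriminator = nn.Conv2d(3, 1, 3, padding=1)     # stand-in for the VGG-style discriminator

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-5, betas=(0.9, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-7, betas=(0.9, 0.999))

stage1_iters, stage2_iters = 600_000, 400_000
batch_size, patch_size = 20, 32

def loss_weights(iteration):
    # stage 1: MSE only; stage 2: add perceptual and adversarial losses
    return (1.0, 0.0, 0.0) if iteration < stage1_iters else (1.0, 1.0, 0.05)

def generator_lr(iteration):
    return 1e-5 if iteration < stage1_iters else 1e-6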
Experiments show that the method not only surpasses previous methods on multiple metrics but also has clear advantages in picture sharpness, naturalness and agreement with human perception.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that the different dependent claims and the features described herein may be combined in ways other than as described in the original claims. It is also to be understood that features described in connection with separate embodiments may be used in other described embodiments.

Claims (10)

1. An old film super-resolution system based on a convolutional neural network, characterized by comprising:
a smoothing module, which extracts features from the adjacent previous k frames x_{t-i}, the current frame x_t and the adjacent next k frames x_{t+i} to obtain the mid-/low-frequency global degradation features of all input frames, where k is a positive integer and i = 1, 2, 3, …, k;
a feature extraction module, which extracts features of each input frame separately to obtain the extracted feature of each input frame; the extracted features of all input frames are paired to form 2k+1 image groups, each group consisting of the extracted feature corresponding to the current frame x_t combined in turn with the extracted feature corresponding to one of the input frames (including x_t itself);
a PCD module, which, in each image group, takes the extracted feature corresponding to the current frame x_t as the reference feature and aligns the other extracted feature to the reference feature, obtaining the aligned feature of each image group;
a temporal attention module, which equalizes all the aligned features to obtain the fused high-frequency feature M_t of the current frame x_t;
a reconstruction module, which performs image super-resolution reconstruction on the sum of the mid-/low-frequency global degradation features and the fused high-frequency feature M_t to obtain the reconstructed feature of the current frame x_t;
an addition unit, which adds the reconstructed feature to the up-sampling result of the current frame x_t to obtain the high-definition restored image ŷ_t of the current frame x_t.
2. The old motion picture super-resolution system based on convolutional neural network of claim 1, wherein,
the process of obtaining the aligned features of each set of images by the PCD module includes:
respectively performing first-level downsampling on the reference feature and the other extracted feature to obtain a first-level reference feature and a first-level extracted feature, and respectively performing second-level downsampling on the first-level reference feature and the first-level extracted feature to obtain a second-level reference feature and a second-level extracted feature;
connecting the secondary reference feature with the secondary extracted feature to obtain a secondary offset;
the secondary offset and the secondary extracted features are subjected to deformable convolution treatment to obtain primary aligned features;
the primary offset is obtained by combining the results of the connection of the primary reference feature and the primary extracted feature with the secondary offset;
after the primary offset and the primary extracted features are subjected to deformable convolution treatment, combining the primary aligned features to obtain secondary aligned features;
combining the primary offset with the result of connecting the reference feature and the other extracted feature to obtain the original-scale offset;
performing deformable convolution on the original-scale offset and the other extracted feature, and combining the result with the secondary aligned feature to obtain the tertiary aligned feature;
connecting the reference feature with the tertiary aligned feature to obtain the final offset;
and performing deformable convolution on the final offset and the tertiary aligned feature to obtain the aligned feature of the input frame corresponding to the other extracted feature.
3. The old motion picture super-resolution system based on convolutional neural network of claim 2, wherein,
the process by which the temporal attention module obtains the fused high-frequency feature M_t comprises:
performing primary equalization on the aligned feature of the current frame x_t together with the aligned feature of each of the other frames, respectively, to obtain the primary fused high-frequency feature of each of the other frames;
connecting all primary fused high-frequency features with the aligned feature of the current frame x_t to obtain the connected high-frequency feature, and then convolving and activating the connected high-frequency feature to obtain the fused high-frequency feature M_t.
4. The old film super-resolution system based on convolutional neural network of claim 3,
the primary equalization process includes:
image x of current frame t The aligned features of the (a) and any other frame images are connected to obtain connected features, and the connected features are subjected to convolution operation and ReLU function activation to obtain first-level mixed features; performing convolution operation and sigmoid function activation on the first-level mixed features to obtain second-level mixed features; and after the secondary mixed feature is activated by global average pooling operation and sigmoid function, obtaining time attention, and multiplying the time attention with the aligned feature points of any other frame image to obtain the primary fused high-frequency feature.
5. The old motion picture super-resolution system based on convolutional neural network of claim 4, wherein,
the secondary mixed feature is calculated as:
z_{t-i} = σ(W_2 δ(W_1 [F_t, F_{t-i}])),
where z_{t-i} is the secondary mixed feature corresponding to any other frame, σ(·) denotes the sigmoid function, δ(·) denotes the ReLU function, W_1 is the weight of the convolution layer operating on the connected feature, W_2 is the weight of the convolution layer operating on the primary mixed feature, F_t is the aligned feature of the current frame x_t, and F_{t-i} is the aligned feature of any one of the adjacent previous k frames.
6. The old motion picture super-resolution system based on convolutional neural network of claim 5, wherein,
the temporal attention is calculated as:
s_{t-i} = σ(AvgPooling(z_{t-i})),
where s_{t-i} is the temporal attention corresponding to any other frame and AvgPooling denotes global average pooling.
7. The old motion picture super-resolution system based on convolutional neural network of claim 6, wherein,
the primary fused high-frequency feature is denoted M_{t-i}:
M_{t-i} = s_{t-i} · F_{t-i}.
8. The old motion picture super-resolution system based on a convolutional neural network as recited in claim 7, wherein,
the fused high-frequency feature M_t is:
M_t = φ(W_3 [{M_{t-i}}, F_t]),
where φ(·) denotes the Leaky ReLU function, W_3 is the weight of the 1×1 convolution layer operating on the connected high-frequency features, and {M_{t-i}} denotes the set of all primary fused high-frequency features.
9. The old motion picture super-resolution system based on a convolutional neural network according to any one of claims 1 to 8,
the feature extraction module comprises 5 residual blocks; the reconstruction module comprises 10 residual blocks; the number of channels per residual block is 64.
10. The old motion picture super-resolution system based on a convolutional neural network according to any one of claims 1 to 8,
the smoothing module employs an "encoder-decoder" structure composed of 13 3D convolution layers with 3×3×3 kernels.
CN202210339390.XA 2022-04-01 2022-04-01 Old movie super-resolution system based on convolutional neural network Active CN114663285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210339390.XA CN114663285B (en) 2022-04-01 2022-04-01 Old movie super-resolution system based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210339390.XA CN114663285B (en) 2022-04-01 2022-04-01 Old movie super-resolution system based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN114663285A CN114663285A (en) 2022-06-24
CN114663285B true CN114663285B (en) 2023-06-09

Family

ID=82034088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210339390.XA Active CN114663285B (en) 2022-04-01 2022-04-01 Old movie super-resolution system based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114663285B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110120011A (en) * 2019-05-07 2019-08-13 电子科技大学 A kind of video super resolution based on convolutional neural networks and mixed-resolution
CN111833261A (en) * 2020-06-03 2020-10-27 北京工业大学 Image super-resolution restoration method for generating countermeasure network based on attention
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion
CN114202463A (en) * 2021-12-15 2022-03-18 陕西师范大学 Video super-resolution method and system for cloud fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345449B (en) * 2018-07-17 2020-11-10 西安交通大学 Image super-resolution and non-uniform blur removing method based on fusion network

Also Published As

Publication number Publication date
CN114663285A (en) 2022-06-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant