CN116681594B - Image processing method and device, equipment and medium - Google Patents
- Publication number
- CN116681594B CN116681594B CN202310926580.6A CN202310926580A CN116681594B CN 116681594 B CN116681594 B CN 116681594B CN 202310926580 A CN202310926580 A CN 202310926580A CN 116681594 B CN116681594 B CN 116681594B
- Authority
- CN
- China
- Prior art keywords
- acquired image
- image
- network
- super
- model
- Legal status: Active
Classifications
- G06T3/4053 — Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06T3/4046 — Scaling of whole images or parts thereof using neural networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06V10/40 — Extraction of image or video features
- G06V10/62 — Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
- G06V10/82 — Arrangements for image or video recognition or understanding using neural networks
Abstract
The present disclosure relates to the field of image processing technology, and in particular to an image processing method, apparatus, device, and medium. The method is applied to a super-division model that includes at least one processing unit, each processing unit including a sparse convolution network and an up-sampling network. The method includes: inputting a first acquired image into the sparse convolution network to extract spatial-domain features of the first acquired image, and determining temporal-domain features of the first acquired image based on an acquired image sequence; determining spatio-temporal features of the first acquired image based on the spatial-domain and temporal-domain features; and inputting the spatio-temporal features of the first acquired image into the up-sampling network to obtain a first super-division image corresponding to the first acquired image, where the resolution of the first super-division image is higher than that of the first acquired image. The image processing method, apparatus, device, and medium can process the image super-division task efficiently and in real time while guaranteeing the spatio-temporal continuity of the super-division result, achieving smooth pictures and a stable frame rate.
Description
Technical Field
The present disclosure relates to the technical field of image processing, and in particular to an image processing method, apparatus, device, and medium.
Background
Image super-resolution is a type of visual generation task (referred to simply as the image super-division task) that reconstructs low-resolution images into high-resolution images. It is commonly applied to image enhancement for mobile phones, digital cameras, and geographic information, as well as to the compression and reconstruction of live video.
In the related art, the image super-division task can be realized through interpolation algorithms such as cubic spline interpolation, or through algorithms such as Laplacian pyramids and sparse coding. The task can also be realized through deep learning. Starting in 2014, a series of neural networks appeared at international computer vision and pattern recognition conferences, such as the Super-Resolution Convolutional Neural Network (SRCNN), the Enhanced Deep Super-Resolution (EDSR) network, and the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN); these networks can magnify a single image by a factor of 2 to 4. In 2022, VRT used the attention mechanism of the Transformer to design a deep learning model with better results, filling in the missing pixels of an image with prior knowledge learned from training on big data to obtain clear textures and image details; however, such deep learning models consume a large amount of computation.
The methods of the related art still have problems. For example, conventional super-division techniques cannot achieve large magnification factors or good visual quality. Deep learning methods are slow and inefficient to run, making it difficult to obtain both quality and performance at once. In addition, the details of images currently reconstructed through deep learning may be inconsistent with the spatio-temporal information and fail to produce good results over consecutive frames, so the super-division reconstructed content is discontinuous and prone to flickering, blurring, and artifacts.
Disclosure of Invention
In view of this, the present disclosure provides an image processing method, apparatus, device, and medium that can process the image super-division task efficiently and in real time while guaranteeing the spatio-temporal continuity of the super-division result, achieving smooth pictures and a stable frame rate.
According to an aspect of the present disclosure, there is provided an image processing method applied to a super-division model including at least one processing unit, each processing unit including a sparse convolution network for feature extraction with a sparse convolution kernel and an up-sampling network for up-sampling, the method comprising:
Inputting a first acquired image into the sparse convolution network, extracting spatial domain characteristics of the first acquired image, and determining time domain characteristics of the first acquired image based on an acquired image sequence, wherein the first acquired image represents an acquired image to be subjected to superdivision processing in the acquired image sequence;
determining spatio-temporal features of the first acquired image based on the spatial features and the temporal features;
and inputting the space-time characteristics of the first acquired image into the up-sampling network to obtain a first super-resolution image corresponding to the first acquired image, wherein the resolution of the first super-resolution image is higher than that of the first acquired image.
In this method, the spatial-domain features of the image to be super-divided are extracted through the sparse convolution network of the super-division model, its temporal-domain features are determined from the acquired image sequence, and the spatio-temporal features of the image are determined from the spatial-domain and temporal-domain features; the up-sampling network of the super-division model then up-samples the spatio-temporal features to obtain the corresponding super-division image. Because the processing is based on feature information that reflects the image to be super-divided in both the spatial and temporal dimensions, the finally generated super-division image maintains spatio-temporal continuity, avoiding flickering, blurring, and artifacts while keeping the frame rate stable and the picture smooth; meanwhile, the super-division model has high running efficiency, enabling real-time image processing.
In one possible implementation, the number of processing units in the super-division model is determined by the magnification required for the super-division processing; in case the super-division model comprises at least two of the processing units, the method further comprises: taking the super-division image output by the previous processing unit as the input of the sparse convolution network of the next processing unit; and determining the spatio-temporal features input to the up-sampling network of the current processing unit according to the temporal-domain features used by the previous processing unit and the spatial-domain features output by the sparse convolution network of the current processing unit.
By setting the number of processing units as required, the scaling factor can be conveniently modified, i.e., the magnification of the image can be flexibly adjusted. When more than one processing unit is used, the spatial-domain features are repeatedly adjusted with the temporal-domain features of the image to be super-divided, so that image sharpness is better maintained at high magnification, avoiding blurring, aliasing, and artifacts.
In one possible implementation, the time domain features are different in different application scenarios.
By determining different time domain features under different application scenes, the time-space features input to the subsequent up-sampling network under different application scenes are also different, so that the targeted superdivision task is realized.
In one possible implementation, where the super-division model is used for a video enhancement scene, the determining temporal features of the first acquired image based on a sequence of acquired images includes: determining optical flow information of the first acquired image relative to a second acquired image and a third acquired image according to the acquired image sequence, wherein the second acquired image represents an acquired image located before the first acquired image in the acquired image sequence, and the third acquired image represents an acquired image located after the first acquired image in the acquired image sequence; and determining the temporal-domain features of the first acquired image according to the optical flow information.
Determining the temporal-domain features of the image to be super-divided from its optical flow information, and using these features to adjust and optimize subsequent feature maps, helps generate super-division images that better fit the video enhancement scene and makes the super-division task targeted.
In one possible implementation, where the super-division model is used for a game rendering scene, the determining temporal features of the first acquired image based on a sequence of acquired images includes: determining jitter offset information and motion vectors of the first acquired image relative to a second acquired image and a third acquired image according to the acquired image sequence, wherein the second acquired image represents an acquired image located before the first acquired image in the acquired image sequence, and the third acquired image represents an acquired image located after the first acquired image in the acquired image sequence; and determining the temporal-domain features of the first acquired image according to the jitter offset information and the motion vectors.
Determining the temporal-domain features of the image to be super-divided from its jitter offset information and motion vectors, and using these features to adjust and optimize subsequent feature maps, helps generate super-division images that better fit the game rendering scene and makes the super-division task targeted.
In one possible implementation, the determining the spatio-temporal feature of the first acquired image based on the spatial domain feature and the temporal domain feature includes: and adjusting the spatial domain features by using the time domain features to obtain the spatial-temporal features of the first acquired image.
By utilizing the time domain feature to adjust and optimize the space domain feature, the space-time feature can be ensured to reflect the feature information of the to-be-superdivided image in two dimensions of space and time, and the time-space continuity of the superdivided image generated subsequently can be ensured.
In one possible implementation, the sparse convolution network includes at least a first sparse convolution layer, a 1×1 convolution layer, and a second sparse convolution layer.
By placing a 1×1 convolution layer between the two sparse convolution layers, a normalization-like effect is obtained with a reduced amount of computation, which helps complete the super-division task efficiently and reduces energy consumption during operation.
In one possible implementation, the upsampling network includes a sub-pixel convolution layer.
Up-sampling is achieved through sub-pixel convolution, more texture regions can be reserved in a low-resolution space, a better reconstruction effect can be obtained, and efficient, rapid and parameter-free pixel rearrangement can be achieved.
In one possible implementation manner, the training process of the superdivision model at least includes: initializing network parameters of the sparse convolution network; determining a gradient of an objective function based on an input image sequence of a current network layer of the sparse convolutional network, a weight matrix of the current network layer and optimization parameters, wherein the input image sequence is derived from a training sample set, the training sample set is set according to an application scene of the super-division model, and the objective function represents a coding length function related to the input image sequence and the optimization parameters; and along the gradient direction, forward updating each network layer of the sparse convolution network with the objective function maximization as a target until the sparse convolution network meeting the preset condition is obtained.
The white box model is obtained through the forward updating mode training, so that the training and modifying time of the model is reduced, the model with strong interpretation is obtained, the reason for updating the weight of each network layer is known, and the manual modification of a user is facilitated.
In a possible implementation manner, the preset condition is that the number of times of gradient updating reaches a first threshold value, or the preset condition is that the value of the objective function is equal to or greater than a second threshold value.
By setting different preset conditions, the method is beneficial to flexibly selecting proper preset conditions according to actual requirements so as to obtain the trained superscore model.
According to another aspect of the present disclosure, there is provided an image processing apparatus applied to a super-division model including at least one processing unit, each processing unit including a sparse convolution network for feature extraction using a sparse convolution kernel and an up-sampling network for up-sampling, the apparatus comprising:
a feature extraction module configured to input a first acquired image into the sparse convolutional network, extract spatial features of the first acquired image, and determine temporal features of the first acquired image based on a sequence of acquired images, wherein the first acquired image represents an acquired image to be superprocessed in the sequence of acquired images;
A feature determination module configured to determine spatiotemporal features of the first acquired image based on the spatial features and the temporal features;
and the up-sampling module is configured to input the space-time characteristics of the first acquired image into the up-sampling network to obtain a first super-resolution image corresponding to the first acquired image, wherein the resolution of the first super-resolution image is higher than that of the first acquired image.
In one possible implementation, the number of processing units in the super-division model is determined by the magnification required for the super-division processing; in case the super-division model comprises at least two of the processing units, the apparatus further comprises: a processing module configured to take the super-division image output by the previous processing unit as an input of the sparse convolution network of the next processing unit; a determining module configured to determine the spatio-temporal features input to the up-sampling network of the current processing unit based on the temporal-domain features used by the previous processing unit and the spatial-domain features output by the sparse convolution network of the current processing unit.
In one possible implementation, the time domain features are different in different application scenarios.
In one possible implementation, where the super-division model is used for a video enhancement scene, the determining temporal features of the first acquired image based on a sequence of acquired images includes: determining optical flow information of the first acquired image relative to a second acquired image and a third acquired image according to the acquired image sequence, wherein the second acquired image represents an acquired image located before the first acquired image in the acquired image sequence, and the third acquired image represents an acquired image located after the first acquired image in the acquired image sequence; and determining the temporal-domain features of the first acquired image according to the optical flow information.
In one possible implementation, where the super-division model is used for a game rendering scene, the determining temporal features of the first acquired image based on a sequence of acquired images includes: determining jitter offset information and motion vectors of the first acquired image relative to a second acquired image and a third acquired image according to the acquired image sequence, wherein the second acquired image represents an acquired image located before the first acquired image in the acquired image sequence, and the third acquired image represents an acquired image located after the first acquired image in the acquired image sequence; and determining the temporal-domain features of the first acquired image according to the jitter offset information and the motion vectors.
In one possible implementation, the determining the spatio-temporal feature of the first acquired image based on the spatial domain feature and the temporal domain feature includes: and adjusting the spatial domain features by using the time domain features to obtain the spatial-temporal features of the first acquired image.
In one possible implementation, the sparse convolution network includes at least a first sparse convolution layer, a 1×1 convolution layer, and a second sparse convolution layer.
In one possible implementation, the upsampling network includes a sub-pixel convolution layer.
In one possible implementation manner, the training process of the superdivision model at least includes: initializing network parameters of the sparse convolution network; determining a gradient of an objective function based on an input image sequence of a current network layer of the sparse convolutional network, a weight matrix of the current network layer and optimization parameters, wherein the input image sequence is derived from a training sample set, the training sample set is set according to an application scene of the super-division model, and the objective function represents a coding length function related to the input image sequence and the optimization parameters; and along the gradient direction, forward updating each network layer of the sparse convolution network with the objective function maximization as a target until the sparse convolution network meeting the preset condition is obtained.
In a possible implementation manner, the preset condition is that the number of times of gradient updating reaches a first threshold value, or the preset condition is that the value of the objective function is equal to or greater than a second threshold value.
According to another aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described image processing method when executing the instructions stored in the memory.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described image processing method.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a schematic diagram of an image processing method provided according to an embodiment of the present disclosure.
Fig. 2 shows a flowchart of an image processing method provided according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of an image processing method provided according to an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of an acquisition process of a superdivision model provided according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of an image processing method provided according to an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of a super-division task execution process provided according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of an operating platform provided in accordance with an embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of a video enhancement task execution process provided according to an embodiment of the present disclosure.
Fig. 9 shows a schematic diagram of a game rendering task execution process provided according to an embodiment of the present disclosure.
Fig. 10 shows a block diagram of an image processing apparatus provided according to an embodiment of the present disclosure.
Fig. 11 shows a block diagram of an apparatus for performing an image processing method according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
In order to facilitate understanding of the technical solutions provided by the embodiments of the present disclosure by those skilled in the art, a technical environment in which the technical solutions are implemented is described below.
Image super-resolution is a type of visual generation task, i.e., reconstructing a low-resolution image into a high-resolution image, which may be simply referred to as an image super-resolution task. The image superdivision task is generally applied to image enhancement of mobile phones, digital cameras and geographic information, or can also be applied to compression and reconstruction of live video.
The following describes, in turn, single-frame image super-division, multi-frame image super-division, and supersampling within the image super-division task.
Single-frame image super-division performs super-division processing on one image. Traditionally, the image is interpolated, for example with cubic spline interpolation, or algorithms such as Laplacian pyramids and sparse coding are used to realize image super-division. Such conventional methods process images quickly, but their magnification and super-division quality are not ideal. With the rapid development of deep learning, reconstructing images with deep learning algorithms can achieve magnification and super-division quality far superior to conventional methods. Although deep learning super-division can obtain sharper images, it still has some problems. For example, the reconstructed super-division image details are filled in by an artificial intelligence (AI) model and may therefore deviate from the actual image. Furthermore, the reconstructed details may not match the spatio-temporal information, so that good super-division results cannot be obtained on consecutive image frames (such as video and game scenes); instead, the reconstructed content becomes discontinuous. For example, in a dynamic scene where an object is in free fall, a super-division image reconstructed from a single frame may become sharp, which seemingly achieves the goal of reconstructing a low-resolution image into a high-resolution one, but does not match the real motion blur of the falling object; the free-fall process becomes unnaturally crisp, making a dynamic object look like a static one after super-division (and, similarly, a static object may come to look like a dynamic one). In addition, deep learning algorithms reconstruct images very slowly, making real-time image super-division difficult to realize.
Multi-frame image super-division performs super-division processing on multiple frames. It can be used to process video: consecutive frames of a low-resolution video are typically taken as input and, combined with image fusion techniques, super-division reconstructed to output a high-resolution video. Similarly, multi-frame image super-division is also used for photographic works, generally taking the multiple exposures produced by photography as input and, combined with image fusion techniques, outputting a finer image. The related art mostly uses models such as deformable convolution (DC) networks, recurrent neural networks (RNN), or three-dimensional convolutional neural networks. Although these achieve magnification and super-division quality far superior to conventional methods, their processing speed and operating efficiency are not ideal, real-time video streaming cannot be realized, and it is difficult to obtain both quality and performance.
Supersampling is a technique proposed for the field of game rendering. Because game rendering is generally realized by sampling, and a higher sampling rate yields a finer rendering result, a low-resolution video is usually rendered at a low sampling rate and then processed with super-resolution technology to obtain high-resolution images, which improves rendering efficiency. Examples include the currently applied Nvidia Image Scaling (NIS) and super-resolution sharpening techniques, which use conventional methods such as Edge Adaptive Spatial Upsampling (EASU) and Robust Contrast Adaptive Sharpening (RCAS), as well as supersampling based on deep neural network methods, such as current Deep Learning Super Sampling (DLSS). However, existing deep learning technology still has limitations: its interpretability is limited, it relies on huge training datasets, and its generalization is poor; it appears simple and easy to use on the surface but cannot be modified according to the characteristics of the task. Moreover, most model weights correspond to only one scaling factor, so adjusting the scaling rate requires replacing the weights, which makes weight updating during training cumbersome and model training time-consuming. Furthermore, since deep learning models place certain demands on device computing power, applying them in industry often requires a series of model compression methods, such as quantization, pruning, or targeted compression of operators in the model, to meet the requirements of video and game rendering; quantization and compression lose precision and greatly degrade the final picture quality. In addition, under specific conditions the solutions provided by the related art can still produce ghosting around moving objects in video, or cause flickering and moiré patterns on the surfaces of static objects, while an unstable frame rate leads to stuttering pictures, delayed response, poor feel, and freezing. Finally, game pictures are real-time and computationally demanding, and real-time rendering is difficult to achieve with existing solutions.
According to the image processing method provided by the embodiments of the present disclosure, the spatial-domain features of the image to be super-divided are extracted through the sparse convolution network of the super-division model, the temporal-domain features of the image are determined from the acquired image sequence, the spatio-temporal features are then determined from the spatial-domain and temporal-domain features, and the up-sampling network of the super-division model up-samples the spatio-temporal features to obtain the corresponding super-division image. Because the super-division processing is based on feature information that reflects the image to be super-divided in both the spatial and temporal dimensions, the finally generated super-division image maintains spatio-temporal continuity, avoiding flickering, blurring, and artifacts while keeping the frame rate stable and the picture smooth; meanwhile, the super-division model has high operating efficiency, realizing image generation tasks with real-time response.
In addition, the super-division model provided by the embodiment of the disclosure can flexibly set the number of processing units, so that the magnification is ensured, meanwhile, the definition of the image is kept, and the problems of blurring, sawtooth and artifact are avoided. Furthermore, the embodiment of the disclosure provides training the white-box model by a forward updating mode to obtain the superdivision model, and the superdivision model has stronger interpretation and generalization while obviously reducing the model training time and simplifying the model weight modification process. In addition, according to the embodiment of the disclosure, different inputs are respectively set in the video enhancement scene and the game rendering scene, so that the super-resolution result meeting the requirements of the corresponding scene can be correspondingly generated, and the method has high pertinence.
The embodiment of the disclosure provides an image processing method. The image processing method can be applied to the super-division model, and the super-division image corresponding to the original image can be output after the original image to be super-divided is input into the super-division model and is processed by the super-division model. Fig. 1 shows a schematic diagram of an image processing method provided according to an embodiment of the present disclosure. As shown in fig. 1, the super-division model may include a sparse convolution network and an upsampling network, wherein the sparse convolution network may be used to perform feature extraction on an original image to be super-divided by using a sparse convolution kernel, and the upsampling network may be used to perform upsampling.
Fig. 2 shows a flowchart of an image processing method provided according to an embodiment of the present disclosure. As shown in fig. 2, the image processing method may include:
step S201, inputting the first acquired image into a sparse convolution network, extracting spatial domain features of the first acquired image, and determining the temporal domain features of the first acquired image based on the acquired image sequence.
Step S202, determining the space-time characteristics of the first acquired image based on the space-domain characteristics and the time-domain characteristics.
Step S203, inputting the space-time characteristics of the first acquired image into an up-sampling network to obtain a first super-resolution image corresponding to the first acquired image.
The first acquired image may represent an acquired image to be super-divided in the sequence of acquired images. By executing steps S201 to S203, the super-division processing of the first acquired image can be realized, resulting in a first super-division image corresponding to the first acquired image. The first super-division image may represent the super-division image obtained by performing super-division processing on the first acquired image. The resolution of the first super-division image is higher than the resolution of the first acquired image.
In the embodiments of the present disclosure, the acquired image sequence may include at least the acquired image to be super-processed (i.e., the first acquired image), an acquired image located before it (i.e., the second acquired image), and an acquired image located after it (i.e., the third acquired image). The acquired images in the sequence may be obtained by spaced sampling; for example, the sequence may include the (T-2)-th, T-th, and (T+2)-th frames. Alternatively, they may be obtained by continuous sampling; for example, the sequence may include the (T-1)-th, T-th, and (T+1)-th frames. The embodiments of the present disclosure do not limit this.
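As an illustration only (not part of the disclosed method), the two sampling strategies can be sketched as follows; the helper name and the step size of 2 for spaced sampling mirror the T-2/T/T+2 example above and are otherwise assumptions.

```python
from typing import List, Sequence

def build_acquired_sequence(frames: Sequence, t: int, spaced: bool = False) -> List:
    """Return [second, first, third] acquired images around frame index t."""
    step = 2 if spaced else 1   # spaced: T-2, T, T+2; continuous: T-1, T, T+1
    if t - step < 0 or t + step >= len(frames):
        raise IndexError("sampling window falls outside the captured frames")
    return [frames[t - step], frames[t], frames[t + step]]
```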
By executing steps S201 to S203, the spatial-domain features of the image to be super-divided are extracted through the sparse convolution network of the super-division model, the temporal-domain features are determined from the acquired image sequence, the spatio-temporal features are determined from the spatial-domain and temporal-domain features, and the up-sampling network of the super-division model up-samples the spatio-temporal features to obtain the corresponding super-division image. In this way, the super-division processing is based on feature information reflecting the image to be super-divided in both the spatial and temporal dimensions, so the finally generated super-division image maintains spatio-temporal continuity, avoiding flickering, blurring, and artifacts while keeping the frame rate stable and the picture smooth; meanwhile, the super-division model has high operating efficiency, realizing real-time image processing.
The spatial and temporal features of the first acquired image may be extracted by performing step S201.
In step S201, the features of the first acquired image may be extracted through the sparse convolution network of the super-division model to obtain the spatial-domain features corresponding to the first acquired image. In one possible implementation, the sparse convolution network may include at least a first sparse convolution layer, a 1×1 convolution layer, and a second sparse convolution layer. As shown in FIG. 1, the first acquired image, as the input of the sparse convolution network, passes sequentially through the first sparse convolution layer, the 1×1 convolution layer, and the second sparse convolution layer, and the spatial-domain features of the first acquired image are output. Unlike the batch normalization used in the related art, the embodiments of the present disclosure place a 1×1 convolution layer between the two sparse convolution layers; this has a similar normalization effect while reducing the amount of computation, which helps complete the super-division task efficiently and reduces energy consumption during operation.
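A minimal PyTorch sketch of this block is shown below. Dense nn.Conv2d layers stand in for the patent's sparse-kernel convolutions (whose exact sparsity pattern is not specified here), and all channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseConvBlock(nn.Module):
    """First sparse conv -> 1x1 conv (normalization-like) -> second sparse conv."""
    def __init__(self, in_channels: int = 3, channels: int = 64):
        super().__init__()
        # Stand-ins for the sparse-kernel convolutions described in the text.
        self.sparse_conv1 = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
        # The 1x1 convolution takes the place of batch normalization, as described above.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.sparse_conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Input: acquired image (N, C, H, W); output: spatial-domain features.
        return self.sparse_conv2(self.pointwise(self.sparse_conv1(x)))
```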
In step S201, the temporal feature of the first acquired image may also be obtained by processing the acquired image sequence. The temporal features may be used to adjust the spatial features to obtain spatio-temporal features (see below for details).
In one possible implementation, the time domain features are different in different application scenarios. In this way, different time domain features are determined under different application scenes, so that the time-space features input to the subsequent up-sampling network under different application scenes are also different, and the targeted superdivision task is realized. That is, when the superdivision model is used for different application scenarios, the time domain features used may be different. The following description will take video enhancement scenes and game rendering scenes as examples.
When the superdivision model is used for video-enhanced scenes, the temporal features used may include at least optical flow information, where optical flow information may refer to the instantaneous velocity of the spatial moving object's pixel motion on the viewing imaging plane. In one possible implementation, where the super-division model is used for a video enhancement scene, determining temporal features of the first acquired image based on the sequence of acquired images in step S201 may include: determining optical flow information of the first acquired image relative to the second acquired image and the third acquired image according to the acquired image sequence, wherein the optical flow information can be the instantaneous speed of the first acquired image relative to the second acquired image and the third acquired image; a temporal feature of the first acquired image is determined from the optical flow information. In this way, the time domain characteristics of the super-division image are determined through the optical flow information of the super-division image, and the time domain characteristics are utilized to adjust and optimize the subsequent characteristic images, so that the super-division image which is more fit with the video enhancement scene is generated, and the super-division task has pertinence.
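For illustration, a hedged sketch of deriving such a temporal feature follows; OpenCV's Farneback dense flow is an assumed stand-in, since the disclosure does not name a particular optical-flow estimator, and the function name is hypothetical.

```python
import cv2
import numpy as np

def temporal_feature_video(second_img: np.ndarray, first_img: np.ndarray,
                           third_img: np.ndarray) -> np.ndarray:
    """Temporal feature of the first acquired image from optical flow
    relative to its neighbouring (second and third) acquired images."""
    gray = lambda im: cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    flow_prev = cv2.calcOpticalFlowFarneback(gray(second_img), gray(first_img),
                                             None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flow_next = cv2.calcOpticalFlowFarneback(gray(first_img), gray(third_img),
                                             None, 0.5, 3, 15, 3, 5, 1.2, 0)
    # Stack both flows into one (H, W, 4) temporal feature map.
    return np.concatenate([flow_prev, flow_next], axis=-1)
```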
When the superdivision model is used for game rendering of a scene, the temporal features used may include at least jitter offset information, which may refer to the offset of pixels resulting from image jitter, and motion vectors, which may refer to the relative displacement between the current encoded block and the best matching block in its reference image. In one possible implementation, where the superscore model is used for game rendering of a scene, determining temporal features of the first captured image based on the captured image sequence in step S201 may include: according to the acquired image sequence, shake offset information and motion vectors of the first acquired image relative to the second acquired image and the third acquired image are determined, wherein the shake offset information can be offset of the first acquired image relative to the second acquired image and the third acquired image, and the motion vectors can be displacement information of the first acquired image relative to the second acquired image and the third acquired image; and determining the time domain characteristics of the first acquired image according to the jitter offset information and the motion vector. In this way, the time domain characteristics of the super-resolution image are determined through the jitter offset information and the motion vector of the super-resolution image, and the time domain characteristics are utilized to adjust and optimize the subsequent characteristic images, so that the super-resolution image which is more attached to the game rendering scene is generated, and the super-resolution task has pertinence.
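By way of illustration, in a game rendering scene the engine typically exposes per-pixel motion vectors and the camera jitter offset of each frame directly, so the temporal feature can be assembled without estimation; the sketch below assumes such engine outputs, and all names are hypothetical.

```python
import numpy as np

def temporal_feature_game(motion_vectors: np.ndarray,
                          jitter_offset: tuple) -> np.ndarray:
    """Temporal feature from per-pixel motion vectors (H, W, 2) of the first
    acquired image relative to its neighbours, plus the frame's jitter offset."""
    h, w, _ = motion_vectors.shape
    jitter_map = np.broadcast_to(
        np.asarray(jitter_offset, dtype=np.float32), (h, w, 2))
    # (H, W, 4): motion vector and jitter offset per pixel.
    return np.concatenate([motion_vectors, jitter_map], axis=-1)
```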
Of course, the superdivision model may also be applied to other application scenarios, which are not limited by the embodiments of the present disclosure.
The spatiotemporal features of the first acquired image may be obtained by performing step S202.
In step S202, the spatio-temporal features corresponding to the first acquired image may be obtained through the spatial features and the temporal features. In one possible implementation, step S202 may include: and adjusting the spatial characteristics by utilizing the time domain characteristics to obtain the space-time characteristics of the first acquired image. The adjustment method may be to directly add the first feature map representing the time domain feature and the second feature map representing the space domain feature to obtain the target feature map representing the space-time feature. Alternatively, the adjustment may be performed by weighted addition, that is, by multiplying a first feature map representing a time domain feature by a first coefficient to obtain a third feature map, multiplying a second feature map representing a spatial domain feature by a second coefficient to obtain a fourth feature map, and adding the third feature map to the fourth feature map to obtain a target feature map representing a spatial-temporal feature. In this way, the spatial domain characteristics are adjusted and optimized by utilizing the time domain characteristics, so that the time-space characteristics can be ensured to reflect the characteristic information of the to-be-superdivision image in two dimensions of space and time, and the time-space continuity of the subsequently generated superdivision image can be ensured.
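Both adjustment options reduce to one line of code. In this sketch the temporal feature map is assumed to have already been brought to the same shape as the spatial feature map (e.g., by a projection layer), and the coefficient names are illustrative; direct addition is the special case alpha = beta = 1.

```python
import torch

def fuse_spatiotemporal(spatial_feat: torch.Tensor, temporal_feat: torch.Tensor,
                        alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Direct or weighted addition of the feature maps, yielding the target
    feature map of spatio-temporal features."""
    return alpha * temporal_feat + beta * spatial_feat
```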
A first hyper-resolution image corresponding to the first acquired image can be obtained by executing step S203.
In step S203, the spatio-temporal features of the first acquired image may be processed through the up-sampling network of the super-division model to obtain the first super-division image corresponding to the first acquired image. In one possible implementation, the up-sampling network may include a sub-pixel convolution layer. As shown in fig. 1, the spatio-temporal features of the first acquired image serve as the input of the up-sampling network, which outputs the first super-division image corresponding to the first acquired image through its sub-pixel convolution layer. The up-sampling network rearranges the pixels of the input low-resolution (LR) features through the PixelShuffle structure of sub-pixel convolution, thereby outputting a high-resolution (HR) image. In this way, up-sampling is achieved through sub-pixel convolution, more texture regions can be retained in the low-resolution space for a better reconstruction effect, and efficient, fast, parameter-free pixel rearrangement is realized.
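A minimal PyTorch sketch of such an up-sampling network follows; the channel counts and the 3×3 expansion convolution are assumptions, while nn.PixelShuffle itself is the parameter-free rearrangement step described above.

```python
import torch
import torch.nn as nn

class UpsamplingNetwork(nn.Module):
    """Sub-pixel up-sampling: expand channels by scale**2, then rearrange."""
    def __init__(self, channels: int = 64, out_channels: int = 3, scale: int = 2):
        super().__init__()
        self.expand = nn.Conv2d(channels, out_channels * scale ** 2,
                                kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)  # parameter-free pixel rearrangement

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (N, C, H, W) spatio-temporal features -> (N, out, H*scale, W*scale) image
        return self.shuffle(self.expand(x))
```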
The structure of the super-division model used in steps S201 to S203 may stack processing units; that is, the super-division model may include at least one processing unit, and as shown in fig. 1, each processing unit may include a sparse convolution network and an up-sampling network. The number of processing units in the super-division model may be determined by the magnification required for the super-division processing (i.e., the target magnification of the image super-division task to be completed). Each stacked processing unit magnifies the acquired image to be super-processed by a factor of two in both length and width, so the total pixel count of the resulting super-division image is quadrupled. Thus, one processing unit may be used when the acquired image needs to be magnified four times, and two processing units may be stacked when it needs to be magnified sixteen times. In this way, the scaling factor can be conveniently modified by setting the number of processing units as required, i.e., the magnification of the image can be flexibly adjusted.
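This unit-count rule is just a base-4 logarithm of the pixel magnification, as the following sketch (an illustration, not part of the disclosure) shows.

```python
import math

def num_processing_units(pixel_magnification: int) -> int:
    """One unit per 4x in pixel count: 4x -> 1 unit, 16x -> 2 units."""
    n = round(math.log(pixel_magnification, 4))
    if n < 1 or 4 ** n != pixel_magnification:
        raise ValueError("pixel magnification must be a positive power of 4")
    return n
```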
In the case that the super-division model includes one processing unit, as shown in fig. 1, feature extraction may be performed on the first acquired image through the sparse convolution network of the processing unit to obtain the spatial-domain features of the first acquired image, and the spatio-temporal features of the first acquired image may be obtained by adjusting these spatial features with the temporal-domain features determined based on the acquired image sequence, so that the spatio-temporal features can be up-sampled through the up-sampling network of the processing unit to obtain the first super-division image corresponding to the first acquired image.
In case the super-division model comprises at least two processing units, the image processing method may further comprise: taking the super-division image output by the previous processing unit as the input of the sparse convolution network of the next processing unit. That is, the processing object of the sparse convolution network of the first processing unit is the first acquired image, while the processing object of the sparse convolution network of each subsequent processing unit is the super-division image output by the previous processing unit.
In addition, the image processing method may further include: determining the spatio-temporal features input to the up-sampling network of the current processing unit according to the temporal-domain features used by the previous processing unit and the spatial-domain features output by the sparse convolution network of the current processing unit. That is, the temporal features of the first acquired image determined based on the acquired image sequence may be used to adjust the spatial features output by the sparse convolution network in each processing unit to obtain the corresponding spatio-temporal features.
Fig. 3 shows a schematic diagram of an image processing method provided according to an embodiment of the present disclosure. As shown in fig. 3, the time domain features of the first acquired image may be determined based on the acquired image sequence, and the time domain features may be used to adjust the spatial domain features of the sparse convolutional network output by the first processing unit to obtain the spatial-temporal features of the up-sampling network input to the first processing unit; taking the super-resolution image output by the up-sampling network of the first processing unit as the input of the sparse convolution network of the second processing unit, and adjusting the spatial domain characteristics output by the sparse convolution network of the second processing unit by utilizing the time domain characteristics to obtain the spatial-temporal characteristics of the up-sampling network input to the second processing unit; and the like until a first super-resolution image corresponding to the first acquired image is obtained. Therefore, under the condition that more than one processing units are used, the spatial domain characteristics are repeatedly adjusted by utilizing the time domain characteristics of the image to be superseparated, the image definition can be better maintained while high-magnification is carried out, and the problems of blurring, sawtooth and artifact are avoided.
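Putting the pieces together, the stacked structure of fig. 3 can be sketched as below, reusing the SparseConvBlock, UpsamplingNetwork, and fuse_spatiotemporal sketches defined earlier; the interpolation of the temporal feature to each unit's feature shape is an added assumption, since the disclosure only states that the same temporal features adjust every unit's spatial features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedSuperDivisionModel(nn.Module):
    """Chain of processing units: each unit's super-division output feeds the
    next unit's sparse convolution network (see fig. 3)."""
    def __init__(self, num_units: int):
        super().__init__()
        self.blocks = nn.ModuleList(SparseConvBlock() for _ in range(num_units))
        self.upsamplers = nn.ModuleList(UpsamplingNetwork() for _ in range(num_units))

    def forward(self, image: torch.Tensor, temporal_feat: torch.Tensor) -> torch.Tensor:
        for block, up in zip(self.blocks, self.upsamplers):
            spatial = block(image)                       # spatial-domain features
            # Assumed: temporal feature already has matching channels; resize
            # it to this unit's spatial resolution before the adjustment.
            t = F.interpolate(temporal_feat, size=spatial.shape[-2:],
                              mode="bilinear", align_corners=False)
            image = up(fuse_spatiotemporal(spatial, t))  # feeds the next unit
        return image                                     # first super-division image
```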
The super-division model used in steps S201 to S203 may be a super-division model trained in advance. The super-division model may include a sparse convolution network and an up-sampling network, so the training process of the super-division model may include a training process of the sparse convolution network and a training process of the up-sampling network.
In one possible implementation, the training process of the sparse convolution network of the super-division model may at least include: initializing the network parameters of the sparse convolution network; determining the gradient of an objective function based on the input image sequence of the current network layer of the sparse convolution network, the weight matrix of the current network layer, and the optimization parameters, where the input image sequence is derived from a training sample set, the training sample set is set according to the application scenario of the super-division model, and the objective function represents a coding length function related to the input image sequence and the optimization parameters; and, along the gradient direction, forward updating each network layer of the sparse convolution network with the goal of maximizing the objective function until a sparse convolution network satisfying a preset condition is obtained, where the preset condition is that the number of gradient updates reaches a first threshold, or that the value of the objective function is equal to or greater than a second threshold. In this way, a white-box model is obtained through forward-updating training, which reduces the time for training and modifying the model and yields a strongly interpretable model in which the reason for each network layer's weight update is known, facilitating manual modification by the user; meanwhile, setting different preset conditions helps flexibly select an appropriate condition according to actual requirements to obtain the trained super-division model.
In one example, each network layer of the sparse convolutional network may be compressed and optimized forward by maximal coding rate compression. In other words, the update of the sparse convolutional network may be implemented as follows: the samples of the training sample set are compressed forward, and the weights of the current network layer are optimized and updated, where each sample in the training sample set updates each network layer of the sparse convolutional network once, in a forward-update manner. In this way, the white-box model obtained by forward-update training has stronger interpretability: the user knows which layer is currently being updated and whether that layer captures shallow features or high-dimensional features, so model hyper-parameters can be modified manually to tune the model to the greatest extent.
First, the network parameters of the sparse convolutional network may be initialized. Second, the gradient $\nabla L$ of the objective function may be determined based on the input image sequence of the current network layer of the sparse convolutional network, the weight matrix $W_{\ell}$ of the current network layer, and the optimization parameter $\varepsilon$, where the objective function $L(Z, \varepsilon)$ may be determined by the following formula:

Formula one:

$$L(Z, \varepsilon) = \frac{n + d}{2} \log\det\!\left(I + \frac{d}{n\varepsilon^{2}} Z Z^{\top}\right)$$

In the formula, $L(Z, \varepsilon)$ represents the coding length function of the input image sequence $Z$ and the optimization parameter $\varepsilon$; $Z = [z_{1}, z_{2}, \ldots, z_{n}] \in \mathbb{R}^{d \times n}$ represents the vectorized input image sequence, each column $z_{i}$ being the unit vector in the direction of the corresponding vectorized input image ($\lVert z_{i} \rVert = 1$); $\varepsilon$ represents a random parameter; $n$ represents the number of samples in the training sample set; $d$ represents the dimension of the image sequence; $I$ represents the identity matrix. The gradient $\nabla L$ of the objective function can thus be determined by the following formula:

Formula two:

$$\nabla L = \frac{\partial R}{\partial Z_{\ell}} - \frac{\partial R_{c}}{\partial Z_{\ell}} = \frac{d}{n\varepsilon^{2}}\left(I + \frac{d}{n\varepsilon^{2}} Z_{\ell} Z_{\ell}^{\top}\right)^{\!-1} Z_{\ell} \;-\; \sum_{j} \frac{\operatorname{tr}(\Pi_{j})}{n} \cdot \frac{d}{\operatorname{tr}(\Pi_{j})\,\varepsilon^{2}}\left(I + \frac{d}{\operatorname{tr}(\Pi_{j})\,\varepsilon^{2}} Z_{\ell} \Pi_{j} Z_{\ell}^{\top}\right)^{\!-1} Z_{\ell} \Pi_{j}$$

In the formula, $\nabla L$ represents the gradient of the objective function; $Z_{\ell} = W_{\ell} X$, where $X$ represents the input of the current network layer of the sparse convolutional network and $W_{\ell}$ represents the weight matrix of the current, $\ell$-th network layer; $\varepsilon$ represents the optimization parameter, i.e., the random parameter; $I$ represents the identity matrix; $d$ represents the dimension of the image sequence; $\operatorname{tr}(\Pi_{j})$ represents the trace of the diagonal weight (membership) matrix of the $j$-th class at the $\ell$-th network layer; $R$ represents the coding rate of all features; $R_{c}$ represents the coding rate of a given class of features.
Then, along the gradient direction, each network layer of the sparse convolutional network is updated forward with the goal of maximizing the objective function (i.e., maximizing $R$ while minimizing $R_{c}$), until a sparse convolutional network satisfying the preset condition is obtained. Combining formula one and formula two: when the representation of the samples is optimal, the space occupied by the sample representation is maximal, i.e., the whole sample set has the maximum coding length, so $R$ is maximized; for data of mixed categories, when the coding length required to encode the data category by category is minimized, samples belonging to the same structure are drawn close to each other, i.e., $R_{c}$ is minimized.
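To make the forward update concrete, the following NumPy sketch performs one rate-reduction-style gradient step per layer, assuming the reconstructed formulas above; the learning rate, the number of layers, the diagonal membership matrices `Pi`, and the renormalization of features are illustrative choices, not prescriptions of the patent.

```python
import numpy as np

def expansion_grad(Z, eps=0.5):
    # dR/dZ for R = 1/2 logdet(I + a Z Z^T), with a = d / (n eps^2)
    d, n = Z.shape
    a = d / (n * eps ** 2)
    return a * np.linalg.inv(np.eye(d) + a * Z @ Z.T) @ Z

def compression_grad(Z, Pi, eps=0.5):
    # dRc/dZ summed over classes; Pi[j] is the diagonal membership matrix of class j
    d, n = Z.shape
    grad = np.zeros_like(Z)
    for P in Pi:
        tr = np.trace(P)
        a = d / (tr * eps ** 2)
        grad += (tr / n) * a * np.linalg.inv(np.eye(d) + a * Z @ P @ Z.T) @ Z @ P
    return grad

def forward_train(X, Pi, num_layers=4, lr=0.1, eps=0.5):
    """Build the network front-to-back: each iteration constitutes one layer whose
    update ascends R and descends Rc -- no backpropagation is used."""
    Z = X / np.linalg.norm(X, axis=0, keepdims=True)  # columns as unit vectors
    for _ in range(num_layers):
        G = expansion_grad(Z, eps) - compression_grad(Z, Pi, eps)
        Z = Z + lr * G                                 # forward gradient step
        Z = Z / np.linalg.norm(Z, axis=0, keepdims=True)
    return Z

# Toy usage: 16-dimensional features, 100 samples, 2 classes.
X = np.random.randn(16, 100)
labels = np.random.randint(0, 2, size=100)
Pi = [np.diag((labels == j).astype(float)) for j in range(2)]
Z = forward_train(X, Pi)
```

Because each layer's update is an explicit function of the current features, every weight change can be traced back to the expansion and compression terms, which is the source of the interpretability claimed above.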
In one possible implementation, the training process of the up-sampling network of the super-resolution model may be configured freely around the sub-pixel convolution layer, which is not limited by the embodiments of the present disclosure.
Fig. 4 shows a schematic diagram of a process for obtaining a super-resolution model provided according to an embodiment of the present disclosure. In one example, as shown in fig. 4, the process of obtaining the super-resolution model may include:
First, the application scenario is determined. The inputs of the super-resolution model differ across application scenarios. When the super-resolution model is used for a game rendering scene, the input of the model may include the first acquired image together with jitter offset information and motion vectors associated with the first acquired image. When the super-resolution model is used for a video enhancement scene, the input of the model may include the first acquired image and optical flow information associated with the first acquired image. Fig. 5 shows a schematic diagram of an image processing method provided according to an embodiment of the present disclosure. As shown in fig. 5, the inputs of this example may be the (T-1)-th frame acquired image (i.e., the second acquired image), the T-th frame acquired image (i.e., the first acquired image), and the (T+1)-th frame acquired image (i.e., the third acquired image).
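As an illustration only, the scene-dependent inputs can be collected in a small container; the field and function names below are hypothetical, not the patent's interface.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class SRInput:
    """Model input whose temporal side information depends on the scene (hypothetical)."""
    lr_frame: torch.Tensor                         # first acquired image (T-th frame)
    optical_flow: Optional[torch.Tensor] = None    # video enhancement scene
    jitter_offset: Optional[torch.Tensor] = None   # game rendering scene
    motion_vectors: Optional[torch.Tensor] = None  # game rendering scene

def make_input(scene: str, lr_frame: torch.Tensor, **aux) -> SRInput:
    if scene == "video":
        return SRInput(lr_frame, optical_flow=aux["optical_flow"])
    if scene == "game":
        return SRInput(lr_frame, jitter_offset=aux["jitter_offset"],
                       motion_vectors=aux["motion_vectors"])
    raise ValueError(f"unknown scene: {scene}")
```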
Then, the feature extraction layer and the up-sampling layer are designed. The feature extraction layer may be used to extract the spatial and temporal features of the first acquired image. The up-sampling layer may be used to obtain the first super-resolution image corresponding to the first acquired image.
Next, the structure of the super-resolution model is designed. The structure of the model (e.g., the number of processing units and/or the way feature extraction and up-sampling are performed) may be designed according to different resolution requirements. In this example, the super-resolution model may be designed to include one processing unit, where the feature extraction part of the processing unit may be implemented by a sparse convolutional network and the up-sampling part may be implemented by a sub-pixel convolutional network.
Then, inference is run on the super-resolution model and it is trained. Whether the trained super-resolution model satisfies the preset condition is judged; if it does, the trained super-resolution model is output and can be used to perform image super-resolution tasks in the corresponding application scenario; if it does not, the super-resolution model is fine-tuned and optimized until it satisfies the preset condition. The preset condition can be set reasonably according to different resolution requirements. In this example, the preset condition may be that the number of updates equals the number of samples in the training sample set, i.e., the model is trained once with each sample.
Fig. 6 shows a schematic diagram of a super-resolution task execution process provided according to an embodiment of the present disclosure. As shown in fig. 6, in this example, an input of size (channels, image width, image height) is fed into the super-resolution model; feature extraction is then performed and combined with the other inputs to obtain a feature map of size (channels × r², image width, image height); up-sampling is then performed based on the feature map to output an image of size (channels, image width × r, image height × r). Here r is the magnification of the image width and height, not the magnification of the overall resolution: for example, with an input size of 100 × 100 and r = 4, the output size is 400 × 400, so the number of pixels increases 16-fold; the number of channels of all images is 3. The convolution function of the neural network is used when computing the feature map: during convolution, the number of channels may be increased from 3 to 3 × r², and pixel rearrangement (PixelShuffle) then reduces the number of channels back to 3. This combination is also called sub-pixel convolution.
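This sub-pixel convolution can be sketched in a few lines of PyTorch; the layer sizes below simply reproduce the worked example (3 channels, r = 4) and are otherwise illustrative.

```python
import torch
import torch.nn as nn

# Sub-pixel convolution: a convolution expands channels from C to C * r^2,
# then PixelShuffle rearranges them into an r-times larger image.
r = 4
subpixel = nn.Sequential(
    nn.Conv2d(3, 3 * r * r, kernel_size=3, padding=1),  # 3 -> 48 channels
    nn.PixelShuffle(r),                                  # (48, H, W) -> (3, 4H, 4W)
)

x = torch.randn(1, 3, 100, 100)
y = subpixel(x)
print(y.shape)  # torch.Size([1, 3, 400, 400]) -- 16x the pixels for r = 4
```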
Fig. 7 shows a schematic diagram of a running platform provided according to an embodiment of the present disclosure. After training of the super-resolution model is completed, the model package obtained after training is deployed to the running platform shown in fig. 7; that is, the model file can be loaded into the software of the Super-Resolution (SR) SDK to run, relying on an underlying AI engine, driver, and graphics processing unit (GPU), so that the user can perform super-resolution tasks in an operating system (Linux/Windows/iOS). The SR SDK may include a super-resolution core management process (SR Manager) and video super-resolution models (Video Super Resolution Models, VSR Models)/game super-resolution models (Game Super Resolution Models, GSR Models), and the AI engine may include an accelerated inference engine (NCNN/TensorRT). The running platform provides a calling interface to implement video enhancement tasks or game rendering tasks, and adopts an efficient multi-stage pipeline architecture as a basis for their efficient execution.
Fig. 8 shows a schematic diagram of a video enhancement task execution process provided according to an embodiment of the present disclosure. In one example, as shown in fig. 8, a process for performing a video enhancement task using the running platform may include: acquiring optical flow information and an LR image of the video to be rendered from the rendering buffer (Render Buffer), inputting them into the super-resolution model to generate an HR image, and then passing the HR image to a video decoder for output to the stream buffer (Stream Buffer) for subsequent data reading and writing.
Fig. 9 shows a schematic diagram of a game rendering task execution process provided according to an embodiment of the present disclosure. In yet another example, as shown in fig. 9, a process for performing a game rendering task using the running platform may include: acquiring jitter offset information, motion vectors, and an LR image of the game to be rendered from the rendering buffer (Render Buffer), inputting them into the super-resolution model to generate an HR image, and outputting the HR image to the stream buffer (Stream Buffer) after processing such as high dynamic range imaging, for subsequent data reading and writing.
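The two pipelines of figs. 8 and 9 can be summarized in one hedged sketch. The buffer interfaces, key names, and model signature below are assumptions for illustration; the real SR SDK interfaces are not specified in this document.

```python
import torch

def run_sr_task(scene, render_buffer, model, stream_buffer):
    """Sketch of the multi-stage pipeline: render buffer -> model -> stream buffer.

    `render_buffer` is assumed dict-like and `stream_buffer` queue-like;
    downstream steps (video decoding, HDR processing) are omitted.
    """
    lr = render_buffer["lr_image"]
    if scene == "video":
        aux = (render_buffer["optical_flow"],)
    else:  # game rendering
        aux = (render_buffer["jitter_offset"], render_buffer["motion_vectors"])
    with torch.no_grad():        # inference only on the deployed model
        hr = model(lr, *aux)
    stream_buffer.put(hr)        # hand off for subsequent reading and writing
    return hr
```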
The embodiment of the disclosure also provides an image processing apparatus. The image processing apparatus may be applied to a super-resolution model. The super-resolution model may include at least one processing unit, each of which may include at least a sparse convolutional network for feature extraction using sparse convolution kernels and an up-sampling network for up-sampling.
Fig. 10 shows a block diagram of an image processing apparatus provided according to an embodiment of the present disclosure. As shown in fig. 10, the apparatus 500 may include a feature extraction module 501, a feature determination module 502, and an up-sampling module 503. The feature extraction module 501 may be configured to input a first acquired image into the sparse convolutional network, extract spatial domain features of the first acquired image, and determine temporal features of the first acquired image based on an acquired image sequence, wherein the first acquired image represents an acquired image to be super-resolved in the acquired image sequence. The feature determination module 502 may be configured to determine spatio-temporal features of the first acquired image based on the spatial features and the temporal features. The up-sampling module 503 may be configured to input the spatio-temporal features of the first acquired image into the up-sampling network to obtain a first super-resolution image corresponding to the first acquired image, wherein the resolution of the first super-resolution image is higher than that of the first acquired image.
In this way, the spatial domain features of the image to be super-resolved are extracted through the sparse convolutional network of the super-resolution model, the temporal features of the image are determined from the acquired image sequence, the spatio-temporal features are then determined from the spatial and temporal features, and the spatio-temporal features are up-sampled by the up-sampling network of the super-resolution model to obtain the corresponding super-resolution image. Because the super-resolution processing is based on feature information reflecting the image in both the spatial and temporal dimensions, the finally generated super-resolution image maintains spatio-temporal continuity, avoiding flickering, blurring, and artifact problems and keeping the frame rate stable and the picture smooth; meanwhile, the super-resolution model has high operating efficiency, enabling real-time image processing.
In one possible implementation, the number of processing units in the super-resolution model is determined by the magnification required for the super-resolution processing. In case the super-resolution model comprises at least two of the processing units, the apparatus further comprises: a processing module configured to take the super-resolution image output by the previous processing unit as the input of the sparse convolutional network of the next processing unit; and a determining module configured to determine the spatio-temporal features input to the up-sampling network of the current processing unit based on the temporal features used by the previous processing unit and the spatial features output by the sparse convolutional network of the current processing unit.
In this way, by adaptively setting the number of processing units, the scaling factor can be conveniently modified, i.e., the magnification of the image can be flexibly adjusted; and when more than one processing unit is used, the spatial features are repeatedly adjusted with the temporal features of the image to be super-resolved, so that image sharpness is better preserved at high magnification and blurring, aliasing, and artifact problems are avoided.
In one possible implementation, the time domain features are different in different application scenarios.
In this way, different temporal features are determined in different application scenes, so that the spatio-temporal features input to the subsequent up-sampling network also differ across scenes, realizing scene-specific super-resolution tasks.
In one possible implementation, where the super-resolution model is used for a video enhancement scene, the determining temporal features of the first acquired image based on a sequence of acquired images includes: determining optical flow information of the first acquired image relative to a second acquired image and a third acquired image according to the acquired image sequence, wherein the second acquired image represents an acquired image located before the first acquired image in the acquired image sequence, and the third acquired image represents an acquired image located after the first acquired image in the acquired image sequence; and determining the temporal features of the first acquired image according to the optical flow information.
In this way, the temporal features of the image to be super-resolved are determined from its optical flow information, and these temporal features are used to adjust and optimize the subsequent feature maps, so that the generated super-resolution image better fits the video enhancement scene and the super-resolution task is targeted.
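For illustration, dense optical flow of the T-th frame relative to its neighbouring frames could be computed with OpenCV's Farneback method; this is one possible choice, as the patent does not prescribe a particular optical-flow algorithm, and the function name below is illustrative.

```python
import cv2
import numpy as np

def temporal_features_video(prev_gray, curr_gray, next_gray):
    """Dense optical flow of the current frame relative to its neighbours.

    Inputs are single-channel (grayscale) frames; each flow field has shape
    (H, W, 2), so the stacked temporal feature map has shape (H, W, 4).
    """
    flow_prev = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flow_next = cv2.calcOpticalFlowFarneback(
        curr_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return np.concatenate([flow_prev, flow_next], axis=-1)
```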
In one possible implementation, where the super-resolution model is used for a game rendering scene, the determining temporal features of the first acquired image based on a sequence of acquired images includes: determining jitter offset information and motion vectors of the first acquired image relative to a second acquired image and a third acquired image according to the acquired image sequence, wherein the second acquired image represents an acquired image located before the first acquired image in the acquired image sequence, and the third acquired image represents an acquired image located after the first acquired image in the acquired image sequence; and determining the temporal features of the first acquired image according to the jitter offset information and the motion vectors.
In this way, the temporal features of the image to be super-resolved are determined from its jitter offset information and motion vectors, and these temporal features are used to adjust and optimize the subsequent feature maps, so that the generated super-resolution image better fits the game rendering scene and the super-resolution task is targeted.
In one possible implementation, the determining the spatio-temporal features of the first acquired image based on the spatial domain features and the temporal features includes: adjusting the spatial domain features using the temporal features to obtain the spatio-temporal features of the first acquired image.
In this way, the spatial domain features are adjusted and optimized using the temporal features, which ensures that the spatio-temporal features reflect the feature information of the image to be super-resolved in both the spatial and temporal dimensions, thereby ensuring the spatio-temporal continuity of the subsequently generated super-resolution image.
In one possible implementation, the sparse convolutional network includes at least a first sparse convolution layer, a 1×1 convolution layer, and a second sparse convolution layer.
Thus, by inserting a 1×1 convolution layer between the two sparse convolution layers, an effect similar to normalization is obtained and the amount of computation is reduced, which helps complete the super-resolution task efficiently and lowers energy consumption during running.
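A minimal sketch of this three-layer structure follows, with ordinary convolutions standing in for true sparse convolutions; channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class SparseFeatureBlock(nn.Module):
    """Sketch of the 'sparse conv -> 1x1 conv -> sparse conv' structure.

    The 1x1 convolution mixes channels cheaply between the two (stand-in)
    sparse convolutions, giving a normalization-like effect at low cost.
    """
    def __init__(self, in_ch: int = 3, mid_ch: int = 32, out_ch: int = 32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),   # first (sparse) conv
            nn.Conv2d(mid_ch, mid_ch, kernel_size=1),             # 1x1 channel mixing
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),  # second (sparse) conv
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)
```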
In one possible implementation, the upsampling network includes a sub-pixel convolution layer.
In this way, up-sampling is achieved through sub-pixel convolution: more texture information can be retained in the low-resolution space, a better reconstruction effect can be obtained, and pixel rearrangement is efficient, fast, and parameter-free.
In one possible implementation manner, the training process of the super-resolution model at least includes: initializing network parameters of the sparse convolutional network; determining a gradient of an objective function based on an input image sequence of the current network layer of the sparse convolutional network, the weight matrix of the current network layer, and optimization parameters, wherein the input image sequence is derived from a training sample set, the training sample set is set according to the application scene of the super-resolution model, and the objective function represents a coding length function of the input image sequence and the optimization parameters; and, along the gradient direction, updating each network layer of the sparse convolutional network forward with the goal of maximizing the objective function, until a sparse convolutional network satisfying the preset condition is obtained.
Thus, the white-box model obtained by training in the forward-update manner reduces the time needed to train and modify the model and yields a model with strong interpretability, making it easy to understand why the weights of each network layer are updated and facilitating manual modification by the user.
In a possible implementation manner, the preset condition is that the number of times of gradient updating reaches a first threshold value, or the preset condition is that the value of the objective function is equal to or greater than a second threshold value.
Thus, by allowing different preset conditions, an appropriate condition can be flexibly selected according to actual requirements to obtain the trained super-resolution model.
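The two alternative preset conditions can be expressed as a simple stopping test; the threshold values below are illustrative placeholders.

```python
def should_stop(num_gradient_updates: int, objective_value: float,
                first_threshold: int = 1000, second_threshold: float = 1e6) -> bool:
    """Training ends when either preset condition holds: the number of gradient
    updates reaches the first threshold, or the objective (coding length)
    reaches or exceeds the second threshold."""
    return (num_gradient_updates >= first_threshold
            or objective_value >= second_threshold)
```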
In some embodiments, functions or modules included in the image processing apparatus provided by the embodiments of the present disclosure may be used to perform the method described in the embodiments of the image processing method, and specific implementation thereof may refer to the description of the embodiments of the image processing method, which is not repeated herein for brevity.
The embodiment of the disclosure also provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described image processing method when executing the instructions stored in the memory.
In some embodiments, functions or modules included in the electronic device provided by the embodiments of the present disclosure may be used to perform the methods described in the embodiments of the image processing methods, and specific implementations thereof may refer to the descriptions of the embodiments of the image processing methods, which are not described herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described image processing method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
In some embodiments, functions or modules included in the computer readable storage medium provided by the embodiments of the present disclosure may be used to perform the methods described in the embodiments of the image processing methods, and specific implementations thereof may refer to the descriptions of the embodiments of the image processing methods above, which are not repeated herein for brevity.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when executed in a processor of an electronic device, performs the above-described image processing method.
In some embodiments, a function or a module included in a computer program product provided by the embodiments of the present disclosure may be used to perform a method described in the embodiments of the image processing method, and a specific implementation of the method may refer to the description of the embodiments of the image processing method, which is not repeated herein for brevity.
Fig. 11 shows a block diagram of an apparatus for performing an image processing method according to an embodiment of the present disclosure. For example, the apparatus 1900 may be provided as a server or terminal device. Referring to FIG. 11, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the image processing methods described above.
The apparatus 1900 may further comprise a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: a portable computer disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disk (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or an in-groove protrusion structure having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (11)
1. An image processing method, characterized in that the method is applied to a super-resolution model, the super-resolution model comprising at least one processing unit, each processing unit comprising a sparse convolution network for feature extraction with a sparse convolution kernel and an upsampling network for upsampling, the method comprising:
inputting a first acquired image into the sparse convolution network, extracting spatial domain features of the first acquired image, and determining time domain features of the first acquired image based on an acquired image sequence and the application scene to which the super-resolution model is applied, wherein the first acquired image represents an acquired image to be super-resolved in the acquired image sequence, and the time domain features under different application scenes are different;
Determining spatio-temporal features of the first acquired image based on the spatial features and the temporal features;
inputting the space-time characteristics of the first acquired image into the up-sampling network to obtain a first super-resolution image corresponding to the first acquired image, wherein the resolution of the first super-resolution image is higher than that of the first acquired image;
the number of the processing units in the super-resolution model is determined by the magnification required by the super-resolution processing; in case the super-resolution model comprises at least two of the processing units, the method further comprises: taking the super-resolution image output by the previous processing unit as the input of the sparse convolution network of the next processing unit; and determining the spatio-temporal features input to the up-sampling network of the current processing unit according to the time domain features used by the previous processing unit and the spatial features output by the sparse convolution network of the current processing unit.
2. The method of claim 1, wherein, in the case where the super-resolution model is used for a video enhancement scene, the determining temporal features of the first acquired image based on a sequence of acquired images and an application scene to which the super-resolution model is applied comprises:
Determining optical flow information of the first acquired image relative to a second acquired image and a third acquired image according to the acquired image sequence, wherein the second acquired image represents an acquired image positioned in front of the first acquired image in the acquired image sequence, and the third acquired image represents an acquired image positioned behind the first acquired image in the acquired image sequence;
and determining the time domain characteristics of the first acquired image according to the optical flow information.
3. The method of claim 1, wherein, in the case where the super-resolution model is used for a game rendering scene, the determining the temporal features of the first acquired image based on the sequence of acquired images and the application scene to which the super-resolution model is applied comprises:
determining jitter offset information and motion vectors of the first acquired image relative to a second acquired image and a third acquired image according to the acquired image sequence, wherein the second acquired image represents an acquired image positioned before the first acquired image in the acquired image sequence, and the third acquired image represents an acquired image positioned after the first acquired image in the acquired image sequence;
And determining the time domain characteristics of the first acquired image according to the jitter offset information and the motion vector.
4. The method of claim 1, wherein the determining the spatio-temporal characteristics of the first acquired image based on the spatial and temporal characteristics comprises:
and adjusting the spatial domain features by using the time domain features to obtain the spatial-temporal features of the first acquired image.
5. The method of claim 1, wherein the sparse convolution network comprises at least a first sparse convolution layer, a 1×1 convolution layer, and a second sparse convolution layer.
6. The method of claim 1, wherein the upsampling network comprises a sub-pixel convolution layer.
7. The method according to claim 1, wherein the training process of the super-resolution model comprises at least:
initializing network parameters of the sparse convolution network;
determining a gradient of an objective function based on an input image sequence of a current network layer of the sparse convolution network, a weight matrix of the current network layer and optimization parameters, wherein the input image sequence is derived from a training sample set, the training sample set is set according to an application scene of the super-resolution model, and the objective function represents a coding length function related to the input image sequence and the optimization parameters;
And along the gradient direction, forward updating each network layer of the sparse convolution network with the objective function maximization as a target until the sparse convolution network meeting the preset condition is obtained.
8. The method of claim 7, wherein the preset condition is that the number of gradient updates reaches a first threshold, or the preset condition is that the value of the objective function is equal to or greater than a second threshold.
9. An image processing apparatus, the apparatus being applied to a super-resolution model comprising at least one processing unit, each processing unit comprising a sparse convolution network for feature extraction using a sparse convolution kernel and an upsampling network for upsampling, the apparatus comprising:
the feature extraction module is configured to input a first acquired image into the sparse convolution network, extract spatial domain features of the first acquired image, and determine time domain features of the first acquired image based on an acquired image sequence and the application scene to which the super-resolution model is applied, wherein the first acquired image represents an acquired image to be super-resolved in the acquired image sequence, and the time domain features under different application scenes are different;
A feature determination module configured to determine spatiotemporal features of the first acquired image based on the spatial features and the temporal features;
the up-sampling module is configured to input the space-time characteristics of the first acquired image into the up-sampling network to obtain a first super-resolution image corresponding to the first acquired image, wherein the resolution of the first super-resolution image is higher than that of the first acquired image;
the number of the processing units in the super-resolution model is determined by the magnification required by the super-resolution processing; in case the super-resolution model comprises at least two of the processing units, the apparatus is further to: take the super-resolution image output by the previous processing unit as the input of the sparse convolution network of the next processing unit; and determine the spatio-temporal features input to the up-sampling network of the current processing unit according to the time domain features used by the previous processing unit and the spatial features output by the sparse convolution network of the current processing unit.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to implement the image processing method of any one of claims 1 to 8 when executing the instructions stored by the memory.
11. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the image processing method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310926580.6A CN116681594B (en) | 2023-07-26 | 2023-07-26 | Image processing method and device, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310926580.6A CN116681594B (en) | 2023-07-26 | 2023-07-26 | Image processing method and device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116681594A CN116681594A (en) | 2023-09-01 |
CN116681594B true CN116681594B (en) | 2023-11-21 |
Family
ID=87789427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310926580.6A Active CN116681594B (en) | 2023-07-26 | 2023-07-26 | Image processing method and device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116681594B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019145767A1 (en) * | 2018-01-25 | 2019-08-01 | King Abdullah University Of Science And Technology | Deep-learning based structure reconstruction method and apparatus |
CN112529776A (en) * | 2019-09-19 | 2021-03-19 | 中移(苏州)软件技术有限公司 | Training method of image processing model, image processing method and device |
KR20210041694A (en) * | 2019-10-07 | 2021-04-16 | 한국항공대학교산학협력단 | Method and apparatus for upscaling image |
CN113177888A (en) * | 2021-04-27 | 2021-07-27 | 北京有竹居网络技术有限公司 | Hyper-resolution restoration network model generation method, image hyper-resolution restoration method and device |
CN116156172A (en) * | 2021-11-23 | 2023-05-23 | 广州视源电子科技股份有限公司 | Video processing method, device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN116681594A (en) | 2023-09-01 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |