CN111932459A - Video image processing method and device, electronic equipment and storage medium - Google Patents

Video image processing method and device, electronic equipment and storage medium

Info

Publication number
CN111932459A
CN111932459A (application number CN202010795380.8A)
Authority
CN
China
Prior art keywords
frames
frame
image
module
image frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010795380.8A
Other languages
Chinese (zh)
Inventor
李兴龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010795380.8A priority Critical patent/CN111932459A/en
Publication of CN111932459A publication Critical patent/CN111932459A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 3/00 Geometric image transformation in the plane of the image
                    • G06T 3/40 Scaling the whole image or part thereof
                        • G06T 3/4038 Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
                        • G06T 3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
                • G06T 5/00 Image enhancement or restoration
                    • G06T 5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10 Image acquisition modality
                        • G06T 2207/10016 Video; Image sequence
                    • G06T 2207/20 Special algorithmic details
                        • G06T 2207/20081 Training; Learning
                        • G06T 2207/20084 Artificial neural networks [ANN]
                        • G06T 2207/20212 Image combination
                            • G06T 2207/20221 Image fusion; Image merging

Abstract

The application discloses a video image processing method and device, an electronic device and a storage medium. The method includes: acquiring N first image frames; performing super-resolution reconstruction on each of the N first image frames by using a first network module to obtain N second image frames; and performing alignment and fusion processing on the N second image frames by using a second network module to obtain a result frame.

Description

Video image processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a video image, an electronic device, and a storage medium.
Background
For video image processing, some schemes simply feed multiple image frames of a video directly into a network module for training. Because blur, noise, downsampling, compression and other degradations are present in these image frames, training the network module on them becomes more difficult, and the finally obtained video images cannot achieve a good display effect.
Disclosure of Invention
In order to solve the foregoing technical problem, embodiments of the present application provide a method and an apparatus for processing a video image, an electronic device, and a storage medium.
The embodiment of the application provides a video image processing method, which comprises the following steps:
acquiring N frames of first image frames;
performing super-resolution reconstruction on each frame of the N frames of first image frames by using a first network module to obtain N frames of second image frames;
and aligning and fusing the N frames of second image frames by using a second network module to obtain a result frame.
An embodiment of the present application further provides a device for processing a video image, where the device includes:
the input module is used for acquiring N first image frames and inputting the N first image frames into the first network module;
the first network module is used for performing super-resolution reconstruction on each frame of first image frames in the N frames of first image frames to obtain N frames of second image frames;
and the second network module is used for carrying out alignment fusion processing on the N frames of second image frames to obtain a result frame.
An embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is enabled to execute the method for processing a video image according to the foregoing embodiment.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the video image processing method according to the foregoing embodiment.
According to the technical scheme of the embodiment of the application, N first image frames are acquired; super-resolution reconstruction is performed on each of the N first image frames by using a first network module to obtain N second image frames; and the N second image frames are aligned and fused by using a second network module to obtain a result frame. In this way, the first network module can perform super-resolution reconstruction on the acquired low-quality N first image frames, which reduces the noise and blur in the N first image frames and improves the quality of each of them. After the first network module outputs the high-quality N second image frames, the second network module processes these frames in isolation from the noise contained in the initial N first image frames, so the second network module can concentrate on fusing and exploiting the effective information in the multiple second image frames, which improves the quality of the finally obtained image frame. In addition, splitting the image processing of the N first image frames into two sub-networks, the first network module and the second network module, improves the flexibility of the whole image processing task.
Drawings
Fig. 1 illustrates several video image processing methods provided in an embodiment of the present application;
FIG. 2 is a schematic diagram showing the display effect of 3 frames of images photographed by a terminal device;
fig. 3 is a schematic flowchart of a video image processing method according to an embodiment of the present disclosure;
fig. 4 is a first schematic view of a video image processing flow provided in the embodiment of the present application;
fig. 5 is a second schematic view of a video image processing flow provided in the embodiment of the present application;
fig. 6 is a first schematic diagram of a first network module according to an embodiment of the present disclosure;
fig. 7 is a second schematic diagram of a first network module according to an embodiment of the present application;
fig. 8 is a first schematic diagram of a second network module according to an embodiment of the present application;
fig. 9 is a second schematic diagram of a second network module according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a video image processing apparatus according to an embodiment of the present application.
Detailed Description
So that the manner in which the features and aspects of the present application can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.
Fig. 1 illustrates several video image processing methods listed in the embodiment of the present application. As shown in fig. 1, video image processing algorithms mainly fall into two categories: algorithms based on a Convolutional Neural Network (CNN), and conventional (non-CNN) algorithms. Current research on video image processing generally follows one of two ideas. One is to treat each image frame in the video as an independent image, reconstruct it, and then splice the reconstructed frames together into the final result video. The other is to exploit the temporal characteristics of the video: when a certain image frame is reconstructed, the algorithm also uses several image frames adjacent to it in the video.
For processing of video images, one particular approach uses a convolutional neural network with two-stage motion compensation. The scheme specifically includes: 1. receiving a video having a first plurality of frames, the first plurality of frames having a first resolution; 2. generating a plurality of warped frames from the first plurality of frames based on a first type of motion compensation; 3. generating a second plurality of frames having a second resolution, the second resolution being higher than the first resolution; 4. obtaining each frame of the second plurality of frames from a subset of the plurality of warped frames using a convolutional neural network; 5. generating a third plurality of frames having the second resolution based on a second type of motion compensation; 6. obtaining each frame of the third plurality of frames from a fusion of a subset of the second plurality of frames. This scheme needs to perform motion compensation on both the low-resolution and the high-resolution image frames, which increases the complexity of the algorithm and leads to poor convergence of the network. Moreover, this scheme is not suitable for scenes that require 1:1 video quality enhancement.
In another specific scheme, reconstruction of the video image is mainly based on residual learning and implicit motion compensation. The scheme specifically includes: 1. training convolutional neural network models with different magnification factors; 2. taking adjacent low-resolution image frames as input and obtaining the final reconstruction result through the network model trained in the previous step. This scheme simply feeds 3 adjacent frames of the video into the CNN and cannot fully utilize the multi-frame information in the video.
In both of the above schemes, when multiple image frames of a video are used to obtain a reconstructed image frame, the image frames taken from the video are simply and directly input into a network module, and the network module is trained on them simultaneously. However, as shown in fig. 2, the image frames captured and generated by a terminal device often contain several degradation types, such as blur, noise, down-sampling and compression, and show obvious blur and noise when displayed. If a network module is used to directly merge multiple image frames containing noise and blur and then perform subsequent feature learning and training, the blur and high-intensity noise in the image frames increase the difficulty of the subsequent processing, which may cause the network module to fall into a local optimum or even prevent its training from converging, and reduces the quality of the image frames finally produced by the network module.
In the following, based on the problems of the above solutions, the technical solutions of the embodiments of the present application are proposed.
Fig. 3 is a schematic flowchart of a video image processing method according to an embodiment of the present application, and as shown in fig. 3, the method includes the following steps:
step 301: n first image frames are acquired.
In this embodiment of the application, N first image frames are obtained from a video to be subjected to image processing. N may be any integer greater than or equal to 1, and the N first image frames may be N consecutive first image frames in the video, or N non-consecutive first image frames selected from M consecutive image frames (M is an integer greater than N). Here, if the N first image frames are not consecutive, it is necessary to ensure that the time interval between any two of the N first image frames is smaller than a set time threshold.
In this embodiment of the application, the N first image frames may be the original N image frames obtained directly from the video to be processed, or N image frames obtained by applying certain processing to those original frames.
In a specific embodiment, the N frames of the first image frame may be acquired by: acquiring a video to be subjected to image processing, and acquiring N third image frames from the video; and extracting image data of a specific channel from the N third image frames to obtain the N first image frames.
Specifically, after the original N frames of third image frames are obtained from the video to be subjected to image processing, the image data of the specific channel may be extracted from the N frames of third image frames to obtain N frames of first image frames only including the image data of the specific channel, and when the first image frames are subsequently processed by using the first network module, the feature information included in the N frames of first image frames only including the specific channel may be used.
Here, in a preferred embodiment, for an image frame in the YUV color space, the specific channel may be the Y channel. When the image frame is processed, the feature data of the Y channel has a greater influence on the image processing result, while the U and V channels have a smaller influence; therefore, only the image data of the Y channel may be input into the first network module and processed by the subsequent second network module to obtain the final result frame. In this way, the amount of data processed in the image processing procedure can be reduced. An illustrative extraction sketch is given below.
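The following is a minimal illustrative sketch, not taken from the patent, of how the Y channel of N frames could be extracted before they are handed to the first network module. It assumes the video is decoded with OpenCV; the function name and the use of cv2/numpy are illustrative assumptions.

    # Illustrative sketch (assumption): extract only the Y channel of N frames
    # so the first network module receives single-channel input.
    import cv2
    import numpy as np

    def extract_y_frames(video_path, n_frames):
        cap = cv2.VideoCapture(video_path)
        y_frames = []
        while len(y_frames) < n_frames:
            ok, bgr = cap.read()                  # original (third) image frame
            if not ok:
                break
            yuv = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV)
            y_frames.append(yuv[:, :, 0])         # keep the Y channel only -> first image frame
        cap.release()
        return np.stack(y_frames)                 # shape: (N, H, W)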
It should be noted that, in the embodiment of the present application, the number of channels of the image frame input to the first network module is not limited; for example, an image frame containing multiple channels (for example, the three RGB channels) may also be input directly into the first network module, with the final result frame obtained through the processing of the subsequent second network module.
In one embodiment, after acquiring N first image frames, a first key frame needs to be selected from the N first image frames; the first key frame is one image frame of the N first images.
Here, the selection of the first key frame may be determined according to the specific application scenario, and the first key frame may be any one of the N first image frames. Because the N first image frames are obtained from a video, they have a temporal order; after they are sorted from 1 to N by time, the first key frame can be selected according to this order. For example, when N is 5, the middle frame of the 5 first image frames, i.e. the 3rd frame, may be selected as the first key frame; the 2nd or 4th frame may also be selected. It should be noted that, for a given application scenario, the value of N and the position of the first key frame within the N first image frames generally do not change. For example, for the same video A to be processed, if N is set to 5 and the selected first key frame is the 2nd image after the 5 image frames are time-ordered, then whenever image processing is performed on video A, N is 5 and the 2nd of the 5 image frames is selected as the first key frame.
Step 302: and performing super-resolution reconstruction on each frame of the N frames of first image frames by using a first network module to obtain N frames of second image frames.
Fig. 4 is a video image processing flow when the value of N is 3 according to the embodiment of the present application; fig. 5 is a video image processing flow when the value of N is 5 according to the embodiment of the present application; in fig. 4 and 5, network a represents a first network module and network B represents a second network module.
In the embodiment of the application, the first network module is mainly used for performing super-resolution reconstruction of a single-frame image on each of N first image frames, and restoring the received low-resolution first image frame into a second image frame with higher definition. The first network module may adopt a classical Single Image Super Resolution reconstruction (SISR) method.
In an optional embodiment of the present application, the first network module includes a residual network. In this case, the specific process of performing super-resolution reconstruction on each of the N first image frames by using the first network module to obtain N second image frames is as follows:
for each first image frame of the N first image frames, performing feature learning on the first image frame through the residual network to obtain a feature image corresponding to the first image frame; and splicing the first image frame and the feature image to obtain a second image frame.
Specifically, as shown in fig. 6, the first network module may include a residual network. In fig. 6, a Low Resolution (LR) video frame represents one first image frame of the N first image frames, and an initial Super Resolution (SR) frame represents the image frame obtained by performing single-image super-resolution reconstruction on that first image frame. The residual network performs super-resolution reconstruction on the LR video frame, and an initial SR frame is correspondingly obtained after the reconstruction.
Here, the first image frame and the feature image are spliced through a global residual structure; introducing the global residual structure into the first network module can reduce the training difficulty of the first network module. A minimal sketch follows.
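Below is a minimal sketch, under assumed layer sizes, of a first network module built from residual blocks with a global residual connection, as described above; it is written in PyTorch and is not the patent's actual architecture. This 1:1 variant omits the magnification module, which is sketched separately further on.

    # Hedged sketch of the first network module: residual blocks plus a global
    # residual (skip) connection from the LR input to the output. Channel and
    # block counts are illustrative assumptions.
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, ch=64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1))

        def forward(self, x):
            return x + self.body(x)               # local residual connection

    class FirstNetworkModule(nn.Module):
        def __init__(self, in_ch=1, ch=64, num_blocks=8):
            super().__init__()
            self.head = nn.Conv2d(in_ch, ch, 3, padding=1)
            self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(num_blocks)])
            self.tail = nn.Conv2d(ch, in_ch, 3, padding=1)

        def forward(self, lr_frame):
            feat = self.tail(self.blocks(self.head(lr_frame)))
            return lr_frame + feat                # global residual: LR frame plus learned feature image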
In another optional embodiment of the present application, the first network module includes a multi-level feature extraction network. In this case, the specific process of performing super-resolution reconstruction on each of the N first image frames by using the first network module to obtain N second image frames is as follows:
performing feature learning on each of the N first image frames through the multi-level feature extraction network to obtain a plurality of feature images of different scales; and splicing the plurality of feature images of different scales to obtain a second image frame.
Fig. 7 is a schematic diagram of a first network module that includes a multi-level feature extraction network, also referred to as a pyramid network. After the multi-level feature extraction network is introduced into the first network module, multi-scale information of the LR video frame can be obtained through downsampling. Taking two downsampling steps as an example, in fig. 7 feature learning is performed at the different scales after the two downsampling steps; at the output end of the multi-level feature extraction network, the low-scale information is upsampled by deconvolution and then fused with the high-scale features. In this way the multi-scale information in the LR video frame is effectively exploited, and the neural network structure that combines multi-scale information can reduce the time complexity of the first network module. A sketch of such a branch is given below.
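A sketch of a two-level pyramid branch is shown below, assuming one downsampling step per level, deconvolution-based upsampling and channel concatenation for the multi-scale fusion; all widths and depths are assumptions for illustration.

    # Hedged sketch of a multi-level (pyramid) feature extraction branch:
    # features are learned at full and half scale, the half-scale features are
    # upsampled by a transposed convolution and fused with the full-scale ones.
    # Assumes even spatial dimensions so the two branches match in size.
    import torch
    import torch.nn as nn

    class PyramidFeatures(nn.Module):
        def __init__(self, in_ch=1, ch=64):
            super().__init__()
            self.full = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True))
            self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)            # downsample x2
            self.low = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
            self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)     # deconvolution upsampling
            self.fuse = nn.Conv2d(2 * ch, ch, 1)

        def forward(self, x):
            f_full = self.full(x)                        # high-scale features
            f_low = self.low(self.down(f_full))          # low-scale features
            f_up = self.up(f_low)                        # back to the full scale
            return self.fuse(torch.cat([f_full, f_up], dim=1))   # multi-scale fusion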
In an optional embodiment of the present application, the first network module includes a first magnification module, and the first magnification module is configured to increase the resolution of the image frame input to it.
Specifically, as shown in fig. 6, the first network module may further include an upsampling layer. When the LR video frame is processed by the first network module, the upsampling layer increases the resolution of the image frame input to it by a specified magnification, so that the resolution of the initial SR frame output by the first network module is higher than that of the LR video frame input to the first network module. A possible sub-pixel upsampling sketch follows.
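As a hedged example, a sub-pixel (PixelShuffle) upsampler is one common way to realise such a magnification module; the choice of PixelShuffle and the x3 factor (matching the 360p to 1080p example later in the text) are assumptions, not something the patent specifies.

    # Possible first magnification (upsampling) module using sub-pixel convolution.
    import torch.nn as nn

    def make_upsampler(ch=64, out_ch=1, scale=3):
        return nn.Sequential(
            nn.Conv2d(ch, ch * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),                  # rearranges channels into an image scale times larger
            nn.Conv2d(ch, out_ch, 3, padding=1))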
Step 303: and aligning and fusing the N frames of second image frames by using a second network module to obtain a result frame.
Specifically, after the first network module performs single-image super-resolution reconstruction on each of the N first image frames to obtain the N second image frames, the second network module may align and fuse the N second image frames to obtain the final result frame. Here, when N is greater than 2, the N second image frames may be input into the second network module simultaneously, and the second network module performs the alignment and fusion processing on all of them at once to obtain the reconstructed result frame. Alternatively, the N second image frames may be grouped in pairs, and the second network module aligns and fuses every two second image frames.
In a preferred embodiment, the alignment and fusion processing of the N second image frames by using the second network module to obtain a result frame specifically includes:
aligning and fusing each frame of the N-1 frames of second image frames with the second key frame by using the second network module to obtain N-1 first fused frames; the N-1 frame second image frames are image frames except the second key frame in the N frame second image frames;
aligning and fusing every two first fusion frames in the N-1 first fusion frames by using the second network module to obtain M second fusion frames; m is an integer greater than or equal to 1 and less than N-1;
and acquiring a result frame based on the M second fusion frames.
Here, the process of obtaining the result frame based on the M second fusion frames may be implemented as follows:
determining the second fused frame as a result frame if the M is equal to 1;
when the M is larger than 1, the second network module is utilized to carry out alignment fusion processing on every two second fusion frames in the M second fusion frames; if the number of the obtained fusion frames is 1, determining the obtained fusion frames as result frames; and if the number of the obtained fusion frames is more than 1, continuing to perform alignment fusion processing on every two obtained fusion frames in the plurality of obtained fusion frames by using the second network module until the number of the obtained fusion frames is 1.
It should be noted that, when the second network module is used to perform the alignment and fusion processing of image frames, if three or more image frames are input to the second network module directly and the second network module has to align and fuse all of them at once, the training difficulty of the second network module increases, making it hard to converge and prone to falling into a local optimum. In the embodiment of the application, the second network module receives only two image frames as input, which reduces its training difficulty and improves the effect of aligning and fusing the two image frames. A sketch of the resulting pairwise fusion schedule is shown below.
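The pairwise schedule described above can be sketched as follows; second_network stands in for the trained second network module and the function name is an illustrative assumption.

    # Hedged sketch of the pairwise alignment-and-fusion schedule: each non-key
    # frame is first fused with the key frame, then the fused frames are merged
    # two at a time until a single result frame remains.
    def pairwise_fuse(second_frames, key_index, second_network):
        key = second_frames[key_index]
        # stage 1: fuse each of the N-1 non-key frames with the second key frame
        fused = [second_network(frame, key)
                 for i, frame in enumerate(second_frames) if i != key_index]
        # later stages: merge the fused frames two by two until one frame is left
        while len(fused) > 1:
            nxt = [second_network(fused[i], fused[i + 1])
                   for i in range(0, len(fused) - 1, 2)]
            if len(fused) % 2 == 1:                  # a leftover frame: pair it with the last merge
                nxt[-1] = second_network(nxt[-1], fused[-1])
            fused = nxt
        return fused[0]                              # the result frame

For N = 3 this performs two fusion rounds and for N = 5 three rounds, matching the flows of fig. 4 and fig. 5.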
Fig. 4 shows the processing procedure in which, when N is 3, the 3 first image frames are processed by network A and network B to obtain a result frame; after the 3 first image frames are sorted, the middle image frame is selected as the first key frame (i.e., i is 2).
Fig. 5 shows the processing procedure in which, when N is 5, the 5 first image frames are processed by network A and network B to obtain a result frame; after the 5 first image frames are sorted, the middle image frame is selected as the first key frame (i.e., i is 3).
It should be noted that, when the second network module further fuses the fused frames generated in the previous stage, if a single fused frame remains ungrouped after the fused frames of the previous stage are paired two by two, that fused frame may be discarded, or it may optionally be combined with another fused frame and the two may be further aligned and fused by the second network module.
In the embodiment of the present application, the second network module may be implemented by a convolutional neural network, or by conventional algorithms such as block matching or a homography-matrix-based method.
In an optional implementation manner of this application, the second network module includes: a splicing module and a fusion module;
the splicing module is used for splicing the two image frames input to it in the target dimension and outputting the feature data obtained by splicing the two image frames;
the fusion module is used for fusing the spliced feature data and outputting one fused frame corresponding to the two image frames.
Here, in addition to the splicing module and the fusion module, the second network module further includes an alignment module. The alignment module is located either between the splicing module and the fusion module or before them, and is configured to align the two image frames input to it and output the feature data of the aligned image frames.
In the case where the alignment module is located between the splicing module and the fusion module, the specific processing of two image frames by the second network module is as follows: the splicing module splices the two image frames input to it in the target dimension and outputs the feature data obtained by splicing the two image frames; the alignment module aligns the spliced feature data and outputs the feature data after the two image frames are aligned; and the fusion module fuses the aligned feature data and outputs one fused frame corresponding to the two image frames.
Specifically, as shown in fig. 8, the second network module outputs one fused frame corresponding to the two received image frames, following the processing order of "splicing-alignment-fusion".
For the second network module shown in fig. 8, the splicing module first splices the two input image frames in the third dimension, and the spliced data are then sent in turn to the alignment module and the fusion module for learning. Since each convolutional layer in the alignment module and the fusion module has a plurality of filters, the alignment and fusion of the two image frames can be achieved through the filters of these two modules, and the fused frame corresponding to the two image frames is output.
It should be noted that "the two image frames" refers to different image frames at different stages. Illustratively, when the second network module performs the alignment and fusion processing for the first time, the two image frames are the second key frame corresponding to the first key frame and one of the other N-1 second image frames. When the second network module performs the second and later rounds of alignment and fusion, the two image frames are the two fused frames of one pair obtained by grouping the fused frames generated in the previous stage.
In this embodiment, the target dimension refers to the image channel dimension, i.e. the dimension other than width and height in an image frame. For example, for a color image with three RGB channels, the dimension along which the R (red), G (green) and B (blue) color channels lie is the third dimension. In this case, the splicing performed by the splicing module can be understood as taking the three R, G and B channel images as input and, after splicing along the third dimension, outputting a normal three-channel RGB color image.
Here, by using a second network module formed by cascading the splicing module, the alignment module and the fusion module in sequence, the information of the two image frames input to the second network module can be efficiently fused and used in the processing order of "splicing-alignment-fusion". A minimal sketch of such a module is shown below.
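A minimal PyTorch sketch of the "splicing-alignment-fusion" ordering is given below; the layer counts and channel widths are assumptions, and only the structure (channel concatenation followed by convolutional alignment and fusion stages) reflects the description above.

    # Hedged sketch of the second network module of fig. 8: splice two frames
    # along the channel (target) dimension, then align and fuse with convolutions.
    import torch
    import torch.nn as nn

    class SecondNetworkModule(nn.Module):
        def __init__(self, in_ch=1, ch=64):
            super().__init__()
            self.align = nn.Sequential(              # alignment module
                nn.Conv2d(2 * in_ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
            self.fuse = nn.Sequential(               # fusion module
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, in_ch, 3, padding=1))

        def forward(self, frame_a, frame_b):
            spliced = torch.cat([frame_a, frame_b], dim=1)   # splicing module: channel concatenation
            return self.fuse(self.align(spliced))            # one fused frame out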
In the case where the alignment module is located before the splicing module and the fusion module, the specific processing of two image frames by the second network module is as follows: the alignment module aligns the two image frames input to it and outputs the feature data of the aligned image frames; the splicing module splices the aligned feature data in the target dimension and outputs the feature data obtained by splicing the two image frames; and the fusion module fuses the spliced feature data and outputs one fused frame corresponding to the two image frames.
Specifically, as shown in fig. 9, the second network module outputs one fused frame corresponding to the two image frames, following the processing order of "alignment-splicing-fusion".
For the second network module shown in fig. 9, the alignment module aligns the two input image frames through several convolutional layers; after the two aligned image frames are spliced along the third dimension by the splicing module, they are input into the convolutional structure of the subsequent fusion module to complete the fusion of the two image frames, and one fused frame corresponding to the two image frames is output.
Here, by using a second network module formed by cascading the alignment module, the splicing module and the fusion module in sequence, the feature information of the two image frames input to the second network module can be efficiently fused and used in the processing order of "alignment-splicing-fusion".
In the embodiment of the application, in order to achieve better alignment, special convolutional structures such as dilated (atrous) convolution layers and deformable convolution layers may be used in the alignment module, improving the effect of aligning the two image frames. A hedged example of a dilated-convolution alignment block is given below.
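The dilation rates in this example are assumptions. A deformable convolution (for example torchvision.ops.DeformConv2d) could be used in a similar position but requires an additional offset-prediction branch, which is omitted here for brevity.

    # Dilated (atrous) convolutions enlarge the receptive field of the alignment
    # module without downsampling; padding equals the dilation rate so the
    # spatial size is preserved.
    import torch.nn as nn

    def dilated_align_block(ch=64):
        return nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, dilation=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=4, dilation=4), nn.ReLU(inplace=True))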
In an optional implementation manner of this application, the second network module further includes a second magnification module, where the second magnification module is configured to increase the resolution of the image frame input to it.
Specifically, in the embodiment of the present application, in addition to arranging the first magnification module in the first network module to increase the resolution of the image frame, a second magnification module may be arranged in the second network module according to the specific application scenario to increase the resolution of the image frame input to it.
It should be noted that, in the embodiment of the present application, the factor by which the resolution of the finally generated result frame is increased relative to the N second image frames input to the second network module depends on the value of N and on the magnification set in the magnification module. For example, as shown in fig. 4, when N is 3 the final result frame is obtained through two rounds of alignment and fusion in the second network module, so if the magnification of the second magnification module is 2, the overall magnification of the final result frame relative to the N second image frames is 4; in fig. 5, when N is 5 the final result frame is obtained through 3 rounds of alignment and fusion, so if the magnification of the second magnification module is 2, the overall magnification of the final result frame relative to the N second image frames is 8. A small worked check of this relation follows.
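The relation can be checked with the small illustrative calculation below (not code from the patent): the number of fusion rounds is counted for a given N, and the overall factor is the per-round magnification raised to that number of rounds.

    # Worked check: with a per-round magnification of 2, the overall factor is
    # 2 ** (number of fusion rounds).
    import math

    def fusion_rounds(n):                   # n = number of second image frames
        rounds, remaining = 1, n - 1        # round 1 fuses the N-1 non-key frames with the key frame
        while remaining > 1:
            remaining = math.ceil(remaining / 2)
            rounds += 1
        return rounds

    for n in (3, 5):
        print(n, 2 ** fusion_rounds(n))     # prints 3 4 and 5 8, matching fig. 4 and fig. 5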
Here, it should be noted that, in the embodiment of the present application, the first magnification module in the first network module and the second magnification module in the second network module may be configured flexibly according to the target magnification factor. If the image frame does not need to be magnified during the whole video image processing, the first magnification module in the first network module and the second magnification module in the second network module may both be omitted; this is equivalent to performing an equal-ratio (1:1) reconstruction of the first key frame with the first network module and the second network module. By arranging the first magnification module and the second magnification module in the first network module and the second network module as required, the resolution of the first key frame can be increased while its definition is improved, and the method can be applied to various scenes that require super-resolution reconstruction of image frames in a video.
In an optional embodiment of the present application, the following method may be adopted for training the first network module and the second network module:
training the first network module by using a first training data set to obtain a trained first network module;
and determining a second training data set for training the second network module by using the trained first network module, and training the second network module by using the second training data set to obtain the trained second network module.
Specifically, when the first network module and the second network module are trained, an SISR data set may be used to train the first network module; after the first network module converges, a new training data set for training the second network module is generated with the help of the trained first network module, which ensures that the second network module receives clear image frames as input during its training.
In the embodiment of the application, the first training data set is the data set used to train the first network module, and the second training data set is the data set used to train the second network module. The second training data set may be obtained by feeding the first training data set through the trained first network module, or by processing a third data set with the trained first network module, where the third data set is a data set different from the first training data set. A high-level sketch of this two-stage training strategy is given below.
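The sketch below outlines the two-stage strategy at a high level. The data loaders, the L1 loss, the Adam optimizer and the reuse of the pairwise_fuse helper sketched earlier are all assumptions made for illustration; they are not details given in the patent.

    # Hedged sketch of two-stage training: first train the first network module
    # on single-image SR pairs, then freeze it and train the second network
    # module on its (clean) outputs.
    import torch
    import torch.nn as nn

    def train_two_stage(sisr_loader, video_loader, first_net, second_net, epochs=1):
        l1 = nn.L1Loss()
        # Stage 1: first network module on an SISR data set
        opt1 = torch.optim.Adam(first_net.parameters(), lr=1e-4)
        for _ in range(epochs):
            for lr_frame, hr_frame in sisr_loader:
                opt1.zero_grad()
                l1(first_net(lr_frame), hr_frame).backward()
                opt1.step()
        # Stage 2: second network module on the outputs of the converged first module
        first_net.eval()
        opt2 = torch.optim.Adam(second_net.parameters(), lr=1e-4)
        for _ in range(epochs):
            for lr_frames, hr_key in video_loader:        # lr_frames: list of N frames
                with torch.no_grad():
                    clean = [first_net(f) for f in lr_frames]
                opt2.zero_grad()
                result = pairwise_fuse(clean, key_index=len(clean) // 2,
                                       second_network=second_net)   # see the earlier sketch
                l1(result, hr_key).backward()
                opt2.step()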
Here, taking the PyTorch deep learning framework as an example, the input during training and testing is a four-dimensional tensor, i.e. number × channels × width × height. When the scheme of the present application is used, in order to reduce the time consumed by network deployment, the first network module directly receives image frames with a data size of 3 × 1 × 640 × 360 (taking a 3-frame input as an example and processing only the Y channel of 360p image frames), and the image frame of the final output result has a data size of 3 × 1 × 1920 × 1080 (a 3-fold magnification). The input received by the second network module is similar to that of the first network module; the specific input specification is determined by the network structure and the actual application scenario.
The reconstruction task for image frames in a video is split into two sub-network modules (namely the first network module and the second network module), and each module fulfils its own function. During training, the two network modules are trained separately, so that each can reach good convergence; finally the two trained modules are cascaded to obtain a better reconstruction effect for the video image frames. In addition, the high-quality image frames output by the first network module reduce the training difficulty of the second network module and improve the reconstruction effect of the image frames in the video.
According to the technical scheme of the embodiment of the application, the temporal information of multiple image frames in a video can be fully utilized. The reconstruction task for a low-quality image frame in the video is split between a first network module and a second network module: the first network module performs SISR processing on each image frame of the low-quality multi-frame input, reducing the noise and blur in these frames and improving the definition of each of them; the processed image frames are then sent to the second network module, which performs alignment and fusion processing on them. During the alignment and fusion, the second network module is isolated from the interference of noise and similar problems in the original image frames, so the effective information in the multiple frames can be fused and exploited. In addition, splitting the processing of the image frames in the video into the two sub-networks, the first network module and the second network module, improves the flexibility of the whole video image processing task.
Fig. 10 is a schematic structural component diagram of a video image processing apparatus according to an embodiment of the present application, and as shown in fig. 10, the video image processing apparatus includes:
an input module 1001, configured to acquire N first image frames and input the N first image frames into the first network module;
the first network module 1002 is configured to perform super-resolution reconstruction on each of the N first image frames to obtain N second image frames;
the second network module 1003 is configured to perform alignment fusion processing on the N second image frames to obtain a result frame.
In an optional embodiment of the present application, the apparatus further comprises:
a selecting module 1004, configured to select a first key frame from the N first image frames, where the first key frame is one image frame of the N first image frames;
and performing super-resolution reconstruction on the first key frame through the first network module to obtain a second key frame, wherein the second key frame is one of the N second image frames.
In an optional embodiment of the present application, the second network module 1003 is specifically configured to: aligning and fusing each frame of the N-1 frames of second image frames with the second key frame by using the second network module to obtain N-1 first fused frames; the N-1 frame second image frames are image frames except the second key frame in the N frame second image frames; aligning and fusing every two first fusion frames in the N-1 first fusion frames by using the second network module to obtain M second fusion frames; m is an integer greater than or equal to 1 and less than N-1; and acquiring a result frame based on the M second fusion frames.
In an optional embodiment of the present application, the second network module 1003 is further specifically configured to: determining the second fused frame as a result frame if the M is equal to 1; when the M is larger than 1, the second network module is utilized to carry out alignment fusion processing on every two second fusion frames in the M second fusion frames; if the number of the obtained fusion frames is 1, determining the obtained fusion frames as result frames; and if the number of the obtained fusion frames is more than 1, continuing to perform alignment fusion processing on every two obtained fusion frames in the plurality of obtained fusion frames by using the second network module until the number of the obtained fusion frames is 1.
In an optional embodiment of the present application, the first network module 1002 includes a residual network;
the first network module 1002 is specifically configured to: for each first image frame of the N first image frames, perform feature learning on the first image frame through the residual network to obtain a feature image corresponding to the first image frame; and splice the first image frame and the feature image to obtain a second image frame.
In an optional embodiment of the present application, the first network module 1002 includes a multi-stage feature extraction network;
the first network module 1002 is specifically configured to: respectively performing feature learning on each frame of the N frames of first image frames through the multistage feature extraction network to obtain a plurality of feature images with different scales; and splicing the plurality of characteristic images with different scales to obtain a second image frame.
In an optional embodiment of the present application, the first network module 1002 includes a first magnification module, and the first magnification module is configured to increase the resolution of the image frame input to it.
In an optional embodiment of the present application, the second network module 1003 includes: a splicing module and a fusion module;
the splicing module is used for splicing the two image frames input to it in the target dimension and outputting the feature data obtained by splicing the two image frames;
and the fusion module is used for fusing the spliced feature data and outputting one fused frame corresponding to the two image frames.
In an optional embodiment of the present application, the second network module 1003 further includes: an alignment module; the alignment module is positioned between the splicing module and the fusion module, or the alignment module is positioned before the splicing module;
the alignment module is used for aligning the two image frames input to the alignment module and outputting the aligned feature data of the two image frames.
In an optional embodiment of the present application, the apparatus further comprises:
a first training module 1005, configured to train the first network module by using a first training data set, so as to obtain a trained first network module;
a second training module 1006, configured to determine a second training data set for training the second network module by using the trained first network module, and train the second network module by using the second training data set, so as to obtain the trained second network module.
In an optional embodiment of the present application, the input module 1001 is specifically configured to: acquiring a video to be subjected to image processing, and acquiring N third image frames from the video; and extracting image data of a specific channel from the N third image frames to obtain the N first image frames.
It should be understood by those skilled in the art that the functions of the modules in the video image processing apparatus shown in fig. 10 can be understood by referring to the related description of the video image processing method. The functions of the respective blocks in the video image processing apparatus shown in fig. 10 may be implemented by a program running on a processor, or may be implemented by specific logic circuits.
An embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the video image processing method according to the foregoing embodiment.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the video image processing method according to the foregoing embodiment.
The technical solutions described in the embodiments of the present application can be arbitrarily combined without conflict.
In the several embodiments provided in the present application, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit can be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (14)

1. A method for processing video images, the method comprising:
acquiring N frames of first image frames;
performing super-resolution reconstruction on each frame of the N frames of first image frames by using a first network module to obtain N frames of second image frames;
and aligning and fusing the N frames of second image frames by using a second network module to obtain a result frame.
2. The method of claim 1, further comprising:
selecting a first key frame from the N first image frames, wherein the first key frame is one of the N first image frames;
and performing super-resolution reconstruction on the first key frame through the first network module to obtain a second key frame, wherein the second key frame is one of the N second image frames.
3. The method according to claim 2, wherein the performing, by the second network module, the alignment fusion process on the N second image frames to obtain a result frame comprises:
aligning and fusing each frame of the N-1 frames of second image frames with the second key frame by using the second network module to obtain N-1 first fused frames; the N-1 frame second image frames are image frames except the second key frame in the N frame second image frames;
aligning and fusing every two first fusion frames in the N-1 first fusion frames by using the second network module to obtain M second fusion frames; m is an integer greater than or equal to 1 and less than N-1;
and acquiring a result frame based on the M second fusion frames.
4. The method of claim 3, wherein obtaining the result frame based on the M second fused frames comprises:
determining the second fused frame as a result frame if the M is equal to 1;
when the M is larger than 1, the second network module is utilized to carry out alignment fusion processing on every two second fusion frames in the M second fusion frames; if the number of the obtained fusion frames is 1, determining the obtained fusion frames as result frames; and if the number of the obtained fusion frames is more than 1, continuing to perform alignment fusion processing on every two obtained fusion frames in the plurality of obtained fusion frames by using the second network module until the number of the obtained fusion frames is 1.
5. The method of claim 1, wherein the first network module comprises a residual network; wherein,
the super-resolution reconstruction of each frame of the N frames of first image frames by using the first network module to obtain N frames of second image frames includes:
for each frame of first image frame in the N frames of first image frames, performing feature learning on the frame of first image frame through the residual network to obtain a feature image corresponding to the frame of first image frame;
and splicing the first image frame and the characteristic image to obtain a second image frame.
6. The method of claim 1, wherein the first network module comprises a multi-stage feature extraction network; wherein,
the super-resolution reconstruction of each frame of the N frames of first image frames by using the first network module to obtain N frames of second image frames includes:
respectively performing feature learning on each frame of the N frames of first image frames through the multistage feature extraction network to obtain a plurality of feature images with different scales;
and splicing the plurality of characteristic images with different scales to obtain a second image frame.
7. The method according to any one of claims 1 to 6, wherein the first network module comprises a first magnification module for increasing the resolution of the image frame input to the first magnification module.
8. The method according to any of claims 1 to 6, wherein the second network module comprises: a splicing module and a fusion module;
the splicing module is used for splicing the two image frames input to the splicing module in a target dimension and outputting the characteristic data after splicing the two image frames;
the fusion module is used for fusing the spliced characteristic data and outputting one fused frame corresponding to the two image frames.
9. The method of claim 8, wherein the second network module further comprises: an alignment module; the alignment module is positioned between the splicing module and the fusion module, or the alignment module is positioned before the splicing module;
the alignment module is used for aligning the two image frames input to the alignment module and outputting the aligned feature data of the two image frames.
10. The method according to any one of claims 1 to 6, further comprising:
training the first network module by using a first training data set to obtain a trained first network module;
and determining a second training data set for training the second network module by using the trained first network module, and training the second network module by using the second training data set to obtain the trained second network module.
11. The method of any of claims 1 to 6, wherein said obtaining N first image frames comprises:
acquiring a video to be subjected to image processing, and acquiring N third image frames from the video;
and extracting image data of a specific channel from the N third image frames to obtain the N first image frames.
12. An apparatus for processing video images, the apparatus comprising:
the input module is used for acquiring N frames of first image frames and inputting the N frames of first image frames into the first network module;
the first network module is used for performing super-resolution reconstruction on each frame of first image frames in the N frames of first image frames to obtain N frames of second image frames;
and the second network module is used for carrying out alignment fusion processing on the N frames of second image frames to obtain a result frame.
13. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program that, when executed by the processor, causes the processor to perform the method of any of claims 1 to 11.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 11.
CN202010795380.8A 2020-08-10 2020-08-10 Video image processing method and device, electronic equipment and storage medium Pending CN111932459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010795380.8A CN111932459A (en) 2020-08-10 2020-08-10 Video image processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010795380.8A CN111932459A (en) 2020-08-10 2020-08-10 Video image processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111932459A true CN111932459A (en) 2020-11-13

Family

ID=73307608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010795380.8A Pending CN111932459A (en) 2020-08-10 2020-08-10 Video image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111932459A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610713A (en) * 2021-08-13 2021-11-05 北京达佳互联信息技术有限公司 Training method of video super-resolution model, video super-resolution method and device
CN113610713B (en) * 2021-08-13 2023-11-28 北京达佳互联信息技术有限公司 Training method of video super-resolution model, video super-resolution method and device

Similar Documents

Publication Publication Date Title
CN111260560B (en) Multi-frame video super-resolution method fused with attention mechanism
CN111861961A (en) Multi-scale residual error fusion model for single image super-resolution and restoration method thereof
US11727541B2 (en) Video super resolution method
CN111028150A (en) Rapid space-time residual attention video super-resolution reconstruction method
CN112804561A (en) Video frame insertion method and device, computer equipment and storage medium
Aich et al. Semantic binary segmentation using convolutional networks without decoders
CN108875900A (en) Method of video image processing and device, neural network training method, storage medium
CN111681177B (en) Video processing method and device, computer readable storage medium and electronic equipment
CN111612722B (en) Low-illumination image processing method based on simplified Unet full-convolution neural network
CN109785252A (en) Based on multiple dimensioned residual error dense network nighttime image enhancing method
CN116152120A (en) Low-light image enhancement method and device integrating high-low frequency characteristic information
CN113947531A (en) Iterative collaborative video super-resolution reconstruction method and system
CN113344794B (en) Image processing method and device, computer equipment and storage medium
Yue et al. A global appearance and local coding distortion based fusion framework for CNN based filtering in video coding
CN115115516A (en) Real-world video super-resolution algorithm based on Raw domain
CN111860363A (en) Video image processing method and device, electronic equipment and storage medium
Mehta et al. Gated multi-resolution transfer network for burst restoration and enhancement
CN107220934A (en) Image rebuilding method and device
CN111932459A (en) Video image processing method and device, electronic equipment and storage medium
WO2023202447A1 (en) Method for training image quality improvement model, and method for improving image quality of video conference system
CN115345801B (en) Image compression and filter removal method and system based on image denoising idea
CN116883265A (en) Image deblurring method based on enhanced feature fusion mechanism
JP5203824B2 (en) Image processing apparatus and imaging system
CN115471417A (en) Image noise reduction processing method, apparatus, device, storage medium, and program product
US11928855B2 (en) Method, device, and computer program product for video processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination