CN113362224A - Image processing method and device, electronic equipment and readable storage medium
- Publication number
- CN113362224A (application CN202110604628.2A)
- Authority
- CN
- China
- Prior art keywords
- image
- image frames
- target
- image frame
- frames
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4046—Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06T3/4076—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution using the original low-resolution images to iteratively correct the high-resolution images
Abstract
The application discloses an image processing method, an image processing device, electronic equipment and a readable storage medium, and belongs to the technical field of communication. The method comprises the following steps: extracting M adjacent first image frames from the N first image frames to serve as one second image frame, obtaining N second image frames, where N and M are positive integers and M is less than N; performing channel conversion on a target second image frame and extracting depth features of the converted image frame, where the target second image frame is any one of the N second image frames; generating third image frames based on the depth features of the converted image frames, where one second image frame corresponds to one third image frame; and synthesizing the N third image frames into a second video according to the time sequence of the N first image frames. Each first image frame corresponds to one third image frame, and the third image frame has a higher resolution than the first image frame.
Description
Technical Field
The embodiment of the application relates to the technical field of communication, in particular to an image processing method and device, an electronic device and a readable storage medium.
Background
With the rapid development of the internet and multimedia technology, video is becoming more and more widely used in real life as an important form of multimedia information carrier.
In order to improve the sharpness of a video, a super-resolution technique is generally used in the related art to process the video. Specifically, the low-resolution video stored on the server is converted into a high-resolution video for the user to watch, relying on the strong computing power of the GPU on the server side.
However, the above method for improving video definition can only be used in electronic devices with strong GPU computing power, and cannot be applied to mobile terminals with low image processing capability.
Disclosure of Invention
An object of the embodiments of the present application is to provide an image processing method, an image processing apparatus, an electronic device, and a readable storage medium, which can solve the problem that the definition of a played video cannot be improved by an electronic device with a low image processing capability.
In order to solve the technical problem, the present application is implemented as follows:
In a first aspect, an embodiment of the present application provides an image processing method, including: extracting M adjacent first image frames from the N first image frames to serve as one second image frame, obtaining N second image frames, where N and M are positive integers and M is less than N; performing channel conversion on a target second image frame and extracting depth features of the converted image frame, where the target second image frame is any one of the N second image frames; generating third image frames based on the depth features of the converted image frames, where one second image frame corresponds to one third image frame; and synthesizing the N third image frames into a second video according to the time sequence of the N first image frames. Each first image frame corresponds to one third image frame, and the third image frame has a higher resolution than the first image frame.
In a second aspect, an embodiment of the present application further provides an image processing apparatus, including: an extraction module, a processing module, a generation module and a synthesis module. The extraction module is used for extracting M adjacent first image frames from the N first image frames to serve as one second image frame, obtaining N second image frames, where N and M are positive integers and M is less than N. The processing module is used for carrying out channel conversion on the target second image frame extracted by the extraction module and extracting depth features of the converted image frame, where the target second image frame is any one of the N second image frames. The generation module is used for generating third image frames based on the depth features of the image frames converted by the processing module, where one second image frame corresponds to one third image frame. The synthesis module is used for synthesizing the N third image frames generated by the generation module into a second video according to the time sequence of the N first image frames extracted by the extraction module. Each first image frame corresponds to one third image frame, and the third image frame has a higher resolution than the first image frame.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the image processing method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiment of the application, when a user uses the electronic equipment to play a video, N second image frames are obtained by extracting M adjacent first image frames from the N first image frames as one second image frame; and then, respectively carrying out channel conversion on each second image frame in the N second image frames, extracting the depth characteristics of the converted image frames, carrying out super-resolution processing on the image frames of the original video to obtain image frames with higher definition, and finally synthesizing the image frames with higher resolution into a high-resolution video according to the time sequence of the original low-resolution image frames. The electronic equipment with lower image processing capacity can utilize the target model to perform super-resolution processing on the video with lower definition, and play the processed video in real time, so that the video super-resolution processing capacity of the electronic equipment with lower image processing capacity is greatly improved.
Drawings
Fig. 1 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a model structure applied by an image processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 5 is a second schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second" and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances, so that embodiments of the application may be practiced in sequences other than those illustrated or described herein; the terms "first", "second" and the like are generally used in a generic sense and do not limit the number of objects, e.g., the first object can be one or more than one. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally indicates that the related objects before and after it are in an "or" relationship.
The image processing method provided by the embodiment of the application can be applied to scenes for improving the definition of videos on electronic equipment with lower image processing capacity.
For example, consider the scene in which the definition of a video is improved on an electronic device with a lower image processing capability, taking a mobile terminal as an example of such a device. In the related art, because the mobile terminal is limited by the size of its storage space, the resolution of the video shot by the mobile terminal is generally low and the definition is poor; the existing super-resolution technology can convert the video with the lower resolution into a video with a higher resolution to improve the definition of the video. However, the video with the higher resolution cannot be directly stored on the mobile terminal due to the limitation of its storage space. If the mobile terminal is required to play the high-resolution video while the storage space occupied by the video remains relatively small, the mobile terminal needs to convert the low-resolution video into the high-resolution video in real time while playing the video. The existing mobile terminal cannot achieve this.
In view of the above problem, in the technical solution provided in the embodiment of the present application, a depth model is provided, which greatly reduces the computation workload of the existing model by adopting early fusion, narrow and deep network design, fast upsampling and other modes, and by optimizing the model structure, the depth model can achieve the effect of performing real-time super-resolution on a 360p video on an electronic device with low image processing capability, thereby greatly improving the definition of the played video.
The image processing method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
As shown in fig. 1, an image processing method provided in an embodiment of the present application may include the following steps 201 to 204:
Step 201, the image processing device extracts M adjacent first image frames from the N first image frames as one second image frame, to obtain N second image frames.
Wherein N and M are both positive integers, and M is less than N.
Illustratively, the N first image frames are image frames of a first video, and the first video may be a video stored in the electronic device or a network streaming media video acquired by the electronic device.
Illustratively, the image processing device extracts image frames in the video after acquiring the first video. The image processing device may extract all image frames in the first video, and may also extract image frames to be or being played.
In a possible implementation manner, the image processing apparatus may extract an image frame currently played when a user plays a video using the electronic device, and output the image frame to a screen of the electronic device after processing the image frame, so as to implement real-time super-resolution processing on the video.
Step 202, the image processing device performs channel conversion on a target second image frame, and extracts depth features of the converted image frame.
Wherein the target second image frame is any one of the N second image frames.
Each of the N second image frames uses a first image frame as a reference image frame, and uses the reference image frame and a preset number of adjacent image frames before and after the reference image frame as a second image frame. I.e. one first image frame corresponds to one second image frame.
Illustratively, in order to be able to extract image detail features of the second image frame, the image processing device performs channel conversion on the second image frame including M consecutive first image frames. For example, the image processing apparatus may convert a second image frame of three channels (e.g., red-green-blue (RGB) channels) into an image frame having more channels.
Optionally, in a possible implementation manner, in order to reduce the computational complexity of the image processing apparatus and improve the computational efficiency, the step 202 may further include the following step 202a:
step 202a, the image processing device converts the target second image frame into an eight-channel image frame.
Step 203, the image processing device generates a third image frame based on the depth features of the converted image frame.
One second image frame corresponds to one third image frame, that is, one second image frame, after being processed, generates one third image frame.
For example, after converting the second image frame into image frames of more channels, in order to further process the multi-channel image frame to obtain an image frame with higher definition, the image processing apparatus may extract a depth feature of the multi-channel image frame, and then generate a third image frame with higher definition based on the depth feature.
It should be noted that the image processing apparatus needs to process each of the N second image frames according to the methods in step 202 and step 203, so as to obtain N third image frames with higher definition.
Step 204, the image processing device synthesizes the N third image frames into a second video according to the time sequence of the N first image frames.
Illustratively, a first image frame corresponds to a second image frame, and a second image frame may be processed to obtain a third image frame, so that a third image frame may correspond to a first image frame. After obtaining the N third image frames, the image processing apparatus may adjust the order of the N third image frames based on the timing of the first image frame or the second image frame corresponding to each third image frame, and synthesize the second video.
It can be understood that, in order to ensure the continuity of video playing, the image processing device may perform super-resolution processing on each image frame in sequence according to the time sequence of each image frame in the first video, to obtain N third image frames having the same time sequence as that of the N first image frames, so that the image processing device can ensure the continuity of the video content when playing the video.
Illustratively, the definition of the image frame processed by the image processing device is higher than that of the image frame in the first video before processing, i.e. the resolution of the third image frame is higher than that of the first image frame. The image frames before processing correspond to the image frames after processing one by one.
Illustratively, after obtaining N third image frames with the same time sequence as the N first image frames, the image processing apparatus synthesizes the N third image frames into a video, and outputs the video to a screen of the electronic device for a user to watch.
Therefore, when a user plays a video by using the electronic equipment, N second image frames are obtained by extracting M adjacent first image frames from the N first image frames as one second image frame; and then, respectively carrying out channel conversion on each second image frame in the N second image frames, extracting the depth characteristics of the converted image frames, carrying out super-resolution processing on the image frames of the original video to obtain image frames with higher definition, and finally synthesizing the image frames with higher resolution into a high-resolution video according to the time sequence of the original low-resolution image frames. The electronic equipment with lower image processing capacity can utilize the target model to perform super-resolution processing on the video with lower definition, and play the processed video in real time, so that the video super-resolution processing capacity of the electronic equipment with lower image processing capacity is greatly improved.
Optionally, in this embodiment of the application, the step 202 and the step 203 may process the second image frame based on the target model, so as to obtain a third image frame.
Illustratively, the third image frame is obtained from the second image frame based on a target model.
Illustratively, the above steps 202 and 203 may include the following steps 202b1 and 202b2:
step 202b1, the image processing device performs channel conversion on the target second image frame through the target model, and extracts the depth feature of the converted image frame.
Step 202b2, the image processing device generates a third image frame according to the depth feature of the transformed image frame by the target model.
For example, in order to reduce the computation amount of the model when processing the super-resolution video and improve the generalization capability of the model, the target model provided in the embodiment of the present application has lower model complexity, a lighter training strategy and higher generalization capability, among other characteristics, compared with the video super-resolution models in the related art.
Illustratively, the target model is a model obtained after training of a target convolutional neural network. The target convolutional neural network comprises an input layer, a feature extraction layer and an output layer.
Illustratively, the input layer includes 8 convolution kernels for converting an input image frame from a three-channel image to an eight-channel image. The feature extraction layer comprises 16 residual blocks and is used for extracting the depth features of the image frames converted by the input layer. The output layer is used to generate a high resolution image.
Illustratively, the image processing apparatus, after extracting N low-resolution second image frames of the first video, takes M first image frames in each of the second image frames as one input of the target model. The target model converts the M first image frames into an eight-channel feature map, extracts depth features of the feature map, and restores the feature map into a high-resolution three-channel image, namely the third image frame, based on the depth features.
The training process for training the convolutional neural network to obtain the target model may be set in an electronic device with high computation capability, for example, a server or a Personal Computer (PC). Then, the trained target model may be installed in an electronic device with low processing capability, and the trained model may be used to perform super-resolution processing on the video, for example, in a mobile terminal. Of course, the training process may also be located in an electronic device with low processing capability, which is not limited in this application.
Optionally, in this embodiment of the application, before processing a video using the target model, the target convolutional neural network model needs to be trained to obtain a target model meeting the requirements.
For example, before training the target convolutional neural network model, a training sample set for training the target convolutional neural network model needs to be created. Therefore, before the step 202, the image processing method provided in the embodiment of the present application may further include the following steps 205a1 to 205a4:
step 205a1, the image processing device extracts M high resolution image frames from a target video of a training sample set containing a plurality of high definition videos.
Wherein the target video is at least one sample in the set of training samples.
For example, the image processing apparatus may select one or more high definition videos from the training sample set of the plurality of high definition videos, split the selected high definition videos, and extract each frame of high definition image frame. And then, selecting M continuous image frames from the extracted high-definition image frames as a group of image frames, and obtaining a plurality of groups of image frames.
Illustratively, embodiments of the present application use videos in the Vimeo90K dataset as raw videos and train the model in a PyTorch environment.
After the target video is acquired, the image processing device decodes the target video by using a video processing tool, splits the target video into a frame sequence, and converts the frame sequence into a data format recognizable by a target neural network model as input data.
Step 205a2, the image processing device performs degradation processing on the M high resolution image frames to obtain M low resolution image frames.
Wherein one high resolution image frame corresponds to one low resolution image frame.
For example, the convolutional neural network model may process the M low-resolution image frames to obtain image frames with a higher resolution, compare these image frames with the M high-resolution image frames, and then iterate the convolutional neural network through a back-propagation algorithm based on the comparison result. The parameters of the convolutional neural network model are adjusted and optimized in each iteration.
In step 205a3, the image processing apparatus sets the M high-resolution image frames and the M low-resolution image frames as a set of fourth image frames, and generates K sets of fourth image frames.
Illustratively, after the image processing apparatus splits the target video into a sequence of frames, M high resolution image frames are obtained. And then, performing degradation processing on the M high-resolution image frames to obtain M low-resolution image frames. Thus, the M low-resolution image frames plus the M high-resolution image frames form a group of fourth image frames, and K groups of fourth image frames are obtained.
Each of the K groups of fourth image frames includes M high-resolution image frames and M low-resolution image frames, and K and M are positive integers.
It can be understood that, since the original video is a high-resolution video, the M high-resolution image frames need to be subjected to image degradation processing in addition to the retention of the M high-resolution image frames, so as to obtain M low-resolution image frames with lower resolution. And pairing the M high-resolution image frames and the M low-resolution image frames to be used as a training sample and adding the training sample into a training sample set.
For example, after acquiring the training sample set generated by the K sets of fourth image frames, the image processing apparatus may train the target convolutional neural network model using the training sample set, and obtain the target model.
Step 205a4, the image processing device trains the target convolutional neural network model by using the K groups of fourth image frames as a training set to obtain the target model.
Illustratively, after obtaining a training set including K sets of fourth image frames based on the methods in the above-mentioned steps 205a1 to 205a3, the image processing apparatus trains the structure-optimized target convolutional neural network model using the training set, and obtains the above-mentioned target model after iterating a preset number of times.
Illustratively, the image processing apparatus may train the target neural network model based on the following steps. Specifically, in the step 205a4, the image processing method provided in the embodiment of the present application may further include the following steps 206a1 to 206a3:
Step 206a1, the image processing apparatus converts the M low-resolution image frames in the target low-resolution image frames into a first target image.
Wherein the first target image includes: an eight-channel image; the target low-resolution image frames are any one group of the K groups of fourth image frames.
For example, the fourth image frames may be three-channel (RGB) image frames, and in this case the image processing apparatus converts the M low-resolution image frames from three-channel image frames into an eight-channel image frame.
In step 206a2, the image processing apparatus extracts the depth feature of the first target image.
For example, the image processing apparatus may extract the depth feature of the first target image through the feature extraction layer.
In step 206a3, the image processing apparatus generates a second target image based on the depth feature.
Illustratively, the image processing apparatus generates the second target image through the output layer of the above-described target model. The output layer comprises a convolution layer and a sub-pixel layer, and can convert the feature map extracted by the feature extraction layer into a three-channel high-definition image.
Illustratively, the M high-resolution image frames corresponding to the M low-resolution image frames in the target image frames are used to verify the second target image. The M low-resolution image frames in the target image frames are the image frames obtained after image degradation processing is performed on those corresponding M high-resolution image frames.
Illustratively, the M high resolution image frames may be adjacent M image frames in the original video.
For example, for a video in the aforementioned Vimeo90K dataset, the image processing apparatus may extract every 7 consecutive image frames in the video and combine the 7 image frames into a short video. Then, image degradation processing is performed on the short video to generate a degraded low-resolution video. Further, the low-resolution video is split to obtain 7 low-resolution image frames. Meanwhile, the image processing device can also directly split the short video to obtain 7 high-resolution image frames.
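Illustratively, the following is a minimal sketch of building one such training pair, assuming bicubic 4x downscaling as the degradation operation (the embodiment does not fix a particular degradation method) and degrading the frames directly rather than re-encoding a short video; the helper name make_training_pair is illustrative only.

```python
# Minimal sketch of building one training sample from a 7-frame Vimeo90K clip.
# Assumption: bicubic 4x downscaling stands in for the image degradation step,
# which the embodiment leaves open; make_training_pair is an illustrative helper.
import torch
import torch.nn.functional as F

def make_training_pair(hr_frames: torch.Tensor):
    """hr_frames: (7, 3, H, W) tensor of consecutive high-resolution frames."""
    # Degrade the high-resolution frames to obtain the paired low-resolution frames.
    lr_frames = F.interpolate(hr_frames, scale_factor=0.25,
                              mode="bicubic", align_corners=False)
    # One group of "fourth image frames": M low-resolution inputs + M high-resolution targets.
    return lr_frames, hr_frames
```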
It is to be noted that image quality is degraded during the formation, recording, processing and transmission of images due to imperfections in the imaging system, recording apparatus, transmission medium and processing method, a phenomenon known as image degradation. The image degradation operation described above in the embodiments of the present application is performed to obtain the M low-resolution image frames.
Illustratively, after K groups of fourth image frames are obtained through the above processing, the convolutional neural network model subjected to structure optimization is trained by using the K groups of fourth image frames.
As shown in fig. 2, the structure of the structure-optimized target convolutional neural network model is as follows: the 7 three-channel low-resolution images 30 input to the image processing apparatus are first converted into an eight-channel image by a convolution layer 31. The eight-channel image then passes through 16 residual blocks 32 (residual blocks 1 to 16) in sequence; each residual block is formed internally by stacking a 3x3 convolution layer, a ReLU activation layer and another 3x3 convolution layer, the convolutions keep the number of channels unchanged, and each residual block adds its input to its tail output so as to extract the depth features of the input image. Finally, a three-channel high-resolution image 34 is output through a convolution layer 33 and a sub-pixel convolution layer 34.
It should be noted that the image input into the target convolutional neural network model is a three-channel image, the convolution 31 includes 8 convolution kernels, and the three-channel image is converted into an eight-channel image after passing through the 8 convolution kernels.
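For ease of understanding, the following is a minimal PyTorch sketch of the structure described above, following the early-fusion computation (the 7 RGB frames spliced into a 21-channel input, one 8-kernel convolution, 16 residual blocks, and a convolution plus sub-pixel output layer); the kernel sizes, padding and class names are assumptions where the text does not state them, so this is a sketch rather than the patented implementation itself.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """3x3 conv -> ReLU -> 3x3 conv, channels unchanged, input added to the tail output."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class VideoSRNet(nn.Module):
    """Input layer (8 kernels) -> 16 residual blocks -> conv + sub-pixel output layer."""
    def __init__(self, in_frames: int = 7, channels: int = 8, scale: int = 4):
        super().__init__()
        # Input layer: the 7 spliced RGB frames (21 channels) -> 8-channel feature map.
        self.head = nn.Conv2d(3 * in_frames, channels, kernel_size=3, padding=1)
        # Feature extraction layer: 16 stacked residual blocks.
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(16)])
        # Output layer: conv to 3 * scale^2 = 48 channels, then sub-pixel upscaling.
        self.tail = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, frames):            # frames: (B, 7, 3, H, W) low-resolution group
        x = frames.flatten(1, 2)          # early fusion: splice to (B, 21, H, W)
        x = self.head(x)                  # (B, 8, H, W)
        x = self.body(x)                  # (B, 8, H, W) depth features
        return self.tail(x)               # (B, 3, 4H, 4W) high-resolution frame
```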
For example, in the training process of the model, the model may be further optimized by using an optimizer. Before the step 202, the image processing method provided in the embodiment of the present application may further include the following step 204a:
Step 204a, the image processing device iterates the target convolutional neural network, updates the model parameters of the target convolutional neural network, and obtains the target model after iterating a preset number of times.
Illustratively, the image processing device may iterate the target convolutional neural network using a back propagation algorithm.
Illustratively, the training process is optimized by using an Adam optimizer with its parameters taking default values; the training data are input in sequence and back-propagated, and the model parameters are updated. After 2,000,000 iterations, training is stopped and the model parameters are saved to obtain the target model.
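A hedged sketch of this training procedure is given below, reusing the VideoSRNet sketch above; the L1 loss and the choice of the reference (middle) high-resolution frame as the target are assumptions, since the text only specifies the Adam optimizer with default parameters, back-propagation and 2,000,000 iterations, and sample_group is a placeholder data source.

```python
import torch
import torch.nn.functional as F

model = VideoSRNet()                                   # sketch from the previous section
optimizer = torch.optim.Adam(model.parameters())       # optimizer parameters take default values
criterion = torch.nn.L1Loss()                          # assumed loss; not specified in the text

def sample_group():
    """Placeholder for one group of fourth image frames (illustrative only)."""
    hr = torch.rand(1, 7, 3, 256, 448)                 # 7 high-resolution frames
    lr = F.interpolate(hr.flatten(0, 1), scale_factor=0.25,
                       mode="bicubic", align_corners=False).unsqueeze(0)
    return lr, hr

for step in range(2_000_000):                          # stop after 2,000,000 iterations
    lr_frames, hr_frames = sample_group()
    sr = model(lr_frames)                              # (1, 3, 256, 448) super-resolved frame
    target = hr_frames[:, 3]                           # assumed target: the reference (middle) frame
    loss = criterion(sr, target)
    optimizer.zero_grad()
    loss.backward()                                    # back-propagation
    optimizer.step()                                   # update the model parameters

torch.save(model.state_dict(), "target_model.pth")     # save the model parameters
```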
For ease of understanding, how the above-described target model reduces the amount of computation when processing an image will now be described. Taking the generation of a 4-fold super-resolution image as an example, the input image frame 1 is 3 × H × W, where 3 is the number of channels of the image frame, and H and W are its height and width, respectively.
For the input layer: after the image frame 1 is processed by the input layer, the 7 frames of 3 × H × W images are spliced into a 21 × H × W image frame. In the related art, the input layer generally includes 64 convolution kernels, and the amount of calculation required is large.
It should be noted that, in the related art, after an image passes through the input layer, multiple frames of images need to be aligned at the pixel level and then spliced. In the method of the present application, after the 8 convolution kernels, the 7 image frames can be directly spliced into one image frame without pixel-level alignment (the splicing process only needs simple alignment based on the vertices or the diagonal of the image frames).
For the feature extraction layer: the 21 × H × W image frame output by the input layer is firstly subjected to the first layer of convolution (including 8 convolution kernels of 21 × H × W) and the second layer of convolution (including 8 convolution kernels of 8 × H × W), and then 16 residual blocks (each including 2 convolution kernels) are repeated in the middle, so that an 8 × H × W image frame is finally obtained.
It should be noted that, in the related art, since the image processing model is generally provided on an electronic device with a high processing capability, the depth of its feature extraction layer is generally shallow. In the present application, the feature extraction layer is deeper, so that feature extraction can be performed on the image even with low computing power.
For the output layer: the 8 × H × W image frames obtained by the feature extraction layer are subjected to a convolution operation (including a convolution kernel of 48 × H × W), and then subjected to a PixelShuffle operation to obtain image frames of 3 × 4H × 4W, so as to realize 4-time super-resolution.
It should be noted that, in the related art, the output layer needs to perform a two-step progressive upscaling operation (requiring processing by multiple convolution kernels) to enlarge the feature map extracted by the feature extraction layer, which results in a large amount of computation. In the present application, only one convolution operation is needed, the number of channels is fixed to 48, and the calculation process is simplified.
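As a quick check of the channel arithmetic, 48 = 3 (RGB) x 4 x 4, so a single convolution to 48 channels followed by one PixelShuffle(4) yields the 3 x 4H x 4W output in one step; the snippet below is only a shape check under the dimensions discussed above, not the patented implementation itself.

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 8, 360, 640)                    # 8 x H x W features for a 360p frame
out = nn.PixelShuffle(4)(nn.Conv2d(8, 48, 3, padding=1)(feat))
print(out.shape)                                      # torch.Size([1, 3, 1440, 2560]) = 3 x 4H x 4W
```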
It should be noted that, in the related art, because the image processing model has a higher computational complexity and cannot be generally applied to an electronic device with a lower processing capability, the embodiment of the present application reduces the complexity of the model by simplifying the model structure, and reduces the amount of computation required by the model when processing an image on the premise of ensuring the processing efficiency, so that the model can be applied to an electronic device with a lower processing capability.
Therefore, after the structure of the convolutional neural network is optimized and the training strategy is optimized, the calculated amount of the model can be reduced, the generalization performance of the model is improved, and the electronic equipment with low image processing capacity can have the capacity of performing real-time super-resolution processing on the video after using the model.
Optionally, in this embodiment of the application, after the target convolutional neural network model is trained and a target model is obtained, the electronic device with lower image processing capability may perform super-resolution processing on the video in real time based on the target model.
Illustratively, after the image processing apparatus splits the first video based on the time sequence of each image frame of the first video, N first image frames are obtained. Then, the image processing device combines each of the N first image frames with a plurality of adjacent first image frames to obtain a second image frame including a plurality of first image frames.
For example, after splitting the first video according to the time sequence of each image frame, the image processing apparatus obtains N first image frames, and combines each image frame with the three frames before it and the three frames after it into a group of 7 image frames as one input to the target model. For the head and tail portions of the first video, a mirror filling method may be used to obtain 7 image frames; for example, for the image frame at timing 0 of the first video, the image frames at timings 1 to 3 may be padded before it. Each group of 7 image frames is then processed by the target model to obtain 1 processed high-resolution image frame, as sketched below.
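Illustratively, the grouping just described can be sketched as follows, with mirror filling at the head and tail of the video; the helper name make_windows and the commented model call are illustrative assumptions rather than the patented implementation.

```python
import torch

def make_windows(frames, m=7):
    """frames: list of N decoded first image frames, each a (3, H, W) tensor."""
    n, half = len(frames), m // 2
    windows = []
    for i in range(n):
        group = []
        for d in range(-half, half + 1):
            j = i + d
            if j < 0:                      # mirror filling at the head of the video
                j = -j
            elif j >= n:                   # mirror filling at the tail of the video
                j = 2 * (n - 1) - j
            group.append(frames[j])
        windows.append(torch.stack(group)) # (7, 3, H, W): one input group for the target model
    return windows

# Each 7-frame group then yields one high-resolution frame, e.g.:
# sr_frame = model(windows[i].unsqueeze(0))   # (1, 3, 4H, 4W)
```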
It should be noted that, in the embodiment of the present application, early fusion, a narrow-and-deep network design, fast upsampling and other measures are adopted, so that the calculation amount of the existing model is greatly reduced, the model structure is optimized, and the depth model can achieve the effect of performing real-time super-resolution on a 360p video at the mobile end. Moreover, by simulating the degradation modes of videos of real scenes such as human faces, landscapes, food and text, the generalization performance of the model is improved, and the image quality of the videos after super-resolution is effectively enhanced.
It should be noted that, the specific implementation process of the above step 202 and step 203 may refer to the description of the training process of the target convolutional neural network.
The image processing method provided by the embodiment of the application provides a depth model, the model adopts the modes of early fusion, narrow and deep network design, rapid upsampling and the like to greatly reduce the calculated amount of the existing model, and the depth model can achieve the effect of performing real-time super-resolution on a 360p video on electronic equipment with lower image processing capacity by optimizing the model structure, so that the definition of playing the video is greatly improved.
It should be noted that, in the image processing method provided in the embodiment of the present application, the execution subject may be an image processing apparatus, or a control module in the image processing apparatus for executing the image processing method. The image processing apparatus provided in the embodiment of the present application is described with an example in which an image processing apparatus executes an image processing method.
The methods described above are illustrated in the drawings of the embodiments of the present application, and the image processing method is exemplarily described with reference to one of those drawings. In specific implementation, the image processing methods shown in the above method drawings may also be implemented in combination with any other combinable drawing illustrated in the above embodiments, and details are not repeated here.
Fig. 3 is a schematic diagram of a possible structure of an image processing apparatus for implementing the embodiment of the present application. As shown in fig. 3, the image processing apparatus 600 includes: an extraction module 601, a processing module 602, a generation module 603 and a synthesis module 604. The extraction module 601 is configured to extract M adjacent first image frames from the N first image frames as one second image frame, to obtain N second image frames, where N and M are positive integers and M is less than N. The processing module 602 is configured to perform channel conversion on the target second image frame extracted by the extraction module 601 and extract depth features of the converted image frame, where the target second image frame is any one of the N second image frames. The generating module 603 is configured to generate third image frames based on the depth features of the image frames converted by the processing module 602, where one second image frame corresponds to one third image frame. The synthesizing module 604 is configured to synthesize the N third image frames generated by the generating module 603 into a second video according to the time sequence of the N first image frames extracted by the extraction module 601. Each first image frame corresponds to one third image frame, and the third image frame has a higher resolution than the first image frame.
Optionally, the processing module 602 is specifically configured to convert the target second image frame extracted by the extraction module 601 into an eight-channel image frame.
Optionally, the processing module 602 is specifically configured to perform channel conversion on the target second image frame through the target model, and extract depth features of the converted image frame; the generating module 603 is specifically configured to generate a third image frame according to the depth feature of the image frame converted by the processing module 602 through the target model.
Optionally, the apparatus 600 further comprises: a training module 605. The extracting module 601 is further configured to extract M high-resolution image frames from a target video of a training sample set including a plurality of high-definition videos, where the target video is at least one sample in the training sample set. The processing module 602 is further configured to perform degradation processing on the M high-resolution image frames extracted by the extraction module 601 to obtain M low-resolution image frames, where one high-resolution image frame corresponds to one low-resolution image frame; the processing module 602 is further configured to use the M high-resolution image frames and the M low-resolution image frames as a group of fourth image frames, and to generate K groups of fourth image frames. The training module 605 is configured to train the target convolutional neural network model by using the K groups of fourth image frames generated by the processing module as a training set, so as to obtain the target model. Each group of the K groups of fourth image frames comprises M high-resolution image frames and M low-resolution image frames, and K and M are positive integers.
Optionally, the training module 605 is specifically configured to convert the M low-resolution image frames in the target low-resolution image frames from three-channel images into a first target image, where the first target image includes: an eight-channel image; the training module 605 is specifically configured to extract depth features of the first target image; and the training module 605 is specifically configured to generate a second target image based on the depth features. The target low-resolution image frames are any one group of the K sets of fourth image frames; the M high-resolution image frames corresponding to the M low-resolution image frames in the target image frames are used for verifying the second target image; and the M low-resolution image frames in the target image frames are the image frames obtained after image degradation processing is carried out on those corresponding M high-resolution image frames.
Optionally, the training module 605 is further configured to iterate the target convolutional neural network, update the model parameters of the target convolutional neural network, and obtain the target model after iterating for a preset number of times.
The image processing apparatus in the embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The image processing apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system (Android), an iOS operating system, or other possible operating systems, which is not specifically limited in the embodiments of the present application.
The image processing apparatus provided in the embodiment of the present application can implement each process implemented by the image processing apparatus in the method embodiments of fig. 1 to fig. 2, and is not described herein again to avoid repetition.
The beneficial effects of the various implementation manners in this embodiment may specifically refer to the beneficial effects of the corresponding implementation manners in the above method embodiments, and are not described herein again to avoid repetition.
The image processing device provided by the embodiment of the application provides a depth model, the model adopts the modes of early fusion, narrow and deep network design, quick upsampling and the like to greatly reduce the calculated amount of the existing model, and the depth model can achieve the effect of performing real-time super-resolution on a 360p video on an electronic device with lower image processing capacity by optimizing the model structure, so that the definition of playing the video is greatly improved.
Optionally, as shown in fig. 4, an electronic device M00 is further provided in an embodiment of the present application, and includes a processor M01, a memory M02, and a program or an instruction stored in the memory M02 and executable on the processor M01, where the program or the instruction when executed by the processor M01 implements the processes of the foregoing embodiment of the image processing method, and can achieve the same technical effects, and details are not repeated here to avoid repetition.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and the non-mobile electronic devices described above.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device implementing various embodiments of the present application.
The electronic device 100 includes, but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, and a processor 110.
Those skilled in the art will appreciate that the electronic device 100 may further comprise a power source (e.g., a battery) for supplying power to various components, and the power source may be logically connected to the processor 110 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 5 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.
The input unit 104 is configured to extract M adjacent first image frames from the N first image frames as one second image frame, so as to obtain N second image frames, where N and M are positive integers and M is less than N. The processor 110 is configured to perform channel conversion on a target second image frame and extract depth features of the converted image frame, where the target second image frame is any one of the N second image frames; the processor 110 is configured to generate third image frames based on the depth features of the converted image frames, where one second image frame corresponds to one third image frame; the processor 110 is configured to synthesize the N third image frames into a second video according to the time sequence of the N first image frames. Each first image frame corresponds to one third image frame, and the third image frame has a higher resolution than the first image frame.
Therefore, when a user plays a video by using the electronic equipment, N second image frames are obtained by extracting M adjacent first image frames from the N first image frames as one second image frame; and then, respectively carrying out channel conversion on each second image frame in the N second image frames, extracting the depth characteristics of the converted image frames, carrying out super-resolution processing on the image frames of the original video to obtain image frames with higher definition, and finally synthesizing the image frames with higher resolution into a high-resolution video according to the time sequence of the original low-resolution image frames. The electronic equipment with lower image processing capacity can utilize the target model to perform super-resolution processing on the video with lower definition, and play the processed video in real time, so that the video super-resolution processing capacity of the electronic equipment with lower image processing capacity is greatly improved.
Optionally, the processor 110 is specifically configured to perform channel conversion on the target second image frame through the target model, and extract depth features of the converted image frame; the processor 110 is specifically configured to generate a third image frame according to the depth feature of the converted image frame through the target model.
Optionally, the input unit 104 is further configured to extract M high-resolution image frames from a target video of a training sample set including a plurality of high-definition videos, where the target video is at least one sample in the training sample set; the processor 110 is further configured to perform degradation processing on the extracted M high-resolution image frames to obtain M low-resolution image frames, where one high-resolution image frame corresponds to one low-resolution image frame; the processor 110 is further configured to generate K sets of fourth image frames by using the M high-resolution image frames and the M low-resolution image frames as a set of fourth image frames; the processor 110 is configured to train the target convolutional neural network model by using the K groups of fourth image frames as a training set to obtain the target model. Each group of the K groups of fourth image frames comprises M high-resolution image frames and M low-resolution image frames, and K and M are positive integers.
Optionally, the processor 110 is configured to convert the M low-resolution image frames of the target low-resolution image frames from three-channel images into a first target image, where the first target image includes: an eight-channel image; the processor 110 is further configured to extract depth features of the first target image; and the processor 110 is further configured to generate a second target image based on the depth features. The target low-resolution image frames are any one group of the K sets of fourth image frames; the M high-resolution image frames corresponding to the M low-resolution image frames in the target image frames are used for verifying the second target image; and the M low-resolution image frames in the target image frames are the image frames obtained after image degradation processing is carried out on those corresponding M high-resolution image frames.
Therefore, after the structure of the convolutional neural network is optimized and the training strategy is optimized, the calculated amount of the model can be reduced, the generalization performance of the model is improved, and the electronic equipment with low image processing capacity can have the capacity of performing real-time super-resolution processing on the video after using the model.
Optionally, the processor 110 is further configured to iterate the target convolutional neural network, update a model parameter of the target convolutional neural network, and obtain the target model after iterating for a preset number of times.
Therefore, the electronic equipment part with lower image processing capacity can convert the low-resolution video into the high-resolution video in real time and output the high-resolution video by adopting the trained model.
The electronic equipment provided by the embodiment of the application provides a depth model, the model adopts the modes of early fusion, narrow and deep network design, quick upsampling and the like to greatly reduce the calculated amount of the existing model, and the depth model can achieve the effect of performing real-time super-resolution on a 360p video on the electronic equipment with lower image processing capacity by optimizing the model structure, so that the definition of playing the video is greatly improved.
It should be understood that, in the embodiment of the present application, the input Unit 104 may include a Graphics Processing Unit (GPU) 1041 and a microphone 1042, and the Graphics Processing Unit 1041 processes image data of a still picture or a video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 107 includes a touch panel 1071 and other input devices 1072. The touch panel 1071 is also referred to as a touch screen. The touch panel 1071 may include two parts of a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 109 may be used to store software programs as well as various data including, but not limited to, application programs and an operating system. The processor 110 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 110.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the embodiment of the image processing method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the embodiment of the image processing method, and can achieve the same technical effect, and the details are not repeated here to avoid repetition.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a chip system, or a system-on-a-chip.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed; the functions may also be performed in a substantially simultaneous manner or in a reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling an electronic device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (12)
1. An image processing method, characterized in that the method comprises:
extracting M adjacent first image frames from N first image frames to serve as one second image frame, so as to obtain N second image frames; wherein N and M are positive integers, and M is less than N;
performing channel conversion on a target second image frame, and extracting the depth feature of the converted image frame, wherein the target second image frame is any one of the N second image frames;
generating third image frames based on the depth features of the converted image frames, wherein one second image frame corresponds to one third image frame;
synthesizing the N third image frames into a second video according to the time sequence of the N first image frames;
wherein each first image frame has a corresponding third image frame; and the resolution of the third image frame is higher than that of the first image frame.
2. The method of claim 1, wherein the performing channel conversion on the target second image frame comprises:
converting the target second image frame into an eight-channel image frame.
3. The method according to claim 1 or 2, wherein the performing channel conversion on the target second image frame and extracting the depth feature of the converted image frame comprises:
performing channel conversion on the target second image frame through a target model, and extracting the depth feature of the converted image frame;
and the generating a third image frame based on the depth feature of the converted image frame comprises:
generating, through the target model, a third image frame according to the depth feature of the converted image frame.
4. The method of claim 3,
wherein before the performing channel conversion on the target second image frame and extracting the depth feature of the converted image frame, the method further comprises:
extracting M high-resolution image frames from a target video of a training sample set containing a plurality of high-definition videos; the target video is at least one sample in the training sample set;
performing degradation processing on the M high-resolution image frames to obtain M low-resolution image frames, wherein one high-resolution image frame corresponds to one low-resolution image frame;
taking the M high-resolution image frames and the M low-resolution image frames as one group of fourth image frames, and generating K groups of fourth image frames;
training a target convolutional neural network model by using the K groups of fourth image frames as a training set to obtain the target model;
each of the K sets of fourth image frames includes M high resolution image frames and M low resolution image frames, and K and M are positive integers.
5. The method of claim 4, wherein training a target convolutional neural network model using the K sets of fourth image frames as a training set to obtain the target model comprises:
converting M low-resolution image frames in the target low-resolution image frames into a first target image, wherein the first target image comprises an eight-channel image;
extracting depth features of the first target image;
generating a second target image based on the depth feature;
wherein the target low-resolution image frames are any one of the K sets of fourth image frames; the M high-resolution image frames corresponding to the M low-resolution image frames in the target image frames are used to verify the second target image; and the M low-resolution image frames in the target image frames are image frames obtained by performing image degradation processing on the M high-resolution image frames corresponding to them.
6. An image processing apparatus, characterized in that the apparatus comprises: the device comprises an extraction module, a processing module, a generation module and a synthesis module;
the extraction module is used for extracting M adjacent first image frames from N first image frames to serve as one second image frame, so as to obtain N second image frames; wherein N and M are positive integers, and M is less than N;
the processing module is configured to perform channel conversion on the target second image frame extracted by the extraction module, and extract depth features of the converted image frame, where the target second image frame is any one of the N second image frames;
the generating module is used for generating third image frames based on the depth features of the image frames converted by the processing module, and one second image frame corresponds to one third image frame;
the synthesis module is used for synthesizing the N third image frames generated by the generation module into a second video according to the time sequence of the N first image frames extracted by the extraction module;
wherein each first image frame has a corresponding third image frame; and the resolution of the third image frame is higher than that of the first image frame.
7. The apparatus of claim 6,
the processing module is specifically configured to convert the target second image frame extracted by the extraction module into an eight-channel image frame.
8. The apparatus according to claim 6 or 7,
the processing module is specifically used for performing channel conversion on the target second image frame through the target model and extracting the depth features of the converted image frame;
the generating module is specifically configured to generate, by using the target model, a third image frame according to the depth feature of the image frame converted by the processing module.
9. The apparatus of claim 8, further comprising: a training module;
the extraction module is further configured to extract M high-resolution image frames from a target video of a training sample set including a plurality of high-definition videos, where the target video is at least one sample in the training sample set;
the processing module is further configured to perform degradation processing on the M high-resolution image frames extracted by the extraction module to obtain M low-resolution image frames, where one high-resolution image frame corresponds to one low-resolution image frame;
the processing module is further configured to use the M high-resolution image frames and the M low-resolution image frames as a group of fourth image frames, and generate K groups of fourth image frames;
the training module is used for training a target convolutional neural network model by taking the K groups of fourth image frames generated by the processing module as a training set to obtain the target model;
each of the K sets of fourth image frames includes M high-resolution image frames and M low-resolution image frames, and K and M are positive integers.
10. The apparatus of claim 9,
the training module is specifically configured to convert M low-resolution image frames in the target low-resolution image frames from three-channel images into a first target image, where the first target image includes: an eight-channel image;
the training module is specifically used for extracting the depth feature of the first target image;
the training module is specifically configured to generate a second target image based on the depth feature;
wherein the target low-resolution image frames are any one of the K sets of fourth image frames; the M high-resolution image frames corresponding to the M low-resolution image frames in the target image frames are used to verify the second target image; and the M low-resolution image frames in the target image frames are image frames obtained by performing image degradation processing on the M high-resolution image frames corresponding to them.
11. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, which when executed by the processor, implement the steps of the image processing method according to any one of claims 1 to 5.
12. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the image processing method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110604628.2A CN113362224B (en) | 2021-05-31 | 2021-05-31 | Image processing method, device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113362224A true CN113362224A (en) | 2021-09-07 |
CN113362224B CN113362224B (en) | 2024-08-16 |
Family
ID=77530578
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110604628.2A Active CN113362224B (en) | 2021-05-31 | 2021-05-31 | Image processing method, device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113362224B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090290811A1 (en) * | 2008-05-23 | 2009-11-26 | Samsung Electronics Co., Ltd. | System and method for generating a multi-dimensional image |
CN106471804A (en) * | 2014-07-04 | 2017-03-01 | 三星电子株式会社 | Method and device for picture catching and depth extraction simultaneously |
CN107944416A (en) * | 2017-12-06 | 2018-04-20 | 成都睿码科技有限责任公司 | A kind of method that true man's verification is carried out by video |
CN110381268A (en) * | 2019-06-25 | 2019-10-25 | 深圳前海达闼云端智能科技有限公司 | method, device, storage medium and electronic equipment for generating video |
US20200193566A1 (en) * | 2018-12-12 | 2020-06-18 | Apical Limited | Super-resolution image processing |
CN111654723A (en) * | 2020-05-14 | 2020-09-11 | 北京百度网讯科技有限公司 | Video quality improving method and device, electronic equipment and storage medium |
WO2021017811A1 (en) * | 2019-07-26 | 2021-02-04 | Oppo广东移动通信有限公司 | Image processing method and apparatus, electronic device, and computer readable storage medium |
CN112330541A (en) * | 2020-11-11 | 2021-02-05 | 广州博冠信息科技有限公司 | Live video processing method and device, electronic equipment and storage medium |
WO2021088300A1 (en) * | 2019-11-09 | 2021-05-14 | 北京工业大学 | Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network |
Also Published As
Publication number | Publication date |
---|---|
CN113362224B (en) | 2024-08-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||