CN114638772A - Video processing method, device and equipment

Info

Publication number: CN114638772A
Application number: CN202210273291.6A
Authority: CN (China)
Prior art keywords: video frame, video, image, feature map, previous
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 磯部駿, 陶鑫, 戴宇荣
Current and original assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd

Classifications

    • G06T5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06T3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T9/002: Image coding using neural networks
    • H04N7/0127: Conversion of standards by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20221: Image fusion; Image merging


Abstract

The application discloses a video processing method, apparatus and device, applied to the field of image technology. When super-resolution processing needs to be performed on a video, the technical solution provided by the embodiments of the application can decouple different types of motion regions in the video frames adjacent before and after a video frame based on the motion conditions in the video frame, and perform the corresponding video feature fusion and related processing separately for each type of motion region. This omits the motion estimation and compensation steps based on pixel-by-pixel operations, simplifies the spatial complexity of the network, and saves video processing time. Because features are extracted separately for the different types of motion regions, the extracted features are more accurate, treating all regions identically is effectively avoided, inaccurate motion estimation of fast-motion or occluded regions does not occur, artifacts in the video are thereby avoided, and the clarity of the video is greatly improved.

Description

Video processing method, device and equipment
Technical Field
The present application relates to the field of image technologies, and in particular, to a video processing method, apparatus and device.
Background
In a video processing task, a video super-resolution technology is very important, and the purpose is to convert a low-resolution video into a high-resolution video. In the video processing process, information missing during resolution conversion needs to be filled, and the current mainstream method is to perform motion estimation and compensation on a video by using a convolutional neural network so as to realize video super-resolution.
In a video processing method in the related art, motion estimation and compensation are performed on a video using the optical flow between adjacent video frames. For a given video frame, an adjacent video frame is first stitched with the video frame along the color channels, the stitched image is upsampled twice to obtain low-, medium- and high-resolution images, the images at the three resolutions are input into a convolutional neural network, and an optical flow map between the adjacent video frame and the video frame is extracted by the convolutional neural network. The adjacent video frame is then aligned to the video frame based on the optical flow map. This operation is performed on both the previous and the subsequent adjacent video frames of the video frame, and finally the aligned previous and subsequent adjacent video frames are fused with the video frame to obtain the video frame with improved resolution. The above operations are performed on a plurality of video frames to obtain the processed video.
However, with the above video processing method, both the extraction of the optical flow map and the alignment based on it are performed pixel by pixel, making the whole process time-consuming. Moreover, optical flow estimation is inaccurate when adjacent frames contain occlusion or large motion, and warping pixels based on inaccurate optical flow produces artifacts, resulting in unclear video pictures.
Disclosure of Invention
The embodiments of the application provide a video processing method, a video processing apparatus and a video processing device, which can save video processing time, avoid artifacts in the video and greatly improve the clarity of the video. The technical solution is as follows:
in one aspect, a video processing method is provided, and the method includes:
determining a first image and a second image of a previous video frame of a video frame of a video based on the video frame and the previous video frame, the first image comprising a region whose motion condition meets a target condition, and the second image comprising a region whose motion condition does not meet the target condition;
Determining a first image and a second image of a next video frame based on the video frame of the video and the next video frame of the video frame;
acquiring a fusion feature map of the video frame based on a first feature map and a second feature map of the video frame, wherein the first feature map is determined based on the video frame, a first image of the previous video frame and a first image of the next video frame, and the second feature map is determined based on the video frame, a second image of the previous video frame and a second image of the next video frame;
and generating a target video of the video based on the fusion feature map of the plurality of video frames of the video and the plurality of video frames.
In some embodiments, the determining the first image and the second image of the previous video frame based on the video frame of the video and the previous video frame of the video frame comprises: acquiring a first residual image of the video frame, wherein the first residual image is a residual between the video frame and a previous video frame of the video frame; determining a pixel value threshold corresponding to the previous video frame based on the first residual image, wherein the pixel value threshold is the pixel value mean value of pixel points in the first residual image; and generating a first image and a second image of the previous video frame based on the pixel value threshold value corresponding to the previous video frame and the first residual map.
In some embodiments, the generating the first image and the second image of the previous video frame based on the pixel value threshold corresponding to the previous video frame and the first residual map comprises: determining, based on the pixel value threshold, a region in the first residual map having a pixel value greater than or equal to the pixel value threshold; generating a mask of the previous video frame based on a region corresponding to the determined region in the previous video frame; and generating the first image and the second image of the previous video frame based on the mask of the previous video frame and the previous video frame.
In some embodiments, the generating the first image and the second image of the previous video frame based on the mask of the previous video frame and the previous video frame comprises: multiplying the mask of the previous video frame with the previous video frame to obtain the first image of the previous video frame; and subtracting the first image of the previous video frame from the previous video frame to obtain the second image of the previous video frame.
In some embodiments, the step of obtaining a fused feature map of the video frame based on the first feature map and the second feature map of the video frame includes: acquiring a first feature map and a second feature map of the video frame; and fusing the first feature map and the second feature map to obtain a fused feature map of the video frame.
In some embodiments, the obtaining of the first feature map of the video frame includes: splicing the video frame, the first image of the previous video frame and the first image of the next video frame to obtain a first spliced video frame of the video frame; inputting a first spliced video frame of the video frame into a first convolutional neural network, and performing feature extraction on the first spliced video frame of the video frame through the first convolutional neural network to obtain a first feature map, wherein the first convolutional neural network is obtained by adopting training of a first sample video frame, and the first sample video frame comprises an area with a motion condition meeting the target condition.
In some embodiments, the obtaining of the second feature map of the video frame includes: splicing the video frame, the second image of the previous video frame and the second image of the next video frame to obtain a second spliced video frame of the video frame; inputting a second spliced video frame of the video frame into a second convolutional neural network, and performing feature extraction on the second spliced video frame of the video frame through the second convolutional neural network to obtain a second feature map, wherein the second convolutional neural network is obtained by adopting a second sample video frame for training, and the second sample video frame comprises an area with a motion condition which does not accord with the target condition.
In some embodiments, the generating the target video of the video based on the fused feature map of the plurality of video frames of the video and the plurality of video frames comprises: for each video frame in the video, performing up-sampling on the fusion characteristic graph of the video frame to obtain a target characteristic graph; performing upsampling on the video frame to obtain an upsampled video frame corresponding to the video frame; fusing the target feature map and the up-sampling video frame corresponding to the video frame to obtain a target video frame of the video frame; and coding the target video frames corresponding to the plurality of video frames to generate a target video.
In some embodiments, the video processing method further comprises: performing scene-change detection based on the acquired video frame, and performing the processing of the video frame if it is detected that no scene change occurs in the video frame.
In one aspect, a video processing apparatus is provided, the apparatus including:
a determining unit configured to perform, based on a video frame of a video and a previous video frame of the video frame, determining a first image and a second image of the previous video frame, the first image including a region whose motion condition meets a target condition, the second image including a region whose motion condition does not meet the target condition;
the determining unit is further configured to perform determining a first image and a second image of a subsequent video frame based on a video frame of the video and the subsequent video frame of the video frame;
a fusion unit configured to perform obtaining a fusion feature map of the video frame based on a first feature map and a second feature map of the video frame, the first feature map being determined based on the video frame, a first image of the previous video frame and a first image of the next video frame, the second feature map being determined based on the video frame, a second image of the previous video frame and a second image of the next video frame;
a generating unit configured to perform generating a target video of the video based on the fused feature map of a plurality of video frames of the video and the plurality of video frames.
In some embodiments, the determining unit comprises: a first obtaining subunit configured to perform obtaining a first residual map of the video frame, the first residual map being a residual between the video frame and a video frame previous to the video frame; a determining subunit, configured to determine, based on the first residual image, a pixel value threshold corresponding to the previous video frame, where the pixel value threshold is a pixel value mean of a pixel point in the first residual image; and the generating subunit is configured to generate a first image and a second image of the previous video frame based on the pixel value threshold corresponding to the previous video frame and the first residual map.
In some embodiments, the generating subunit comprises: a first determining subunit configured to perform, based on the pixel value threshold, determining a region in the first residual map having a pixel value greater than or equal to the pixel value threshold; a first generating subunit configured to perform generating a mask of the previous video frame based on a region corresponding to the determined region in the previous video frame; a second generating subunit configured to perform generating the first image and the second image of the previous video frame based on the mask of the previous video frame and the previous video frame.
In some embodiments, the second generating subunit is configured to perform multiplying the mask of the previous video frame with the previous video frame to obtain the first image of the previous video frame; and subtracting the first image of the previous video frame from the previous video frame to obtain the second image of the previous video frame.
In some embodiments, the fusion unit comprises: a second acquiring subunit configured to perform acquiring the first feature map and the second feature map of the video frame; and the fusion subunit is configured to perform fusion on the first feature map and the second feature map to obtain a fusion feature map of the video frame.
In some embodiments, the second obtaining subunit is configured to perform stitching the video frame, the first image of the previous video frame, and the first image of the subsequent video frame to obtain a first stitched video frame of the video frame; inputting a first spliced video frame of the video frame into a first convolutional neural network, and performing feature extraction on the first spliced video frame of the video frame through the first convolutional neural network to obtain a first feature map, wherein the first convolutional neural network is obtained by adopting a first sample video frame training, and the first sample video frame comprises an area with a motion condition meeting the target condition.
In some embodiments, the second obtaining subunit is further configured to perform stitching the video frame, the second image of the previous video frame, and the second image of the subsequent video frame to obtain a second stitched video frame of the video frame; inputting a second spliced video frame of the video frame into a second convolutional neural network, and performing feature extraction on the second spliced video frame of the video frame through the second convolutional neural network to obtain a second feature map, wherein the second convolutional neural network is obtained by adopting a second sample video frame for training, and the second sample video frame comprises an area with a motion condition which does not accord with the target condition.
In some embodiments, the generating unit is configured to perform, for each video frame in the video, upsampling the fusion feature map of the video frame to obtain a target feature map; performing upsampling on the video frame to obtain an upsampled video frame corresponding to the video frame; fusing the target feature map and the up-sampling video frame corresponding to the video frame to obtain a target video frame of the video frame; and coding the target video frames corresponding to the plurality of video frames to generate a target video.
In some embodiments, the video processing apparatus performs the processing of the video frame by the determining unit, the fusion unit and the generating unit when scene-change detection is performed based on the acquired video frame and no scene change is detected in the video frame.
In one aspect, a computer device is provided, the computer device comprising:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the video processing method described above.
In one aspect, a computer-readable storage medium is provided, storing program code which, when executed by a processor of a server, causes the server to perform the above-described video processing method.
In an aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described video processing method.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of a video processing method according to an embodiment of the present application;
fig. 2 is a flowchart of a video processing method according to an embodiment of the present application;
fig. 3 is a flowchart of a video processing method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a first convolutional neural network provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an upsampling module provided in an embodiment of the present application;
fig. 6 is a diagram for verifying the effectiveness of dividing a motion region according to an embodiment of the present application;
fig. 7 is a comparison diagram of video processing effects provided by an embodiment of the present application;
fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, the first image can be referred to as a second image, and similarly, the second image can be referred to as a first image without departing from the scope of various such examples. The first image and the second image can both be images, and in some cases, can be separate and distinct images.
The term "at least one" is used herein to mean one or more, and the term "plurality" is used herein to mean two or more, e.g., a plurality of packets means two or more packets.
It is to be understood that the terminology used in the description of the various examples herein is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of the various examples and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The term "and/or" describes an association between associated objects and indicates that three relationships can exist; for example, "A and/or B" can mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present application generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that, in the embodiments of the present application, the size of the serial number of each process does not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should also be understood that determining B from a does not mean determining B from a alone, but can also determine B from a and/or other information.
It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the term "if" may be interpreted to mean "when," "upon," "in response to determining" or "in response to detecting." Similarly, the phrase "if it is determined" or "if [a stated condition or event] is detected" may be interpreted to mean "upon determining," "in response to determining," "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]," depending on the context.
It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are authorized by the user or sufficiently authorized by various parties, and the collection, use, and processing of the relevant data is required to comply with relevant laws and regulations and standards in relevant countries and regions. For example, video and the like referred to in this application are acquired with sufficient authorization.
Fig. 1 is a schematic diagram of an implementation environment of a video processing method according to an embodiment of the present application, and referring to fig. 1, the implementation environment may include a terminal 101 and a server 102.
The terminal 101 is connected to the server 102 through a wireless network or a wired network. Optionally, the terminal 101 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television, and the like, but is not limited thereto. The terminal 101 is installed and operated with an application program supporting video processing, for example, a plug-in type application, an applet, or other type application.
The server 102 is an independent physical server, or a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The server 102 provides background services for applications running on the terminal 101. Optionally, the server 102 undertakes primary processing and the terminal 101 undertakes secondary processing; or, the server 102 undertakes the secondary processing work, and the terminal 101 undertakes the primary processing work; alternatively, the server 102 or the terminal 101 may be able to separately undertake processing; alternatively, the server 102 and the terminal 101 perform cooperative computing by using a distributed computing architecture.
Those skilled in the art will appreciate that the number of the above-described terminals 101 and servers 102 can be greater or less. For example, there can be only one terminal 101 or one server 102, or several tens or hundreds of terminals 101 or servers 102, or more, and the number of terminals or servers and the type of the device are not limited in the embodiment of the present application.
After the description of the implementation environment of the embodiment of the present application, an application scenario of the embodiment of the present application will be described below with reference to the implementation environment, in the following description, a terminal is also a terminal 101 in the implementation environment, and a server is also a server 102 in the implementation environment.
The technical scheme provided by the embodiment of the application can be applied to a scene of performing super-resolution processing on a video, for example, the super-resolution processing can be performed offline or online through a server.
In the scene of performing super-resolution processing on the video offline, the application program can apply the video processing method to perform super-resolution processing on the uploaded video to obtain the super-resolution video, so that the aim of improving the video resolution is fulfilled.
In the scenario where the server performs super-resolution processing on videos online, the terminal sends the uploaded video to the server, the server applies the video processing method to perform super-resolution processing on the uploaded video to obtain a super-resolution video, and the super-resolution video is returned to the terminal, thereby achieving the purpose of improving the video resolution.
Of course, the technical solution provided in the embodiment of the present application can be applied to the above scenarios, and can also be applied to other scenarios in which super-resolution processing is performed on a video, which is not limited in the embodiment of the present application.
The technical solutions provided by the present disclosure are described next. Fig. 2 is a flowchart of a video processing method provided in an embodiment of the present application. As shown in fig. 2, taking the case where the method is used in the server 102 as an example, the method includes the following steps:
in step 201, based on a video frame of a video and a previous video frame of the video frame, a first image and a second image of the previous video frame are determined, the first image includes a region whose motion condition meets a target condition, and the second image includes a region whose motion condition does not meet the target condition.
In step 202, based on a video frame of a video and a subsequent video frame of the video frame, a first image and a second image of the subsequent video frame are determined.
In step 203, a fused feature map of the video frame is obtained based on a first feature map and a second feature map of the video frame, the first feature map is determined based on the video frame, a first image of the previous video frame and a first image of the next video frame, and the second feature map is determined based on the video frame, a second image of the previous video frame and a second image of the next video frame.
In step 204, a target video of the video is generated based on the fused feature maps of the plurality of video frames of the video and the plurality of video frames.
According to the technical solution provided by the embodiments of the application, when super-resolution processing needs to be performed on a video, different types of motion regions in the video frames adjacent before and after a video frame can be decoupled based on the motion conditions in the video frame, and the corresponding video feature fusion and related processing are performed separately for the different types of motion regions, so that the final super-resolution result of the video frame is obtained. This saves video processing time, and because features are extracted separately for the different types of motion regions, the extracted features are more accurate, the artifact problem caused by treating all regions identically is effectively avoided, and the clarity of the video is greatly improved.
After the implementation environment and the application scenario of the embodiments of the present application have been introduced, the technical solution provided by the embodiments of the present application is introduced below, taking the server as the execution subject as an example. Fig. 3 is a flowchart of a video processing method provided in an embodiment of the present application, and referring to fig. 3, the video processing method includes:
in step 301, a video is decoded to obtain a plurality of video frames.
In the embodiment of the present application, the video is a video to be subjected to super-resolution processing. In some embodiments, the video may be any type of video, and the embodiments of the present application do not limit this.
It is understood that the above step 301 is one implementation manner of acquiring a plurality of video frames of a video, and this process is not limited by the embodiment of the present application.
In some embodiments, scene-change detection is performed based on the acquired video frame, and step 302 is performed if it is detected that no scene change occurs at the video frame. The video processing method decouples motion regions and extracts features from continuous content in the same scene based on the video frames adjacent before and after the video frame, so it is better suited to continuous video in which the scene is not switched. By performing scene detection on the acquired video and applying the video processing only to video frames for which no scene change is detected, the method can also be applied to videos that contain multiple scenes.
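For illustration only, the following is a minimal Python sketch of step 301 and the optional scene-change check using OpenCV and NumPy. The application does not specify how scene switching is detected, so the mean-absolute-difference heuristic and the threshold `scene_thresh` are assumptions; the function names are hypothetical.

```python
import cv2
import numpy as np

def decode_video(path):
    """Step 301: decode a video file into a list of BGR frames."""
    frames, cap = [], cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

def no_scene_cut(curr, prev, scene_thresh=30.0):
    """Assumed heuristic: treat a large mean absolute difference between
    consecutive frames as a scene switch and skip such frames."""
    diff = np.mean(np.abs(curr.astype(np.float32) - prev.astype(np.float32)))
    return diff < scene_thresh
```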
In step 302, a first residual map of the ith video frame is obtained, where the first residual map is a residual between the ith video frame and the (i-1) th video frame, and i is a positive integer greater than 1.
Each pixel point in the first residual map is the difference between the pixel values at the corresponding positions of the i-th video frame and the (i-1)-th video frame, as given by formula (1) below.
R_{i-1} = |I_i - I_{i-1}|    (1)
where I_i is the pixel value of a pixel point in the i-th video frame, I_{i-1} is the pixel value at the corresponding position in the (i-1)-th video frame, R_{i-1} is the difference in pixel value between the pixel points at the corresponding positions in the i-th and (i-1)-th video frames, and i is a positive integer greater than 1.
In step 303, a pixel value threshold corresponding to the i-1 th video frame is determined based on the first residual map.
In some embodiments, the pixel value threshold is a mean of pixel values of pixels in the first residual map.
In step 304, based on the pixel value threshold corresponding to the i-1 th video frame and the first residual map, a first image and a second image of the i-1 th video frame are generated, wherein the first image includes a region of the i-1 th video frame whose motion condition meets the target condition, and the second image includes a region of the i-1 th video frame whose motion condition does not meet the target condition.
Wherein, the motion condition refers to the change condition of the pixel point.
In some embodiments, based on the pixel value threshold, it is determined that a region in the first residual image where the pixel value is greater than or equal to the pixel value threshold is a region where the motion condition meets the target condition, and a region in the first residual image where the pixel value is less than the pixel value threshold is a region where the motion condition does not meet the target condition.
Taking as an example the determination of regions whose pixel values are greater than or equal to the pixel value threshold: based on the pixel value threshold, a region in the first residual map whose pixel values are greater than or equal to the pixel value threshold is determined; a mask of the (i-1)-th video frame is generated based on the region corresponding to the determined region in the (i-1)-th video frame; and a first image and a second image of the (i-1)-th video frame are generated based on the mask of the (i-1)-th video frame and the (i-1)-th video frame.
In some embodiments, the mask of the (i-1) th video frame is multiplied by the (i-1) th video frame to obtain a first image of the (i-1) th video frame, and the first image of the (i-1) th video frame is subtracted from the (i-1) th video frame to obtain a second image of the (i-1) th video frame. Through the mask-based processing, the processing efficiency can be greatly improved.
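As a non-authoritative sketch of steps 302 to 304 (and, symmetrically, steps 305 to 307), the residual map of formula (1), the per-frame pixel value threshold, the mask and the two decoupled images can be computed with NumPy as follows. Function and variable names are illustrative, and averaging the residual over the color channels to obtain a single-channel mask is an assumption not fixed by the application.

```python
import numpy as np

def decouple_motion_regions(frame_i, frame_adj):
    """Split an adjacent frame into a first image (fast-motion regions) and a
    second image (remaining regions) relative to frame i, per formula (1)."""
    # Residual map: per-pixel absolute difference (averaged over color channels
    # here so that a single-channel mask can be formed).
    residual = np.abs(frame_i.astype(np.float32) - frame_adj.astype(np.float32)).mean(axis=2)
    # Pixel value threshold: mean pixel value of the residual map (step 303).
    threshold = residual.mean()
    # Mask: 1 where the motion condition meets the target condition (step 304).
    mask = (residual >= threshold).astype(np.float32)[..., None]
    first_image = mask * frame_adj            # regions meeting the target condition
    second_image = frame_adj - first_image    # remaining (slow-motion) regions
    return first_image, second_image
```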
In step 305, a second residual map of the ith video frame is obtained, where the second residual map is a residual between the video frame and the (i + 1) th video frame, and i is a positive integer greater than 1.
Each pixel point in the second residual map is the difference between the pixel values at the corresponding positions of the i-th video frame and the (i+1)-th video frame, as given by formula (2) below.
R_{i+1} = |I_i - I_{i+1}|    (2)
where I_i is the pixel value of a pixel point in the i-th video frame, I_{i+1} is the pixel value at the corresponding position in the (i+1)-th video frame, R_{i+1} is the difference in pixel value between the pixel points at the corresponding positions in the i-th and (i+1)-th video frames, and i is a positive integer greater than 1.
In step 306, a pixel value threshold corresponding to the i +1 th video frame is determined based on the second residual map.
The process of determining the pixel value threshold in step 306 is the same as that in step 303, and is not described herein again.
In step 307, based on the pixel value threshold corresponding to the i +1 th video frame and the second residual map, a first image and a second image of the i +1 th video frame are generated, where the first image includes a region of the i +1 th video frame whose motion condition meets the target condition, and the second image includes a region of the i +1 th video frame whose motion condition does not meet the target condition.
The process of generating the first image and the second image in step 307 and step 304 is the same, and is not described herein again.
It should be noted that, the steps 302 to 304 and 305 to 307 may be executed in parallel, or may be executed according to the current sequence, and of course, the steps 302 to 304 may be executed first, and then the steps 305 to 307 may be executed, which is not limited in this disclosure.
In step 308, the ith video frame, the first image of the (i-1) th video frame and the first image of the (i + 1) th video frame are spliced to obtain a first spliced video frame of the ith video frame.
The server concatenates the i-th video frame, the first image of the (i-1)-th video frame and the first image of the (i+1)-th video frame along the RGB channels to obtain a nine-channel first stitched video frame.
In step 309, a first stitched video frame of the ith video frame is input into a first convolutional neural network, and feature extraction is performed on the first stitched video frame of the ith video frame through the first convolutional neural network to obtain a first feature map.
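A minimal PyTorch-style sketch of steps 308 and 309: the three RGB images are concatenated along the channel dimension into a nine-channel tensor and passed through a feature-extraction network. Here `first_branch` is a placeholder for the first convolutional neural network (a sketch of its residual modules follows the Fig. 4 description below); the tensor layout is an assumption.

```python
import torch

def first_feature_map(frame_i, first_prev, first_next, first_branch):
    """frame_i, first_prev, first_next: tensors of shape (1, 3, H, W)."""
    # Step 308: stitch along the RGB channels -> (1, 9, H, W).
    stitched = torch.cat([frame_i, first_prev, first_next], dim=1)
    # Step 309: feature extraction by the first convolutional neural network.
    return first_branch(stitched)
```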
In some embodiments, fig. 4 is a schematic structural diagram of a first convolutional neural network provided in an embodiment of the present application. Referring to fig. 4, the convolutional neural network is composed of a plurality of residual modules, each residual module including two-dimensional convolutional layers 401 and 403, activation layers 402 and 404, and a cross-layer skip connection 405. Optionally, a convolution kernel is used in the two-dimensional convolutional layers, and the activation layers use a ReLU activation function.
Taking the first residual module as an example, the server inputs the first stitched video frame of the i-th video frame into the two-dimensional convolutional layer 401; the layer 401 extracts features from the input video frame, and the obtained feature activation values are input into the activation layer 402, which outputs features. The output of the activation layer 402 is the input of the two-dimensional convolutional layer 403, which extracts features from its input; the server adds the obtained feature activation values to the input of the two-dimensional convolutional layer 401 via the skip connection 405, inputs the sum into the activation layer 404, and the activation layer 404 processes it and outputs the features of this residual module.
For a residual module located between the first and the last of the plurality of residual modules, the server inputs the features output by the previous residual module into the two-dimensional convolutional layer 401; the processing inside the module is the same as above, and the activation layer 404 outputs the features of this residual module.
For the last residual module, the server inputs the features output by the previous residual module into the two-dimensional convolutional layer 401; the processing inside the module is again the same, and the activation layer 404 processes the sum and outputs the first feature map.
Of course, in the embodiment of the present application, the first convolutional neural network may adopt the convolutional neural network described above, and may also adopt other network structures to perform feature extraction, which is not limited in the embodiment of the present application.
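A hedged PyTorch sketch of the residual module of Fig. 4 is given below. The 3x3 kernel size, the channel width, the number of modules and the small head convolution (added only so that the skip-connection dimensions match the nine-channel input) are assumptions for illustration and are not taken from the application.

```python
import torch.nn as nn

class ResidualModule(nn.Module):
    """Two 2-D convolutions (401, 403), two ReLU activations (402, 404) and a
    cross-layer skip connection (405), as in Fig. 4."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # 401
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # 403
        self.relu = nn.ReLU(inplace=True)                                     # 402 / 404

    def forward(self, x):
        out = self.relu(self.conv1(x))   # 401 -> 402
        out = self.conv2(out)            # 403
        return self.relu(out + x)        # skip connection 405, then 404

class FirstBranch(nn.Module):
    """Assumed wrapper: nine-channel stitched input, stacked residual modules."""
    def __init__(self, channels=64, num_modules=5):
        super().__init__()
        self.head = nn.Conv2d(9, channels, kernel_size=3, padding=1)
        self.body = nn.Sequential(*[ResidualModule(channels) for _ in range(num_modules)])

    def forward(self, stitched):
        return self.body(self.head(stitched))
```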
In step 310, the ith video frame, the second image of the (i-1) th video frame and the second image of the (i + 1) th video frame are spliced to obtain a second spliced video frame of the ith video frame.
The server concatenates the i-th video frame, the second image of the (i-1)-th video frame and the second image of the (i+1)-th video frame along the RGB channels to obtain a nine-channel second stitched video frame.
In step 311, the second stitched video frame of the ith video frame is input into a second convolutional neural network, and feature extraction is performed on the second stitched video frame of the ith video frame through the second convolutional neural network to obtain a second feature map.
In some embodiments, the second convolutional neural network has the same network structure as the first convolutional neural network, which is not described herein again.
The first convolutional neural network and the second convolutional neural network are trained respectively, wherein the first convolutional neural network is obtained by training a first sample video frame, the first sample video frame comprises an area with a motion condition meeting a target condition, the second convolutional neural network is obtained by training a second sample video frame, and the second sample video frame comprises an area with a motion condition not meeting the target condition. Through independent deep learning, the first convolutional neural network and the second convolutional neural network can learn different network processing capabilities, so that the first convolutional neural network is more suitable for extracting the regional features of which the motion conditions meet the target conditions in the video frames, and the second convolutional neural network is more suitable for extracting the regional features of which the motion conditions do not meet the target conditions in the video frames.
In consecutive video frames, the regions whose motion condition meets the target condition change to a large degree, while the regions whose motion condition does not meet the target condition change to a small degree, and the information redundancy present in the regions with a small degree of change is taken into account. The deep learning approach may be supervised learning, unsupervised learning, reinforcement learning, and the like, which is not limited in the embodiments of the present application.
In some embodiments, the convolutional layer in the second convolutional neural network may be a sparse convolutional layer, and when performing feature extraction, the sparse convolutional layer may skip some regions in the slow motion region, so that by reducing the processing region, the amount of computation of the network may be effectively reduced, and the time for video processing may be reduced.
In step 312, the first feature map and the second feature map are fused to obtain a fused feature map of the video frame.
The fusion may be performed in any fusion manner, for example, in some embodiments, the first feature map and the second feature map are input to an adder to fuse the first feature map and the second feature map, and the pixel points in the first feature map and the pixel points in the second feature map are added by the adder to obtain a fused feature map.
In some embodiments, the first feature map and the second feature map are input into a convolutional neural network to fuse the first feature map and the second feature map, and the first feature map and the second feature map are fused by the convolutional neural network based on the contribution degree of the first feature map and the second feature map to obtain a fused feature map which can make the super-resolution effect better.
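Two hedged sketches of step 312, corresponding to the two fusion options above: a plain element-wise addition, and a small learned fusion layer. The 1x1 convolution is an assumed realization of fusion "based on contribution degree"; the application does not prescribe this operator.

```python
import torch
import torch.nn as nn

def fuse_by_addition(first_fm, second_fm):
    """Adder-style fusion: element-wise sum of the two feature maps."""
    return first_fm + second_fm

class LearnedFusion(nn.Module):
    """Assumed learned fusion: concatenate the two feature maps and let a
    1x1 convolution weight their contributions."""
    def __init__(self, channels=64):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, first_fm, second_fm):
        return self.proj(torch.cat([first_fm, second_fm], dim=1))
```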
The above steps 308 to 312 are illustrative of the process of obtaining the fused feature map of the video frame based on the first feature map and the second feature map of the video frame.
In step 313, the fused feature map of the ith video frame is up-sampled to obtain a target feature map.
In the embodiment of the application, the fused feature map is up-sampled to obtain a target feature map with a larger size, and the target feature map is used as an image basis for super resolution.
In some embodiments, the fused feature map is upsampled by an upsampling module, and in some embodiments, fig. 5 is a structural diagram of an upsampling module provided in an embodiment of the present application, where the upsampling module is a convolutional neural network, and the convolutional neural network includes an input layer 501, a convolutional layer 502, a deconvolution layer 503, and an output layer 504.
The server inputs the fused feature map of the video frame through the input layer 501; the convolutional layer 502 performs feature extraction on the input fused feature map to obtain new features; the deconvolution layer 503 is an upsampling layer that performs a transposed convolution on the features output by the convolutional layer 502 to obtain the target feature map of the video frame; and the output layer 504 outputs the target feature map of the video frame.
Of course, the above convolutional neural network may be used in the upsampling step in this embodiment, and other network structures may also be used to perform upsampling, which is not limited in this embodiment.
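A sketch of the Fig. 5 upsampling module: a convolutional layer (502) followed by a transposed-convolution (deconvolution) layer (503). The x2 scale factor, kernel sizes and channel count are assumptions for illustration.

```python
import torch.nn as nn

class UpsampleModule(nn.Module):
    """Input layer 501 -> conv layer 502 -> deconvolution layer 503 -> output 504."""
    def __init__(self, channels=64, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)        # 502
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=2 * scale,
                                         stride=scale, padding=scale // 2)         # 503

    def forward(self, fused_fm):
        # Output spatial size is scale times the input size.
        return self.deconv(self.conv(fused_fm))
```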
In step 314, the ith video frame is upsampled to obtain an upsampled video frame corresponding to the ith video frame.
In some embodiments, the ith video frame is upsampled by an image interpolation method, which may be a neighbor interpolation method, a bilinear interpolation method, a bicubic interpolation method, or the like, to obtain an upsampled video frame corresponding to the ith video frame.
Of course, the above image interpolation method may be adopted in this embodiment, and other upsampling methods may also be adopted, for example, upsampling is performed through a convolutional neural network, which is not limited in this embodiment of the present application.
In step 315, the target feature map and the upsampled video frame corresponding to the ith video frame are fused to obtain the target video frame of the ith video frame.
The above fusion may be performed in any fusion manner. For example, in some embodiments, the target feature map and the up-sampled video frame are input into an adder for fusion, and the pixel points at corresponding positions in the target feature map and the up-sampled video frame are added by the adder to obtain the target video frame of the i-th video frame.
In some embodiments, the target feature map and the up-sampled video frame are input into a convolutional neural network to be fused, and the target feature map and the up-sampled video frame are fused through the convolutional neural network based on the contribution of the target feature map and the up-sampled video frame to obtain a target video frame of the ith video frame which can enable the final super-resolution effect to be better.
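A sketch of steps 313 to 315 under stated assumptions: the fused feature map is upsampled by an instance of the module sketched after Fig. 5, the frame is upsampled by bicubic interpolation, and the two are fused adder-style. The `to_rgb` projection is hypothetical, introduced only so that the feature channels match the three RGB channels of the upsampled frame.

```python
import torch.nn.functional as F

def reconstruct_frame(fused_fm, frame_i, upsample_module, to_rgb, scale=2):
    """Steps 313-315. fused_fm: (1, C, H, W) feature map; frame_i: (1, 3, H, W) frame."""
    target_fm = upsample_module(fused_fm)                          # step 313
    up_frame = F.interpolate(frame_i, scale_factor=scale,
                             mode='bicubic', align_corners=False)  # step 314
    # to_rgb: assumed 3x3 convolution mapping feature channels back to RGB so
    # that the adder-style fusion of step 315 is dimensionally valid.
    return to_rgb(target_fm) + up_frame                            # step 315
```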
The above steps 302 to 315 are only descriptions of a processing procedure of one video frame, and for each video frame in the video, the server performs a corresponding procedure to obtain a target video frame corresponding to each video frame.
In step 316, a target video frame corresponding to the plurality of video frames is encoded to generate a target video.
Video coding is performed on the target video frames corresponding to the plurality of video frames to obtain the target video. It is understood that step 316 is one implementation of generating the target video, and the embodiments of the present application do not limit this process.
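An illustrative sketch of step 316 using OpenCV's VideoWriter; the codec, frame rate and output path are placeholder assumptions.

```python
import cv2

def encode_video(target_frames, out_path="target_video.mp4", fps=30.0):
    """Step 316: encode the target video frames into the target video.
    target_frames: list of BGR uint8 images of identical size."""
    h, w = target_frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
    for frame in target_frames:
        writer.write(frame)
    writer.release()
```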
The target video is subjected to decoupling of the motion areas of different types, and the motion areas of different types are input into different convolutional neural networks for feature extraction, so that the obtained regional features are more accurate, the final video is clearer, and the playing effect of the video is greatly improved.
According to the technical solution provided by the embodiments of the application, when super-resolution processing needs to be performed on a video, different types of motion regions in the video frames adjacent before and after a video frame can be decoupled based on the motion conditions in the video frame, and the corresponding video feature fusion and related processing are performed separately for the different types of motion regions, so that the final super-resolution result of the video frame is obtained. In the whole process, the motion regions are classified directly using the pixel residuals of adjacent video frames, the motion estimation and compensation steps based on pixel-by-pixel operations are omitted, the spatial complexity of the network is simplified, and video processing time is saved. Because features are extracted separately for the different types of motion regions, the extracted features are more accurate, treating all regions identically is effectively avoided, inaccurate motion estimation of fast-motion or occluded regions does not occur, the problem of artifacts in the video is avoided, and the clarity of the video is greatly improved.
Fig. 6 is a diagram of verifying the validity of motion region partition provided in an embodiment of the present application, and referring to fig. 6, 601 in the diagram is an original video frame, 602 is a result of performing motion region partition on the original video frame by using the video processing method, 601A in the diagram represents an area whose motion condition does not meet a target condition, and 602B represents an area whose motion condition meets a target condition.
Fig. 7 is a comparison graph of video processing effects provided by an embodiment of the present application, and referring to fig. 7, 701 is a real high-resolution video image, 702 to 707 are results of other video processing methods, and 708 is a result of the present video processing method. It can be seen that the present video processing method yields more detail than other methods.
Fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application. Referring to fig. 8, the apparatus includes: a determination unit 801, a fusion unit 802, and a generation unit 803.
The determination unit 801 is configured to perform, based on a video frame of a video and a previous video frame of the video frame, determining a first image and a second image of the previous video frame, the first image including a region where a motion condition meets a target condition, the second image including a region where the motion condition does not meet the target condition.
The determination unit 801 is further configured to perform determining the first image and the second image of a subsequent video frame based on the video frame of the video and the subsequent video frame of the video frame.
The fusion unit 802 is configured to perform obtaining a fusion feature map of the video frame based on a first feature map and a second feature map of the video frame, the first feature map being determined based on the video frame, a first image of the previous video frame and a first image of the subsequent video frame, the second feature map being determined based on the video frame, a second image of the previous video frame and a second image of the subsequent video frame.
The generating unit 803 is configured to perform generating a target video of the video based on the fused feature map of a plurality of video frames of the video and the video frame.
In some embodiments, the determination unit 801 includes: a first acquiring subunit configured to acquire a first residual map of the video frame, the first residual map being the residual between the video frame and the previous video frame of the video frame; a determining subunit configured to determine, based on the first residual map, a pixel value threshold corresponding to the previous video frame, the pixel value threshold being the mean pixel value of the pixel points in the first residual map; and a generating subunit configured to generate the first image and the second image of the previous video frame based on the pixel value threshold corresponding to the previous video frame and the first residual map.
In some embodiments, the generating subunit includes: a first determining subunit configured to determine, based on the pixel value threshold, a region in the first residual map whose pixel values are greater than or equal to the pixel value threshold; a first generating subunit configured to generate a mask of the previous video frame based on the region in the previous video frame corresponding to the determined region; and a second generating subunit configured to generate the first image and the second image of the previous video frame based on the mask of the previous video frame and the previous video frame.
In some embodiments, the second generating subunit is configured to multiply the mask of the previous video frame by the previous video frame to obtain the first image of the previous video frame, and to subtract the first image of the previous video frame from the previous video frame to obtain the second image of the previous video frame.
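As an illustration of the masking operation described above, the following is a minimal NumPy sketch of splitting the previous video frame into a first image and a second image. The function name, the H x W x C array layout with values in [0, 255], and the per-pixel averaging over channels are assumptions made for the example and are not specified by the embodiment.

    import numpy as np

    def split_motion_regions(frame: np.ndarray, prev_frame: np.ndarray):
        """Split prev_frame into a first (motion) image and a second (static) image."""
        frame = frame.astype(np.float32)
        prev = prev_frame.astype(np.float32)
        # First residual map: residual between the video frame and the previous video frame.
        residual = np.abs(frame - prev).mean(axis=-1)
        # Pixel value threshold: mean pixel value of the pixel points in the residual map.
        threshold = residual.mean()
        # Mask of the region whose pixel values are greater than or equal to the threshold.
        mask = (residual >= threshold).astype(np.float32)[..., None]
        # First image: mask multiplied by the previous frame; second image: the remainder.
        first_image = mask * prev
        second_image = prev - first_image
        return first_image, second_image, mask

Because the second image is obtained by subtraction, the two images are exact complements of the previous video frame, which matches the multiply-then-subtract formulation above.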
In some embodiments, the fusion unit 802 includes: a second acquiring subunit configured to acquire the first feature map and the second feature map of the video frame; and a fusion subunit configured to fuse the first feature map and the second feature map to obtain the fused feature map of the video frame.
In some embodiments, the second acquiring subunit is configured to stitch the video frame, the first image of the previous video frame, and the first image of the subsequent video frame to obtain a first stitched video frame of the video frame, input the first stitched video frame into a first convolutional neural network, and perform feature extraction on the first stitched video frame through the first convolutional neural network to obtain the first feature map. The first convolutional neural network is trained with first sample video frames, each of which includes a region whose motion condition meets the target condition.
In some embodiments, the second acquiring subunit is further configured to stitch the video frame, the second image of the previous video frame, and the second image of the subsequent video frame to obtain a second stitched video frame of the video frame, input the second stitched video frame into a second convolutional neural network, and perform feature extraction on the second stitched video frame through the second convolutional neural network to obtain the second feature map. The second convolutional neural network is trained with second sample video frames, each of which includes a region whose motion condition does not meet the target condition.
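The stitching and feature extraction, together with a simple fusion of the two feature maps, can be pictured with the short PyTorch sketch below. It assumes that "stitching" is channel-wise concatenation of three RGB frames and that each convolutional neural network is a small two-layer encoder; the class name, layer sizes, and the 1x1 fusion convolution are illustrative choices, not details taken from the embodiment.

    import torch
    import torch.nn as nn

    class TwoBranchExtractor(nn.Module):
        """Extracts the first and second feature maps and fuses them."""

        def __init__(self, channels: int = 64):
            super().__init__()
            # Each branch receives three RGB frames stitched along the channel axis (9 channels).
            def branch():
                return nn.Sequential(
                    nn.Conv2d(9, channels, 3, padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                )
            self.motion_branch = branch()   # branch for regions meeting the target condition
            self.static_branch = branch()   # branch for regions not meeting the target condition
            self.fuse = nn.Conv2d(2 * channels, channels, 1)  # fuses the two feature maps

        def forward(self, frame, prev_first, next_first, prev_second, next_second):
            # First stitched video frame: current frame + first images of the neighbouring frames.
            first_feat = self.motion_branch(torch.cat([frame, prev_first, next_first], dim=1))
            # Second stitched video frame: current frame + second images of the neighbouring frames.
            second_feat = self.static_branch(torch.cat([frame, prev_second, next_second], dim=1))
            # Fused feature map of the video frame.
            return self.fuse(torch.cat([first_feat, second_feat], dim=1))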
In some embodiments, the generation unit 803 is configured to, for each video frame in the video, upsample the fused feature map of the video frame to obtain a target feature map; upsample the video frame to obtain an upsampled video frame corresponding to the video frame; fuse the target feature map and the upsampled video frame corresponding to the video frame to obtain a target video frame of the video frame; and encode the target video frames corresponding to the plurality of video frames to generate the target video.
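For the generation step, a minimal PyTorch sketch is given below. It assumes pixel-shuffle upsampling for the fused feature map, bicubic interpolation for the video frame itself, and element-wise addition as the fusion of the two; none of these choices is prescribed by the embodiment, which only states that both are upsampled and then fused.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FrameReconstructor(nn.Module):
        """Turns a fused feature map and a low-resolution frame into a target video frame."""

        def __init__(self, channels: int = 64, scale: int = 4):
            super().__init__()
            self.scale = scale
            # Project the feature map to 3 * scale**2 channels, then pixel-shuffle to upsample it.
            self.to_rgb = nn.Conv2d(channels, 3 * scale * scale, 3, padding=1)
            self.shuffle = nn.PixelShuffle(scale)

        def forward(self, fused_feat: torch.Tensor, frame: torch.Tensor) -> torch.Tensor:
            # Upsample the fused feature map to obtain the target feature map.
            target_feat = self.shuffle(self.to_rgb(fused_feat))
            # Upsample the video frame to obtain the upsampled video frame.
            up_frame = F.interpolate(frame, scale_factor=self.scale,
                                     mode="bicubic", align_corners=False)
            # Fuse the two to obtain the target video frame.
            return up_frame + target_feat

The target video frames produced in this way can then be handed to any ordinary video encoder to generate the target video.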
In some embodiments, the video processing apparatus performs scene switching detection based on the acquired video frame, and processes the video frame with the determination unit 801, the fusion unit 802, and the generation unit 803 only when it detects that no scene switching has occurred at the video frame.
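The scene switching check can be implemented in many ways; the sketch below uses a simple heuristic, assuming that a scene switch is declared when the mean absolute residual between consecutive frames exceeds a fixed threshold, since the embodiment does not specify the detection criterion.

    import numpy as np

    def scene_switched(frame: np.ndarray, prev_frame: np.ndarray,
                       threshold: float = 30.0) -> bool:
        """Heuristic scene-switch detector: a large mean residual suggests a new scene."""
        residual = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
        return float(residual.mean()) > threshold

    # Illustrative gating of the per-frame processing:
    # if not scene_switched(frame, prev_frame):
    #     run the determination, fusion and generation units on the frame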
It should be noted that, when the video processing apparatus provided in the above embodiment processes a video, the division into the above functional modules is only used as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the computer device may be divided into different functional modules to complete all or part of the functions described above. In addition, the video processing apparatus provided in the above embodiment and the embodiment of the video processing method belong to the same concept, and the specific implementation process is described in the method embodiment and is not repeated here.
An embodiment of the present application provides a computer device, configured to perform the foregoing method, where the computer device may be implemented as a terminal or a server, and a structure of the terminal is introduced below:
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application.
In general, terminal 900 includes: one or more processors 901 and one or more memories 902.
The processor 901 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor: the main processor is the processor that handles data in the awake state, also called the CPU (Central Processing Unit); the coprocessor is a low-power processor that handles data in the standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. Memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 902 is used to store at least one computer program for execution by the processor 901 to implement the video processing methods provided by the method embodiments herein.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 904, display screen 905, camera assembly 906, audio circuitry 907, and power supply 908.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal.
Audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication.
The power supply 908 is used to supply power to the various components in the terminal 900. The power supply 908 may use alternating current or direct current, and may be a disposable or rechargeable battery.
In some embodiments, the terminal 900 also includes one or more sensors 909. The one or more sensors 909 include, but are not limited to: an acceleration sensor 910, a gyro sensor 911, a pressure sensor 912, an optical sensor 913, and a proximity sensor 914.
The acceleration sensor 910 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900.
The gyro sensor 911 can detect the body direction and rotation angle of the terminal 900, and can cooperate with the acceleration sensor 910 to capture the user's 3D actions on the terminal 900.
The pressure sensor 912 may be disposed on a side frame of the terminal 900 and/or under the display screen 905. When the pressure sensor 912 is disposed on the side frame of the terminal 900, it can detect the user's grip on the terminal 900, and the processor 901 performs left/right hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 912. When the pressure sensor 912 is disposed under the display screen 905, the processor 901 controls operable controls on the UI according to the pressure the user applies to the display screen 905.
The optical sensor 913 is used to collect the ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 according to the ambient light intensity collected by the optical sensor 913.
The proximity sensor 914 is used to gather the distance between the user and the front face of the terminal 900.
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of terminal 900, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
The computer device may also be implemented as a server, and the following describes a structure of the server:
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1000 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the one or more memories 1002 store at least one computer program that is loaded and executed by the one or more processors 1001 to implement the methods provided by the foregoing method embodiments. Of course, the server 1000 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server 1000 may also include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, there is also provided a computer-readable storage medium having at least one computer program stored therein, the computer program being loaded and executed by a processor to implement the video processing method in the above-described embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by a processor, implements the above-described video processing method.
In some embodiments, the computer program according to the embodiments of the present application may be deployed and executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed at multiple sites and interconnected by a communication network; the multiple computer devices distributed at multiple sites and interconnected by a communication network may constitute a blockchain system.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, or an optical disc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of video processing, the method comprising:
determining a first image and a second image of a previous video frame based on the video frame of the video and the previous video frame of the video frame, wherein the first image comprises an area of which the motion condition meets the target condition, and the second image comprises an area of which the motion condition does not meet the target condition;
determining a first image and a second image of a subsequent video frame based on a video frame of a video and the subsequent video frame;
acquiring a fused feature map of the video frame based on a first feature map and a second feature map of the video frame, wherein the first feature map is determined based on the video frame, the first image of the previous video frame and the first image of the subsequent video frame, and the second feature map is determined based on the video frame, the second image of the previous video frame and the second image of the subsequent video frame;
generating a target video of the video based on the fused feature maps of a plurality of video frames of the video and the plurality of video frames.
2. The video processing method of claim 1, wherein determining the first image and the second image of a previous video frame based on the video frame and the previous video frame of the video frame comprises:
acquiring a first residual map of the video frame, wherein the first residual map is the residual between the video frame and the previous video frame of the video frame;
determining a pixel value threshold corresponding to the previous video frame based on the first residual map, wherein the pixel value threshold is the mean pixel value of the pixel points in the first residual map;
and generating a first image and a second image of the previous video frame based on the pixel value threshold corresponding to the previous video frame and the first residual map.
3. The method of claim 2, wherein the generating the first image and the second image of the previous video frame based on the pixel value threshold corresponding to the previous video frame and the first residual map comprises:
determining, based on the pixel value threshold, a region in the first residual map having pixel values greater than or equal to the pixel value threshold;
generating a mask for the previous video frame based on a region in the previous video frame corresponding to the determined region;
generating the first image and the second image of the previous video frame based on the mask of the previous video frame and the previous video frame.
4. The video processing method according to claim 1, wherein obtaining the fused feature map of the video frame based on the first feature map and the second feature map of the video frame comprises:
acquiring a first feature map and a second feature map of the video frame;
and fusing the first feature map and the second feature map to obtain the fused feature map of the video frame.
5. The video processing method according to claim 1, wherein the generating a target video of the video based on the fused feature map of a plurality of video frames of the video and the plurality of video frames comprises:
for each video frame in the video, upsampling the fused feature map of the video frame to obtain a target feature map;
upsampling the video frame to obtain an upsampled video frame corresponding to the video frame;
fusing the target feature map and the upsampled video frame corresponding to the video frame to obtain a target video frame of the video frame;
and encoding the target video frames corresponding to the plurality of video frames to generate the target video.
6. The video processing method of claim 1, wherein the method further comprises:
performing scene switching detection based on the acquired video frame, and if it is detected that no scene switching occurs at the video frame, performing the processing on the video frame.
7. A video processing apparatus, characterized in that the apparatus comprises:
a determining unit configured to perform, based on a video frame of a video and a previous video frame of the video frame, determining a first image and a second image of the previous video frame, the first image including a region whose motion condition meets a target condition, the second image including a region whose motion condition does not meet the target condition;
the determining unit is further configured to perform determining a first image and a second image of a subsequent video frame based on a video frame of a video and the subsequent video frame;
a fusion unit configured to perform acquiring a fused feature map of the video frame based on a first feature map and a second feature map of the video frame, the first feature map being determined based on the video frame, the first image of the previous video frame and the first image of the subsequent video frame, and the second feature map being determined based on the video frame, the second image of the previous video frame and the second image of the subsequent video frame;
a generating unit configured to perform generating a target video of the video based on the fused feature maps of a plurality of video frames of the video and the plurality of video frames.
8. A computer device, characterized in that the computer device comprises:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the video processing method of any of claims 1 to 6.
9. A computer-readable storage medium, wherein program code in the computer-readable storage medium, when executed by a processor of a server, enables the server to perform the video processing method of any of claims 1 to 6.
10. A computer program product, comprising a computer program which, when executed by a processor, implements the video processing method of any one of claims 1 to 6.
CN202210273291.6A 2022-03-18 2022-03-18 Video processing method, device and equipment Pending CN114638772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210273291.6A CN114638772A (en) 2022-03-18 2022-03-18 Video processing method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210273291.6A CN114638772A (en) 2022-03-18 2022-03-18 Video processing method, device and equipment

Publications (1)

Publication Number Publication Date
CN114638772A true CN114638772A (en) 2022-06-17

Family

ID=81950569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210273291.6A Pending CN114638772A (en) 2022-03-18 2022-03-18 Video processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN114638772A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination