CN114157895A - Video processing method and device, electronic equipment and storage medium - Google Patents

Video processing method and device, electronic equipment and storage medium

Info

Publication number
CN114157895A
CN114157895A (application number CN202111465240.5A / CN202111465240A)
Authority
CN
China
Prior art keywords
reference frame
output result
video
feature
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111465240.5A
Other languages
Chinese (zh)
Inventor
丁予康
戴宇荣
徐宁
周雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111465240.5A priority Critical patent/CN114157895A/en
Publication of CN114157895A publication Critical patent/CN114157895A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Abstract

The present disclosure relates to a video processing method, an apparatus, an electronic device, and a storage medium. The video processing method includes: acquiring a video frame group of a first video, wherein the video frame group is divided into a reference frame and a non-reference frame; obtaining a feature output result of a deep neural network model for the reference frame; obtaining a feature output result of the deep neural network model for a non-reference frame based on the feature output result of the deep neural network model for the reference frame; and obtaining a second video based on the feature output results of the deep neural network model for the reference frame and the non-reference frame, wherein the resolution of the second video is greater than that of the first video.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of signal processing, and in particular, to a video processing method and apparatus, an electronic device, and a storage medium.
Background
The video super-resolution technology has important application value in the field of video processing, and can process low-resolution and low-quality videos to obtain high-quality and high-resolution videos, so that the subjective quality and the objective quality of the videos are improved.
At present, both deep-learning-based video super-resolution techniques and traditional super-resolution techniques typically decode a video into video frames, process the video frames one by one, and finally encode the processed video frames into a video as the final output.
Because the content difference between consecutive video frames is small, such frame-by-frame processing repeatedly computes a large amount of repeated content, which wastes computation and makes video processing slow.
Disclosure of Invention
The present disclosure provides a video processing method, an apparatus, an electronic device and a storage medium, which at least solve the problems in the related art of a large amount of computation and a low video processing speed when improving video resolution.
According to a first aspect of the embodiments of the present disclosure, there is provided a video processing method, including: acquiring a video frame group of a first video, wherein the video frame group is divided into a reference frame and a non-reference frame; obtaining a feature output result of a deep neural network model for the reference frame; obtaining a feature output result of the deep neural network model for a non-reference frame based on the feature output result of the deep neural network model for the reference frame; and obtaining a second video based on the feature output results of the deep neural network model for the reference frame and the non-reference frame, wherein the resolution of the second video is greater than that of the first video.
Optionally, the obtaining the feature output result of the deep neural network model for the non-reference frame based on the feature output result of the deep neural network model for the reference frame includes: obtaining feature output results of the deep neural network model for a non-reference frame by: determining a feature change area and a non-feature change area of a non-reference frame compared with a reference frame, acquiring a first feature output result of a deep neural network model for the feature change area, acquiring a second feature output result of the deep neural network model for the non-feature change area based on the feature output result for the reference frame, and acquiring the feature output result of the deep neural network model for the non-reference frame based on the first feature output result and the second feature output result.
Optionally, the obtaining a feature output result of the deep neural network model for a reference frame includes: obtaining a feature output result of a reference frame in each layer network in the deep neural network model, wherein the deep neural network model comprises n layer networks, and n is a positive integer greater than 1;
the obtaining of the feature output result of the deep neural network model for the non-reference frame by the following means includes: determining a characteristic change region and a non-characteristic change region of a non-reference frame in an i-1 layer network of the deep neural network model compared with a reference frame, performing convolution operation on the characteristic change region in the i-layer network to obtain a first characteristic output result of the i-layer network for the characteristic change region, obtaining a second characteristic output result of the i-layer network for the non-characteristic change region based on the characteristic output result of the reference frame in the i-layer network, and obtaining the characteristic output result of the non-reference frame in the i-layer network based on the first characteristic output result and the second characteristic output result, wherein i is more than or equal to 2 and less than or equal to n;
the obtaining a second video based on the feature output results of the deep neural network model for the reference frame and the non-reference frame comprises: and obtaining a second video based on the characteristic output result of the reference frame and the non-reference frame in the n-th layer network of the deep neural network model.
Optionally, the acquiring the video frame group includes: the method comprises the steps of obtaining a first video, decoding the first video to obtain at least one video frame group, and dividing video frames in each video frame group into reference frames and non-reference frames.
Optionally, the determining that the non-reference frame is in a feature change region and a non-feature change region of an i-1 layer network of the deep neural network model compared with the reference frame includes: and determining the characteristic change region and the non-characteristic change region according to the characteristic output result of the non-reference frame on the i-1 layer network and the characteristic output result of the reference frame on the i-1 layer network.
Optionally, the determining the characteristic change region and the non-characteristic change region according to the characteristic output result of the non-reference frame on the i-1 layer network and the characteristic output result of the reference frame on the i-1 layer network includes: calculating the characteristic difference between the characteristic output result of the non-reference frame on the i-1 layer network and the characteristic output result of the reference frame on the i-1 layer network; obtaining a binary feature of a single channel based on the feature difference; and determining a characteristic change area and a non-characteristic change area according to the obtained binary characteristics.
Optionally, the performing, in the i-layer network, a convolution operation on only the characteristic change region to obtain a first characteristic output result of the i-layer network for the characteristic change region includes: performing sparse convolution on the feature difference with the binary feature in a layer-i network to obtain a feature output result of the layer-i network for the feature difference, and obtaining a first feature output result based on the binary feature and the feature output result for the feature difference;
wherein the obtaining of the second characteristic output result of the i-th network for the non-characteristic change region based on the characteristic output result of the reference frame on the i-th network comprises: obtaining a second characteristic output result based on the binary characteristic and the characteristic output result of the reference frame in the i-layer network;
the obtaining of the feature output result of the non-reference frame on the i-th layer network based on the first feature output result and the second feature output result includes: and obtaining the characteristic output result of the non-reference frame on the i-th layer network by adding the first characteristic output result and the second characteristic output result.
Optionally, the obtaining a second video based on feature output results of the reference frame and the non-reference frame in an n-th layer of the deep neural network model includes: and upsampling the characteristic output result of the reference frame and the non-reference frame on the n-th layer network, and obtaining a second video based on the upsampled result.
Optionally, the reference frame is a first video frame in the video frame group, and the non-reference frames are the rest of the video frames in the video frame group except the first video frame.
According to a second aspect of the embodiments of the present disclosure, there is provided a video processing apparatus including: a video acquisition unit configured to acquire a video frame group of a first video, wherein the video frame group is divided into a reference frame and a non-reference frame; a video processing unit configured to: obtain a feature output result of a deep neural network model for the reference frame; obtain a feature output result of the deep neural network model for a non-reference frame based on the feature output result of the deep neural network model for the reference frame; and obtain a second video based on the feature output results of the deep neural network model for the reference frame and the non-reference frame, wherein the resolution of the second video is greater than that of the first video.
Optionally, the obtaining the feature output result of the deep neural network model for the non-reference frame based on the feature output result of the deep neural network model for the reference frame includes: obtaining feature output results of the deep neural network model for a non-reference frame by: determining a feature change area and a non-feature change area of a non-reference frame compared with a reference frame, acquiring a first feature output result of a deep neural network model for the feature change area, acquiring a second feature output result of the deep neural network model for the non-feature change area based on the feature output result for the reference frame, and acquiring the feature output result of the deep neural network model for the non-reference frame based on the first feature output result and the second feature output result.
Optionally, the obtaining a feature output result of the deep neural network model for a reference frame includes: obtaining a feature output result of a reference frame in each layer network in the deep neural network model, wherein the deep neural network model comprises n layer networks, and n is a positive integer greater than 1;
the obtaining of the feature output result of the deep neural network model for the non-reference frame by the following means includes: determining a characteristic change region and a non-characteristic change region of a non-reference frame in an i-1 layer network of the deep neural network model compared with a reference frame, performing convolution operation on the characteristic change region in the i-layer network to obtain a first characteristic output result of the i-layer network for the characteristic change region, obtaining a second characteristic output result of the i-layer network for the non-characteristic change region based on the characteristic output result of the reference frame in the i-layer network, and obtaining the characteristic output result of the non-reference frame in the i-layer network based on the first characteristic output result and the second characteristic output result, wherein i is more than or equal to 2 and less than or equal to n;
the obtaining a second video based on the feature output results of the deep neural network model for the reference frame and the non-reference frame comprises: and obtaining a second video based on the characteristic output result of the reference frame and the non-reference frame in the n-th layer network of the deep neural network model.
Optionally, the acquiring the video frame group includes: the method comprises the steps of obtaining a first video, decoding the first video to obtain at least one video frame group, and dividing video frames in each video frame group into reference frames and non-reference frames.
Optionally, the determining that the non-reference frame is in a feature change region and a non-feature change region of an i-1 layer network of the deep neural network model compared with the reference frame includes: and determining the characteristic change region and the non-characteristic change region according to the characteristic output result of the non-reference frame on the i-1 layer network and the characteristic output result of the reference frame on the i-1 layer network.
Optionally, the determining the characteristic change region and the non-characteristic change region according to the characteristic output result of the non-reference frame on the i-1 layer network and the characteristic output result of the reference frame on the i-1 layer network includes: calculating the characteristic difference between the characteristic output result of the non-reference frame on the i-1 layer network and the characteristic output result of the reference frame on the i-1 layer network; obtaining a binary feature of a single channel based on the feature difference; and determining a characteristic change area and a non-characteristic change area according to the obtained binary characteristics.
Optionally, the performing, in the i-layer network, a convolution operation on only the characteristic change region to obtain a first characteristic output result of the i-layer network for the characteristic change region includes: performing sparse convolution on the feature difference with the binary feature in a layer-i network to obtain a feature output result of the layer-i network for the feature difference, and obtaining a first feature output result based on the binary feature and the feature output result for the feature difference;
wherein the obtaining of the second characteristic output result of the i-th network for the non-characteristic change region based on the characteristic output result of the reference frame on the i-th network comprises: obtaining a second characteristic output result based on the binary characteristic and the characteristic output result of the reference frame in the i-layer network;
the obtaining of the feature output result of the non-reference frame on the i-th layer network based on the first feature output result and the second feature output result includes: and obtaining the characteristic output result of the non-reference frame on the i-th layer network by adding the first characteristic output result and the second characteristic output result.
Optionally, the obtaining a second video based on feature output results of the reference frame and the non-reference frame in an n-th layer of the deep neural network model includes: and upsampling the characteristic output result of the reference frame and the non-reference frame on the n-th layer network, and obtaining a second video based on the upsampled result.
Optionally, the reference frame is a first video frame in the video frame group, and the non-reference frames are the rest of the video frames in the video frame group except the first video frame.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus, including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the video processing method as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions, which when executed by at least one processor, cause the at least one processor to perform the video processing method as described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement the video processing method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects: according to the video processing method of the embodiment of the disclosure, since the video frame is divided into the reference frame and the non-reference frame, and the feature output result of the deep neural network model for the non-reference frame is obtained based on the feature output result of the deep neural network model for the reference frame, the amount of calculation when obtaining the feature output result of the non-reference frame is reduced, thereby enabling a higher resolution video to be obtained at a higher speed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is an exemplary system architecture to which exemplary embodiments of the present disclosure may be applied;
FIG. 2 is a schematic diagram of a conventional video super resolution technique;
FIG. 3 is a schematic diagram illustrating the shortcomings of conventional video super resolution techniques;
fig. 4 is a flowchart of a video processing method of an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating the processing of reference frames of an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating the processing of non-reference frames of an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating the structure of a deep neural network model of an exemplary embodiment of the present disclosure;
fig. 8 is a block diagram showing a video processing apparatus of an exemplary embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "any combination of a plurality of the items", and "all of the items". For example, "includes at least one of A and B" covers the following three parallel cases: (1) includes A; (2) includes B; (3) includes A and B. Similarly, "at least one of step one and step two is performed" covers the following three parallel cases: (1) step one is performed; (2) step two is performed; (3) step one and step two are both performed.
Fig. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or send messages (e.g., image or video data upload requests, image or video data download requests), etc. Various communication client applications, such as audio and video communication software, audio and video recording software, instant messaging software, conference software, mailbox clients, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103. Further, various image or video shooting editing applications may also be installed on the terminal apparatuses 101, 102, and 103. The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and capable of playing, recording, editing, etc. audio and video, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, etc. When the terminal device 101, 102, 103 is software, it may be installed in the electronic devices listed above, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or it may be implemented as a single software or software module. And is not particularly limited herein.
The terminal devices 101, 102, 103 may be equipped with image capturing means (e.g. a camera) to capture image or video data. In practice, the smallest visual unit that makes up a video is a Frame (Frame). Each frame is a static image. Temporally successive sequences of frames are composited together to form a motion video. Further, the terminal apparatuses 101, 102, 103 may also be mounted with a component (e.g., a speaker) for converting an electric signal into sound to play the sound, and may also be mounted with a device (e.g., a microphone) for converting an analog audio signal into a digital audio signal to pick up the sound. In addition, the terminal apparatuses 101, 102, 103 can perform voice communication or video communication with each other.
The server 105 may be a server providing various services, such as a background server providing support for multimedia applications installed on the terminal devices 101, 102, 103. The background server can analyze and store received data such as audio and video data upload requests, and can also receive audio and video data download requests sent by the terminal devices 101, 102 and 103 and feed back the audio and video data indicated by those download requests to the terminal devices 101, 102 and 103.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the video processing method provided by the embodiment of the present disclosure is generally executed by a terminal device, but may also be executed by a server, or may also be executed by cooperation of the terminal device and the server. Accordingly, the video processing apparatus may be provided in the terminal device, the server, or both the terminal device and the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation, and the disclosure is not limited thereto.
As mentioned in the background art, the currently prevailing super-resolution technique decodes the video to be processed and extracts its frames, then serially invokes a super-resolution algorithm to process each frame, and, after every frame has been processed, encodes the resulting frames into the final output video.
Fig. 2 is a schematic diagram of a conventional video super-resolution technique. As shown in fig. 2, a Low Resolution (LR) video is first decoded into video frames (LR video frames), then the LR video frames are processed frame by frame, that is, each LR video frame is subjected to super resolution processing by a super resolution model (SR model) to obtain a corresponding High Resolution (HR) video frame, and finally the HR video frame is encoded to obtain a High Resolution (HR) video.
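For orientation only, this conventional per-frame pipeline can be sketched as a plain loop; the sr_model callable and the surrounding decode/encode steps are hypothetical placeholders, not part of the disclosed method.

```python
import torch

def frame_by_frame_sr(lr_frames, sr_model):
    """Conventional pipeline of fig. 2: every decoded LR frame is pushed through
    the SR model independently, with no reuse of features between frames."""
    hr_frames = []
    with torch.no_grad():
        for frame in lr_frames:                       # one [C, H, W] tensor per frame
            hr_frames.append(sr_model(frame.unsqueeze(0)).squeeze(0))
    return hr_frames                                  # to be encoded into the HR video
```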
However, the content difference between consecutive video frames in the same scene is very small relative to the whole frame, and there is a great deal of redundancy between the contents of the video frames; in particular, in a deep learning model, the feature difference between adjacent frames gradually decreases as the network depth increases.
Fig. 3 is a schematic diagram illustrating the disadvantage of the conventional video super-resolution technique.
As shown in fig. 3, after the 1st to n-th frames of the video to be processed are each passed through every layer of the network, the feature output result (referred to as "network feature" in fig. 3) of each of the 2nd to n-th frames in each layer is subtracted from the feature output result of the 1st frame in the corresponding layer. The resulting per-layer feature differences show that each of the 2nd to n-th frames has a large number of non-feature change regions compared with the 1st frame. However, the conventional frame-by-frame video super-resolution approach does not consider the information amount of the current frame and does not treat feature change regions and non-feature change regions differently. As a result, regions already computed for the reference frame are computed again, the overall amount of computation becomes redundant, and the processing speed is too slow.
Therefore, the present disclosure provides an efficient video super-resolution technique based on reuse of reference-frame features. In general, the disclosed technique divides video frames into two types, reference frames and non-reference frames; after the feature output result of the model for a reference frame is obtained, the feature output result of the model for a non-reference frame is obtained based on the feature output result of the reference frame, which effectively reduces the amount of computation for the non-reference frame and increases the video processing speed. For example, a reference frame undergoes conventional super-resolution processing, and its feature output results in all layers of the network are stored. For a non-reference frame, a feature change region is determined and super-resolution processing is performed only on that region, while the stored feature output result of the corresponding layer of the reference frame is used as the feature output result of the non-feature change region. Because the feature change region between a non-reference frame and the reference frame is small, the processing time for the feature change region is short, and because most frames in a video are non-reference frames, the disclosed video super-resolution technique effectively reduces the amount of computation and improves the overall video processing speed.
Fig. 4 is a flowchart of a video processing method according to an exemplary embodiment of the present disclosure.
Referring to fig. 4, in step S410, a video frame group of a first video is acquired. Here, the group of video frames is divided into reference frames and non-reference frames. Specifically, in step S410, a first video (i.e., a video to be processed) may be obtained first, then the first video is decoded to obtain at least one video frame group, and the video frames in each video frame group are divided into reference frames and non-reference frames. For example, a plurality of video frame groups, in each of which 10 video frames may be included, are obtained by decoding the first video, but the number of video frames included in each of the video frame groups is not limited to the above example. As an example, the reference frame may be a first video frame in the video frame group, and the non-reference frame may be the rest of the video frames except the first video frame in the video frame group, however, the division manner of the reference frame and the non-reference frame is not limited thereto, for example, each video frame may have its previous video frame as its reference frame.
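A minimal sketch of this grouping step is shown below, assuming (as in the example above) groups of 10 frames with the first frame of each group taken as the reference frame; the function name and dictionary layout are illustrative only.

```python
def split_into_groups(frames, group_size=10):
    """Split decoded video frames into groups; in each group the first frame is
    the reference frame and the remaining frames are non-reference frames."""
    groups = []
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        groups.append({"reference": group[0], "non_reference": group[1:]})
    return groups
```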
In step S420, a feature output result of the deep neural network model for a reference frame is obtained. For example, the feature output result of each layer of the reference frame in the deep neural network model can be obtained. Here, the deep neural network model includes an n-layer network, where n is a positive integer greater than 1. Specifically, in the i-th network, the characteristic output result of the reference frame in the i-th network is obtained based on the characteristic output result of the reference frame in the i-1-th network, and the characteristic output result of the i-th network is saved, wherein i is more than or equal to 2 and less than or equal to n.
Fig. 5 is a schematic diagram illustrating processing of a reference frame according to an exemplary embodiment of the present disclosure. As shown in fig. 5, when the reference frame is reference frame i, the reference frame i is first input into the layer-1 network of the deep neural network model, where convolution processing is performed with a convolution operator to obtain the feature output result of reference frame i in the layer-1 network (referred to as "layer 1 feature" in fig. 5). The feature output result of the layer-1 network is then input into the layer-2 network, where it is convolved with a convolution operator to obtain the feature output result of reference frame i in the layer-2 network (referred to as "layer 2 feature" in fig. 5), and so on: the feature output result of each layer is used as the input of the next layer until the feature output result of reference frame i in the layer-n network (i.e., the last layer) is obtained. In addition, the feature output results of the reference frame in each layer of the deep neural network model can be saved for subsequent processing of the non-reference frames. For example, the feature output results of reference frame i at each layer of the deep neural network model may be stored in a "feature Bank" ("reference frame feature Bank").
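A minimal sketch of this reference-frame pass is given below. The plain stack of 3×3 convolutions with ReLU activations, the 64-channel width, and the RGB input are assumptions for illustration; the patent does not fix these choices.

```python
import torch
import torch.nn as nn

class ReferenceBranch(nn.Module):
    """Reference-frame path: ordinary convolutions layer by layer, with every
    intermediate feature output stored in a 'feature bank' for later reuse."""
    def __init__(self, n_layers=8, channels=64):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else channels, channels, 3, padding=1)
             for i in range(n_layers)])

    def forward(self, ref_frame):
        feature_bank = []
        x = ref_frame
        for layer in self.layers:
            x = torch.relu(layer(x))   # feature output result of this layer
            feature_bank.append(x)     # saved for processing the non-reference frames
        return feature_bank
```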
Next, in step S430, feature output results of the deep neural network model for non-reference frames are obtained based on the feature output results of the deep neural network model for reference frames. According to an exemplary embodiment, the feature output result of the deep neural network model for a non-reference frame may be obtained, for example, by: determining a feature change area and a non-feature change area of a non-reference frame compared with a reference frame, acquiring a first feature output result of a deep neural network model for the feature change area, acquiring a second feature output result of the deep neural network model for the non-feature change area based on the feature output result for the reference frame, and acquiring the feature output result of the deep neural network model for the non-reference frame based on the first feature output result and the second feature output result. Specifically, as an example, the obtaining the feature output result of the deep neural network model for the non-reference frame by the following method includes: the method comprises the steps of determining a characteristic change region and a non-characteristic change region of an i-1 layer network of the deep neural network model of a non-reference frame compared with a reference frame, performing convolution operation on the characteristic change region in the i-layer network to obtain a first characteristic output result of the i-layer network for the characteristic change region, obtaining a second characteristic output result of the i-layer network for the non-characteristic change region based on the characteristic output result of the reference frame in the i-layer network, and obtaining the characteristic output result of the non-reference frame in the i-layer network based on the first characteristic output result and the second characteristic output result, wherein i is larger than or equal to 2 and smaller than or equal to n.
Specifically, for each non-reference frame, first, the characteristic change region and the non-characteristic change region may be determined according to a characteristic output result of the non-reference frame on the i-1 th layer network and a characteristic output result of the reference frame on the i-1 th layer network. For example, a feature difference between the feature output result of the non-reference frame on the i-1 th layer network and the feature output result of the reference frame on the i-1 th layer network may be calculated, then a single-channel binary feature may be obtained based on the feature difference, and finally, a feature change region and a non-feature change region may be determined from the obtained binary feature.
Fig. 6 is a schematic diagram illustrating processing of a non-reference frame according to an exemplary embodiment of the present disclosure. For example, for a non-reference frame n (the (i+1)-th frame in fig. 6), assume that its feature output result on the (i-1)-th layer network (i.e., the input to the i-th layer network) is F_n^{i-1}, and that the feature output result of the reference frame (the i-th frame in fig. 6) on the (i-1)-th layer network stored in the feature bank is F_r^{i-1}. In the i-th layer network, the feature difference is obtained as

    ΔF^{i-1} = F_n^{i-1} - F_r^{i-1}

Subsequently, the feature difference ΔF^{i-1} is used to determine the feature change region and the non-feature change region. Specifically, a single-channel binary feature is obtained based on the feature difference ΔF^{i-1}, and the feature change region and the non-feature change region are determined according to the obtained binary feature. For example, a convolution operation may be performed on the feature difference using a 3×3 convolutional layer to obtain a single-channel binary feature M^{i-1}. In the binary feature, a region with a value of 0 is a non-feature change region, and a region with a value of 1 is a feature change region (i.e., a region of interest, ROI). The single-channel binary feature M^{i-1} obtained above may also be referred to as the ROI mask ("ROI Mask").
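A possible sketch of this ROI-mask step is shown below. The text above only specifies a 3×3 convolutional layer producing a single-channel binary feature; the sigmoid-plus-threshold binarization and the 64-channel input are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RoiMask(nn.Module):
    """Compute the feature difference and a single-channel 0/1 ROI mask."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, feat_nonref_prev, feat_ref_prev):
        delta = feat_nonref_prev - feat_ref_prev        # feature difference at layer i-1
        mask = torch.sigmoid(self.conv(delta)) > 0.5    # 1 = feature change region (ROI)
        return delta, mask.float()
```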
After the feature change region and the non-feature change region are determined, a convolution operation is performed only on the feature change region in the i-th layer network to obtain the first feature output result of the i-th layer network for the feature change region. Specifically, a sparse convolution is first performed on the feature difference with the binary feature in the i-th layer network to obtain the feature output result of the i-th layer network for the feature difference, and the first feature output result is then obtained based on the binary feature and the feature output result for the feature difference, for example by multiplying the two. As shown in fig. 6, the ROI mask indicates which regions are feature change regions (the regions inside the rectangular frame in fig. 6) and which are non-feature change regions (the regions outside the rectangular frame in fig. 6), and the convolution operation is performed only on the feature change regions. For example, if the feature output result of the i-th layer network for the feature difference ΔF^{i-1} is denoted ΔF^{i}, the first feature output result of the i-th layer for the feature change region is M^{i-1} · ΔF^{i}.
Next, the second feature output result of the i-th layer network for the non-feature change region is obtained based on the feature output result of the reference frame on the i-th layer network. Specifically, the second feature output result may be obtained based on the binary feature and the feature output result of the reference frame on the i-th layer network, for example by multiplying the difference between 1 and the binary feature by the feature output result of the reference frame on the i-th layer network. For example, if the feature output result of the reference frame r on the i-th layer network obtained from the feature bank is F_r^{i}, the second feature output result of the i-th layer network for the non-feature change region is (1 - M^{i-1}) · F_r^{i}.
Finally, the feature output result of the non-reference frame on the i-th layer network is obtained based on the first feature output result and the second feature output result, specifically by adding them. Denoting the feature output result of the non-reference frame n on the i-th layer network as F_n^{i}:

    F_n^{i} = M^{i-1} · ΔF^{i} + (1 - M^{i-1}) · F_r^{i}

The above operation is performed for the non-reference frame in each layer of the network until the feature output result of the non-reference frame on the n-th layer network is obtained.
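The per-layer combination just described can be sketched as follows. A dense convolution of the masked difference stands in for a true sparse convolution, and the channel width is an assumed value; the last line implements F_n^{i} = M^{i-1} · ΔF^{i} + (1 - M^{i-1}) · F_r^{i} as stated above.

```python
import torch.nn as nn

class FeatureReuseConv(nn.Module):
    """One i-th-layer step for a non-reference frame: convolve only the masked
    feature difference and reuse the stored reference-frame feature elsewhere."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, delta_prev, mask, feat_ref_curr):
        delta_curr = self.conv(delta_prev * mask)   # feature output for the difference
        first = mask * delta_curr                   # result for the feature change region
        second = (1.0 - mask) * feat_ref_curr       # reused reference feature elsewhere
        return first + second                       # feature output of the non-reference frame
```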
Finally, in step S440, a second video is obtained for the feature output results of the reference frame and the non-reference frame based on the deep neural network model. Here, the resolution of the second video is greater than the resolution of the first video. Specifically, for example, the second video is obtained based on the feature output result of the reference frame and the non-reference frame in the n-th layer network of the deep neural network model.
Fig. 7 shows a schematic diagram of the structure of a deep neural network model of an exemplary embodiment of the present disclosure. As shown in fig. 7, the deep neural network model includes an up-sampling module in addition to a multi-layer network. In the example of fig. 7, the upsampling module is represented by the second to last cube in the network structure shown in fig. 7, which implements upsampling by deconvolution.
Specifically, in step S440, when the second video is obtained based on the feature output results of the reference frame and the non-reference frame on the n-th network of the deep neural network model, the feature output results of the reference frame and the non-reference frame on the n-th network may be upsampled, and the second video is obtained based on the upsampled result. Optionally, the upsampled result may be further subjected to a convolution operation to reduce the checkerboard effect, and the video frame with the reduced checkerboard effect may be encoded to finally obtain the second video.
For both reference and non-reference frames, the same deep neural network model structure is used, except that the convolution operator used for the reference frame is different from the convolution operator used for the non-reference frame in each layer of the network. For example, a conventional convolution operator is used for the reference frame, and a feature multiplexing convolution operator is used for the non-reference frame, and the feature multiplexing convolution operator is a convolution operator that utilizes the feature output result of the reference frame.
In the example of fig. 7, each cube preceding the upsampling module represents one of the n layers of the network mentioned above. Furthermore, several layers may constitute a residual block, and a plurality of residual blocks may constitute a residual block group through different connection means (e.g., full, long, short, or skip connections). After the feature output results of the reference frame and the non-reference frames in the n-th layer are obtained through the n-layer network of the deep neural network model, high-resolution video frames are finally obtained through the upsampling module and the convolutional layer. Finally, the processed video (i.e., the second video above) is obtained by encoding the obtained high-resolution video frames.
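A sketch of such a tail module is given below, using deconvolution for upsampling followed by an extra convolution to reduce the checkerboard effect; the 4x scale factor and channel widths are assumptions for illustration.

```python
import torch.nn as nn

class UpsampleHead(nn.Module):
    """Upsampling module (deconvolution) plus a final convolution that reduces
    the checkerboard effect, producing the high-resolution video frame."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(channels, channels,
                                         kernel_size=scale, stride=scale)
        self.final_conv = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, feat_last_layer):
        return self.final_conv(self.deconv(feat_last_layer))
```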
The video processing method according to the embodiment of the present disclosure has been described above with reference to fig. 4 to 7, and according to the above video processing method, since the video frame is divided into the reference frame and the non-reference frame, and the feature output result of the deep neural network model for the non-reference frame is obtained based on the feature output result of the deep neural network model for the reference frame, the amount of calculation when obtaining the feature output result of the non-reference frame is reduced, thereby enabling to obtain a video with higher resolution at a faster speed.
For example, when the feature output result of the deep neural network model for the non-reference frame is obtained based on the feature output result of the deep neural network model for the reference frame, the method determines a feature change region and a non-feature change region of the non-reference frame compared with the reference frame, obtains a first feature output result of the deep neural network model for the feature change region, obtains a second feature output result of the deep neural network model for the non-feature change region based on the feature output result for the reference frame, and obtains the feature output result of the deep neural network model for the non-reference frame based on the first feature output result and the second feature output result. In this way the computation can be concentrated on the feature change region, the amount of computation when obtaining the feature output result of the non-reference frame is reduced, and a higher-resolution video can therefore be obtained at a higher speed.
Extensive experiments have shown that, with the above video processing method, the amount of computation for a non-reference frame is only about 40% of that for a reference frame. For example, with video frame groups of 10 frames in the form of 1 reference frame + 9 non-reference frames, the total amount of computation can be reduced by about 54%, and the subjective visual quality of the finally processed video is not degraded.
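These figures are mutually consistent: if the per-frame cost of conventional processing is C, a group of 10 frames costs 1 × C + 9 × 0.4C = 4.6C instead of 10C, i.e., 46% of the original computation, which matches the reported reduction of about 54%.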
Fig. 8 is a block diagram illustrating a video processing apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 8, the video processing device 800 may include a video acquisition unit 810 and a video processing unit 820. Specifically, the video acquisition unit 810 may be configured to acquire a video frame group of the first video, wherein the video frame group is divided into a reference frame and a non-reference frame. The video processing unit 820 may be configured to: obtaining a characteristic output result of the deep neural network model aiming at a reference frame; obtaining the feature output result of the deep neural network model for a non-reference frame based on the feature output result of the deep neural network model for a reference frame; and obtaining a second video aiming at the characteristic output results of the reference frame and the non-reference frame based on the deep neural network model, wherein the resolution of the second video is greater than that of the first video.
Since the video processing method shown in fig. 4 can be performed by the video processing apparatus 800 shown in fig. 8, the video obtaining unit 810 performs operations corresponding to step S410 in fig. 4, and the video processing unit 820 performs operations corresponding to steps S420 to S440 in fig. 4, any relevant details related to the operations performed by the units in fig. 8 can be referred to in the corresponding description of fig. 4, and are not repeated here.
Further, it should be noted that although the video processing apparatus 800 is described above as being divided into units for respectively performing the corresponding processes, it is clear to those skilled in the art that the processes performed by the units described above may also be performed without any specific division of the units by the video processing apparatus 800 or without explicit demarcation between the units. Further, the video processing apparatus 800 may also include other units, such as a storage unit and the like.
Fig. 9 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Referring to fig. 9, an electronic device 900 may include at least one memory 901 and at least one processor 902. The at least one memory 901 stores computer-executable instructions that, when executed by the at least one processor 902, cause the at least one processor 902 to perform a video processing method according to embodiments of the present disclosure.
By way of example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The electronic device need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or instruction sets) either individually or jointly. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In an electronic device, a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor may execute instructions or code stored in the memory, which may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory may be integral to the processor, e.g., RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the memory may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform a video processing method according to an exemplary embodiment of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drives (HDD), solid-state drives (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card or an eXtreme Digital (XD) card), magnetic tape, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The instructions or computer program in the computer-readable storage medium described above may be run in an environment deployed in a computer apparatus, such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product including computer instructions which, when executed by a processor, implement a video processing method according to an exemplary embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A video processing method, comprising:
acquiring a video frame group of a first video, wherein the video frame group is divided into a reference frame and a non-reference frame;
obtaining a characteristic output result of the deep neural network model aiming at a reference frame;
obtaining the feature output result of the deep neural network model for a non-reference frame based on the feature output result of the deep neural network model for a reference frame;
and obtaining a second video aiming at the characteristic output results of the reference frame and the non-reference frame based on the deep neural network model, wherein the resolution of the second video is greater than that of the first video.
2. The video processing method according to claim 1, wherein said obtaining the feature output result of the deep neural network model for the non-reference frame based on the feature output result of the deep neural network model for the reference frame comprises:
obtaining the feature output result of the deep neural network model for the non-reference frame by: determining a feature change region and a non-feature change region of the non-reference frame compared with the reference frame; obtaining a first feature output result of the deep neural network model for the feature change region; obtaining a second feature output result of the deep neural network model for the non-feature change region based on the feature output result for the reference frame; and obtaining the feature output result of the deep neural network model for the non-reference frame based on the first feature output result and the second feature output result.
3. The video processing method of claim 2, wherein:
the obtaining of the feature output result of the deep neural network model for the reference frame comprises: obtaining a feature output result of the reference frame at each layer of the deep neural network model, wherein the deep neural network model comprises n layers, and n is a positive integer greater than 1;
the obtaining of the feature output result of the deep neural network model for the non-reference frame comprises: determining a feature change region and a non-feature change region of the non-reference frame, compared with the reference frame, at the (i-1)-th layer of the deep neural network model; performing a convolution operation on the feature change region at the i-th layer to obtain a first feature output result of the i-th layer for the feature change region; obtaining a second feature output result of the i-th layer for the non-feature change region based on the feature output result of the reference frame at the i-th layer; and obtaining the feature output result of the non-reference frame at the i-th layer based on the first feature output result and the second feature output result, wherein i is greater than or equal to 2 and less than or equal to n;
the obtaining of the second video based on the feature output results of the deep neural network model for the reference frame and the non-reference frame comprises: obtaining the second video based on the feature output results of the reference frame and the non-reference frame at the n-th layer of the deep neural network model.
4. The video processing method of claim 1, wherein said acquiring a video frame group of a first video comprises:
acquiring the first video, decoding the first video to obtain at least one video frame group, and dividing the video frames in each video frame group into reference frames and non-reference frames.
5. The video processing method of claim 3, wherein the determining of the feature change region and the non-feature change region of the non-reference frame, compared with the reference frame, at the (i-1)-th layer of the deep neural network model comprises:
determining the feature change region and the non-feature change region according to the feature output result of the non-reference frame at the (i-1)-th layer and the feature output result of the reference frame at the (i-1)-th layer.
6. The video processing method according to claim 5, wherein said determining the feature change region and the non-feature change region according to the feature output result of the non-reference frame at the (i-1)-th layer and the feature output result of the reference frame at the (i-1)-th layer comprises:
calculating a feature difference between the feature output result of the non-reference frame at the (i-1)-th layer and the feature output result of the reference frame at the (i-1)-th layer;
obtaining a single-channel binary feature based on the feature difference;
and determining the feature change region and the non-feature change region according to the obtained binary feature.
7. The video processing method according to claim 6,
wherein the performing of the convolution operation only on the feature change region at the i-th layer to obtain the first feature output result of the i-th layer for the feature change region comprises: performing sparse convolution on the feature difference with the binary feature at the i-th layer to obtain a feature output result of the i-th layer for the feature difference, and obtaining the first feature output result based on the binary feature and the feature output result for the feature difference;
wherein the obtaining of the second feature output result of the i-th layer for the non-feature change region based on the feature output result of the reference frame at the i-th layer comprises: obtaining the second feature output result based on the binary feature and the feature output result of the reference frame at the i-th layer;
and wherein the obtaining of the feature output result of the non-reference frame at the i-th layer based on the first feature output result and the second feature output result comprises: obtaining the feature output result of the non-reference frame at the i-th layer by adding the first feature output result and the second feature output result.
8. A video processing apparatus comprising:
a video acquisition unit configured to acquire a video frame group of a first video, wherein the video frame group is divided into a reference frame and a non-reference frame;
a video processing unit configured to:
obtain a feature output result of a deep neural network model for the reference frame;
obtain a feature output result of the deep neural network model for the non-reference frame based on the feature output result of the deep neural network model for the reference frame;
and obtain a second video based on the feature output results of the deep neural network model for the reference frame and the non-reference frame, wherein the resolution of the second video is greater than that of the first video.
9. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the video processing method of any of claims 1 to 7.
10. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the video processing method of any of claims 1 to 7.
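To help a technically inclined reader connect claims 1 to 7 to a concrete computation, the following is a minimal sketch of the claimed feature-reuse scheme, written in PyTorch. It is not the patented implementation: the module structure, the number of layers, the bias-free convolutions, the mean-absolute-difference threshold used to binarize the feature difference into a single-channel mask, and the pixel-shuffle upsampling head are all illustrative assumptions, and a dense convolution applied to a masked tensor stands in for a true sparse convolution. Non-linear activations are omitted so that the reference-plus-difference decomposition behind claim 7 holds exactly; with activations it becomes an approximation, particularly near region boundaries.

```python
# Minimal sketch of the feature-reuse scheme described in claims 1-7.
# Names, sizes and the binarization threshold are illustrative assumptions.
import torch
import torch.nn as nn


class SparseReuseSuperResolver(nn.Module):
    """Super-resolves a video frame group, reusing reference-frame features."""

    def __init__(self, channels=32, num_layers=4, scale=2, threshold=0.05):
        super().__init__()
        self.threshold = threshold
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        # The n layers of the model; bias-free so that
        # conv(ref + diff) == conv(ref) + conv(diff) holds exactly.
        self.layers = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1, bias=False)
             for _ in range(num_layers)]
        )
        # Upsampling tail producing the higher-resolution output frame.
        self.tail = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def change_mask(self, feat, ref_feat):
        # Claim 6: feature difference -> single-channel binary feature.
        diff = feat - ref_feat
        mask = (diff.abs().mean(dim=1, keepdim=True) > self.threshold).float()
        return mask, diff

    def forward(self, frames):
        # frames: (T, 3, H, W); frame 0 is treated as the reference frame.
        ref, non_refs = frames[:1], frames[1:]

        # Claim 3: run the reference frame through every layer and cache
        # its feature output result at each layer.
        ref_feats = [self.head(ref)]
        for layer in self.layers:
            ref_feats.append(layer(ref_feats[-1]))
        outputs = [self.tail(ref_feats[-1])]

        for frame in non_refs:
            feat = self.head(frame[None])  # non-reference features at layer 0
            for i, layer in enumerate(self.layers):
                mask, diff = self.change_mask(feat, ref_feats[i])
                # Claim 7: convolve only the changed region (a dense conv on a
                # masked tensor stands in for a real sparse convolution here,
                # and is only approximate at the mask boundary) ...
                first = mask * layer(mask * diff)
                # ... take the cached reference-frame output as the base; it is
                # the final value in the non-feature change region ...
                second = ref_feats[i + 1]
                # ... and add the two results to get the non-reference feature.
                feat = second + first
            outputs.append(self.tail(feat))

        # The concatenated outputs form the higher-resolution "second video".
        return torch.cat(outputs, dim=0)


if __name__ == "__main__":
    model = SparseReuseSuperResolver()
    group = torch.rand(5, 3, 64, 64)   # one decoded video frame group
    print(model(group).shape)          # torch.Size([5, 3, 128, 128])
```

The intended trade-off is visible in the inner loop: for every non-reference frame, the convolution cost at each layer scales with the area of the feature change region, while the non-feature change region is filled directly from the cached reference-frame features, which is presumably where the efficiency gain over running the full model on every frame comes from.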
CN202111465240.5A 2021-12-03 2021-12-03 Video processing method and device, electronic equipment and storage medium Pending CN114157895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111465240.5A CN114157895A (en) 2021-12-03 2021-12-03 Video processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114157895A true CN114157895A (en) 2022-03-08

Family

ID=80455965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111465240.5A Pending CN114157895A (en) 2021-12-03 2021-12-03 Video processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114157895A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956219A (en) * 2019-12-09 2020-04-03 北京迈格威科技有限公司 Video data processing method and device and electronic system
CN112700392A (en) * 2020-12-01 2021-04-23 华南理工大学 Video super-resolution processing method, device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116249018A (en) * 2023-05-11 2023-06-09 深圳比特微电子科技有限公司 Dynamic range compression method and device for image, electronic equipment and storage medium
CN116249018B (en) * 2023-05-11 2023-09-08 深圳比特微电子科技有限公司 Dynamic range compression method and device for image, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11871002B2 (en) Iterative techniques for encoding video content
US11270470B2 (en) Color leaking suppression in anchor point cloud compression
JP2020096342A (en) Video processing method and apparatus
CN110049336B (en) Video encoding method and video decoding method
JP2018501693A (en) How to create a video
CN112422561A (en) Content sharing method and device and method
CN114157895A (en) Video processing method and device, electronic equipment and storage medium
CN114268792A (en) Method and device for determining video transcoding scheme and method and device for video transcoding
CN113012073A (en) Training method and device for video quality improvement model
JP2023517486A (en) image rescaling
JP2013506379A (en) Combined scalar embedded graphics coding for color images
CN113194270B (en) Video processing method and device, electronic equipment and storage medium
US20110090956A1 (en) Compression method using adaptive field data selection
CN113747242B (en) Image processing method, image processing device, electronic equipment and storage medium
WO2021057464A1 (en) Video processing method and apparatus, and storage medium and electronic device
CN114155852A (en) Voice processing method and device, electronic equipment and storage medium
US20120263224A1 (en) Encoding digital assets as an image
CN116264606A (en) Method, apparatus and computer program product for processing video
CN113506219A (en) Training method and device for video super-resolution model
CN113411521B (en) Video processing method and device, electronic equipment and storage medium
US11647153B1 (en) Computer-implemented method, device, and computer program product
CN115086188B (en) Graph operation and maintenance playback method and device and electronic equipment
CN113225620B (en) Video processing method and video processing device
CN110662060B (en) Video encoding method and apparatus, video decoding method and apparatus, and storage medium
CN110572676B (en) Video encoding method and apparatus, video decoding method and apparatus, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination