CN111953971A - Video processing method, video processing device and terminal equipment - Google Patents

Video processing method, video processing device and terminal equipment

Info

Publication number
CN111953971A
CN111953971A (application number CN201910407602.1A)
Authority
CN
China
Prior art keywords
frame
video
layer
interpolated
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910407602.1A
Other languages
Chinese (zh)
Other versions
CN111953971B (en)
Inventor
樊顺利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan TCL Group Industrial Research Institute Co Ltd
Original Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan TCL Group Industrial Research Institute Co Ltd filed Critical Wuhan TCL Group Industrial Research Institute Co Ltd
Priority to CN201910407602.1A
Publication of CN111953971A
Application granted
Publication of CN111953971B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability

Abstract

The application is applicable to the technical field of video processing, and provides a video processing method, a video processing device, a terminal device and a computer readable storage medium, comprising the following steps: acquiring a video to be processed; decomposing the video to be processed into a plurality of video segments; coding I frames in each video segment through a first coding model based on deep learning; acquiring an R frame interpolated in each layer in each video clip and a context frame of the R frame interpolated in each layer in each video clip; and coding the R frames interpolated in each layer in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and a second coding model based on deep learning. The method and the device can solve the problems of relatively complex video coding process and relatively poor coding effect in the prior art.

Description

Video processing method, video processing device and terminal equipment
Technical Field
The present application belongs to the field of video processing technologies, and in particular, to a video processing method, a video processing apparatus, a terminal device, and a computer-readable storage medium.
Background
Video is an important information carrier: by exploiting the persistence of vision of the human eye, playing a series of pictures in rapid succession produces a sense of motion. However, raw video occupies a large amount of space and is not convenient to store or transmit. Video compression is therefore a key technology of video applications: it reduces the size of a video so that the video occupies less storage space. Traditional video encoding and decoding involve numerous modules that require a large amount of experience-based manual design and manual optimization, so the encoding process is complex and the encoding effect is poor.
Disclosure of Invention
In view of this, embodiments of the present application provide a video processing method, a video processing apparatus, a terminal device, and a computer readable storage medium, so as to solve the problems in the prior art that a video encoding process is complicated and an encoding effect is poor.
A first aspect of an embodiment of the present application provides a video processing method, where the video processing method includes:
acquiring a video to be processed;
decomposing the video to be processed into a plurality of video clips, wherein each video clip comprises an I frame and a plurality of R frames, the I frame is a key frame in each video clip, and the R frames are frames in each video clip except the I frame;
coding the I frame in each video segment through a first coding model based on deep learning;
acquiring an R frame interpolated in each layer in each video clip and a context frame of the R frame interpolated in each layer in each video clip;
and coding the R frames interpolated in each layer in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and a second coding model based on deep learning.
A second aspect of an embodiment of the present application provides a video processing apparatus, including:
the video acquisition module is used for acquiring a video to be processed;
the video decomposition module is used for decomposing the video to be processed into a plurality of video clips, wherein each video clip comprises an I frame and a plurality of R frames, the I frame is a key frame in each video clip, and the R frames are frames in each video clip except the I frame;
an I frame coding module, configured to code an I frame in each video segment through a first coding model based on deep learning;
an interpolated frame obtaining module, configured to obtain an R frame interpolated in each layer in each video segment and a context frame of the R frame interpolated in each layer in each video segment;
and the R frame coding module is used for coding the R frames interpolated in each layer in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and the second coding model based on deep learning.
A third aspect of embodiments of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the video processing method according to the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, which when executed by a processor implements the steps of the video processing method according to the first aspect.
A fifth aspect of the present application provides a computer program product comprising a computer program which, when executed by one or more processors, performs the steps of the video processing method as described in the first aspect above.
According to the above scheme, after the video to be processed is obtained, it is decomposed into a plurality of video segments; the I frame in each video segment is coded through the first coding model based on deep learning, and the R frames interpolated in each layer in each video segment are coded through a layered interpolation algorithm based on deep learning. Since no manual participation is needed when the I frame and the R frames in each video segment are coded, the video coding process is simplified and the video coding effect is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic flow chart of an implementation of a video processing method according to an embodiment of the present application;
FIG. 2 is an exemplary diagram of hierarchical interpolation;
FIG. 3 is a diagram of an example of a structure of a context model;
fig. 4 is a schematic flow chart of an implementation of a video processing method according to a second embodiment of the present application;
fig. 5 is a schematic diagram of a video processing apparatus according to a third embodiment of the present application;
fig. 6 is a schematic diagram of a terminal device according to a fourth embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
In particular implementations, the terminal devices described in embodiments of the present application include, but are not limited to, other portable devices such as mobile phones, laptop computers, or tablet computers having touch sensitive surfaces (e.g., touch screen displays and/or touch pads). It should also be understood that in some embodiments, the device is not a portable communication device, but is a desktop computer having a touch-sensitive surface (e.g., a touch screen display and/or touchpad).
In the discussion that follows, a terminal device that includes a display and a touch-sensitive surface is described. However, it should be understood that the terminal device may include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.
The terminal device supports various applications, such as one or more of the following: a drawing application, a presentation application, a word processing application, a website creation application, a disc burning application, a spreadsheet application, a gaming application, a telephone application, a video conferencing application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a web browsing application, a digital music player application, and/or a digital video player application.
Various applications that may be executed on the terminal device may use at least one common physical user interface device, such as a touch-sensitive surface. One or more functions of the touch-sensitive surface and corresponding information displayed on the terminal can be adjusted and/or changed between applications and/or within respective applications. In this way, a common physical architecture (e.g., touch-sensitive surface) of the terminal can support various applications with user interfaces that are intuitive and transparent to the user.
It should be understood that the sequence numbers of the steps in this embodiment do not imply an execution order; the execution order of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
Referring to fig. 1, which is a schematic flow chart of an implementation of a video processing method provided in an embodiment of the present application, as shown in the figure, the video processing method may include the following steps:
and step S101, acquiring a video to be processed.
Step S102, decomposing the video to be processed into a plurality of video segments.
Each video clip comprises an I frame and a plurality of R frames, wherein the I frame is a key frame in each video clip, and the R frames are frames in each video clip except the I frame.
In this embodiment of the present application, a video to be processed may be decomposed into video segments, i.e., Groups of Pictures (GOPs). Each video segment may include a preset number of video frames; the first frame of each video segment is the I frame of that video segment, and the remaining frames are its R frames. Each video segment represents a set of consecutive pictures: a video segment typically starts with an I frame and ends with the frame before the next I frame. Each frame represents a static picture. The I frame is a key frame whose picture is kept complete; when the I frame is decoded, decoding can be completed using only the data of the I frame itself, without the data of any other frame. The user can set the preset number according to actual requirements, which is not limited herein. For example, if the preset number is 12, each video segment includes 12 video frames, the first frame being an I frame and the remaining 11 frames being R frames.
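As a purely illustrative sketch (not part of the patent disclosure), the decomposition described above can be written in a few lines of Python; the group size of 12 and the I/R labelling follow the example in this paragraph, and the function and field names are invented for illustration:

def split_into_gops(frames, gop_size=12):
    """Split a list of decoded frames into GOPs of `gop_size` frames.

    In each GOP the first frame is treated as the I frame and the
    remaining frames as R frames, matching the example above.
    """
    gops = []
    for start in range(0, len(frames), gop_size):
        segment = frames[start:start + gop_size]
        gops.append({"I": segment[0], "R": segment[1:]})
    return gops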
And step S103, encoding the I frame in each video segment through a first coding model based on deep learning.
In the embodiment of the present application, the first coding model may be any video compression technology, such as a multi-scale neural network based on a convolutional neural network.
Step S104, obtaining the R frame interpolated in each layer of each video segment and the context frame of the R frame interpolated in each layer of each video segment.
In the embodiment of the present application, the R frame in each video segment is encoded by multilayer interpolation (for example, three-layer interpolation), and the R frame interpolated in each layer in the multilayer interpolation and the previous frame and the next frame of the R frame can be obtained for each video segment.
Optionally, each video segment includes N layers of interpolation, where N is an integer greater than 1, where the N layers of interpolation include first layer interpolation and non-first layer interpolation, and for the first layer interpolation, the obtaining of the R frame interpolated in each layer in each video segment and the context frame of the R frame interpolated in each layer in each video segment includes:
acquiring the R frame of the first-layer interpolation in the current video clip according to the I frame of the current video clip and the I frame of the next video clip, wherein the I frame of the current video clip and the I frame of the next video clip are context frames of the R frame of the first-layer interpolation in the current video clip;
for an mth layer interpolation in the non-first layer interpolation, where the mth layer interpolation is any one layer interpolation in the non-first layer interpolation, the obtaining the R frame interpolated in each layer in each video segment and the context frame interpolated in each layer in each video segment includes:
and acquiring the R frames of the interpolation of the M-th layer in the current video clip according to the R frames respectively corresponding to the interpolation of the previous M-1 layer in the current video clip, the I frame of the current video clip and the I frame of the next video clip, wherein the R frames of the interpolation of the previous M-1 layer in the current video clip, the I frame of the current video clip and the I frame of the next video clip are context frames of the R frames of the interpolation of the M-th layer in the current video clip, and M is an integer which is greater than 1 and less than or equal to N.
For the first-layer interpolation, the R frame arranged at the middle position between the I frame of the current video segment and the I frame of the next video segment is the R frame of the first-layer interpolation; the I frame of the current video segment, arranged before that R frame, is its upper (previous) frame, and the I frame of the next video segment, arranged after it, is its lower (next) frame.
For any layer of interpolation other than the first-layer interpolation, for example the M-th layer interpolation, the R frames corresponding to the previous M-1 layers of interpolation in the current video segment, the I frame of the current video segment, and the I frame of the next video segment may all be used as reference frames for the M-th layer interpolation. The number of reference frames is usually at least three, and one or two R frames of the M-th layer interpolation may be obtained from any two adjacent reference frames: when the number of frames between two adjacent reference frames is odd, one R frame of the M-th layer interpolation is obtained from those two reference frames, and when the number of frames between them is even, two R frames are obtained. Here, the previous M-1 layers of interpolation are the first-layer interpolation through the (M-1)-th layer interpolation; for example, when obtaining the R frames of the third-layer interpolation, the R frame of the first-layer interpolation and the R frames of the second-layer interpolation are used.
Specifically, the R frame arranged at the intermediate position between any two adjacent reference frames is an M-th layer interpolated R frame, a reference frame arranged before the M-th layer interpolated R frame in any two adjacent reference frames is an upper frame of the M-th layer interpolated R frame, and a reference frame arranged after the M-th layer interpolated R frame is a lower frame of the M-th layer interpolated R frame. For example, for any two adjacent reference frames P and Q, and the reference frame P is arranged before the reference frame Q, the R frame arranged at the middle position between the reference frames P and Q is an R frame interpolated at the mth layer, the reference frame P is an upper frame of the R frame, and the reference frame Q is a lower frame of the R frame.
As shown in fig. 2, which is an exemplary diagram of hierarchical interpolation, a video segment includes one I frame and 11 R frames. The two I frames in fig. 2 are the I frame of the current video segment and the I frame of the next video segment, respectively, and the frames numbered 2 to 12 are the 11 R frames in the current video segment. In the first layer of interpolation, the R frame numbered 7 in the current video segment is interpolated; its upper frame is the I frame of the current video segment and its lower frame is the I frame of the next video segment. In the second layer of interpolation, the R frames numbered 4 and 10 are interpolation-coded respectively: the upper frame of the R frame numbered 4 is the I frame of the current video segment and its lower frame is the R frame numbered 7, while the upper frame of the R frame numbered 10 is the R frame numbered 7 and its lower frame is the I frame of the next video segment. In the third layer of interpolation, the R frames numbered 2, 3, 5, 6, 8, 9, 11 and 12 are interpolation-coded respectively: the upper frames of the R frames numbered 2 and 3 are both the I frame of the current video segment and their lower frames are both the R frame numbered 4; the upper frames of the R frames numbered 5 and 6 are both the R frame numbered 4 and their lower frames are both the R frame numbered 7; the upper frames of the R frames numbered 8 and 9 are both the R frame numbered 7 and their lower frames are both the R frame numbered 10; and the upper frames of the R frames numbered 11 and 12 are both the R frame numbered 10 and their lower frames are both the I frame of the next video segment.
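For readers who want to reproduce the layer assignment of fig. 2, the following Python sketch computes, under the midpoint rule described above, which R frame indices belong to each interpolation layer and which two reference frames serve as their context; the helper name and the convention that frame 1 is the current I frame and frame gop_size + 1 stands for the I frame of the next segment are assumptions made here for illustration:

def interpolation_schedule(gop_size=12):
    """Return a list of (layer, frame_index, prev_ref, next_ref) tuples.

    Frame 1 is the I frame of the current GOP; frame `gop_size + 1`
    stands for the I frame of the next GOP.  Each layer interpolates the
    frame(s) lying midway between two already-available reference frames.
    """
    schedule = []
    refs = [1, gop_size + 1]          # reference frames available so far
    layer = 0
    while len(refs) < gop_size + 1:   # until every frame is covered
        layer += 1
        new_refs = []
        for prev, nxt in zip(refs, refs[1:]):
            gap = nxt - prev
            if gap <= 1:
                continue
            if gap % 2 == 0:          # odd number of frames in between: one midpoint
                mids = [prev + gap // 2]
            else:                     # even number in between: two midpoints
                mids = [prev + gap // 2, prev + gap // 2 + 1]
            for m in mids:
                schedule.append((layer, m, prev, nxt))
                new_refs.append(m)
        refs = sorted(refs + new_refs)
    return schedule

For gop_size = 12 this returns layer 1: frame 7; layer 2: frames 4 and 10; layer 3: frames 2, 3, 5, 6, 8, 9, 11 and 12, together with their upper and lower reference frames, matching the example of fig. 2.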
And step S105, coding the R frames interpolated in each layer in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and the second coding model based on the deep learning.
The second coding model based on deep learning includes, but is not limited to, a convolutional Long Short-Term Memory (ConvLSTM) network model, which not only has the temporal modeling capability of a Long Short-Term Memory (LSTM) network but can also capture local features like a convolutional neural network; that is, it has spatio-temporal characteristics.
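The convolutional LSTM itself is a standard building block rather than something defined by this application; the following PyTorch cell is a minimal generic sketch of it (channel counts and names are illustrative), showing how LSTM gating is computed with 2-D convolutions so that the hidden state keeps spatial structure:

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: LSTM gating computed with 2-D
    convolutions, so the state retains spatial structure."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.hidden_channels = hidden_channels
        # A single convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, (h, c)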
Optionally, the encoding, according to the context frame of the R frame interpolated in each layer in each video segment and the second coding model based on depth learning, the R frame interpolated in each layer in each video segment includes:
inputting the context frame of the R frame interpolated in each layer in each video segment into a context model based on deep learning, and outputting the context feature of the context frame of the R frame interpolated in each layer in each video segment;
inputting the interpolated R frames per layer in each video segment to the second coding model, and fusing the context features of the context frames of the interpolated R frames per layer in each video segment with the specified network layer of the second coding model to encode the interpolated R frames per layer in each video segment.
In the embodiment of the present application, the network structure of the context model may adopt a UNet network, which is a variant of a convolutional neural network and is named UNet because its structure resembles the letter U. The up-sampling part of the UNet network may use pixel_shuffle and 1×1 convolution: pixel_shuffle enlarges the feature resolution, and the 1×1 convolution adjusts the number of feature channels. The context model extracts the features of the upper frame and the lower frame, and the abstract features produced by the model are the context features. As shown in fig. 3, which is a structural example of the context model, four down-sampling layers and three up-sampling layers are adopted, and output1, output2 and output3 are the three outputs of the context model. In order to add the information of the context frames to the encoder (i.e., the second coding model), the features extracted by the context model can be fused into a network layer of the second coding model; the main difference between the layers in multi-layer interpolation coding is that the second coding model fuses the context features at different network layers, i.e., the R frames of different layers have different specified network layers in their corresponding second coding models.
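As an illustration of the up-sampling scheme just described (pixel_shuffle to enlarge resolution, then a 1×1 convolution to set the channel count), a possible PyTorch block could look as follows; the channel numbers, the skip-connection style and the class name are assumptions, since the patent only describes the operations at this level of detail:

import torch.nn as nn

class UpBlock(nn.Module):
    """One UNet-style up-sampling block: PixelShuffle enlarges the
    spatial resolution, a 1x1 convolution then sets the number of
    feature channels."""
    def __init__(self, in_channels, out_channels, upscale=2):
        super().__init__()
        self.shuffle = nn.PixelShuffle(upscale)  # C -> C / upscale^2, H, W -> H * upscale, W * upscale
        self.proj = nn.Conv2d(in_channels // (upscale ** 2), out_channels, kernel_size=1)

    def forward(self, x, skip=None):
        x = self.proj(self.shuffle(x))
        # Additive skip connection from the matching down-sampling layer (assumed).
        return x + skip if skip is not None else x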
Optionally, the embodiment of the present application further includes:
acquiring motion estimation information of the R frame interpolated in each layer in each video segment and the context frame of the R frame interpolated in each layer in each video segment;
mapping the context characteristics of the context frame of the R frame interpolated at each layer in each video segment according to the motion estimation information of the R frame interpolated at each layer in each video segment and the context frame of the R frame interpolated at each layer in each video segment;
correspondingly, the fusing the context features of the context frame of the R frame interpolated at each layer in each video segment with the specified network layer of the second coding model includes:
and fusing the context characteristics of the interpolated R frame of each layer in each mapped video segment with the specified network layer of the second coding model.
In the embodiment of the application, a PWC-Net network can be adopted to obtain the motion estimation information between the R frame interpolated at each layer and its context frames, and the motion estimation information is used to map (warp) the context features, thereby achieving the purpose of motion compensation. In addition, in the stage of training the context model, the motion information corresponding to the coordinates of the R frame of each layer of interpolation may be used to map the context features; when the context model is used, the features obtained by each up-sampling layer of the context model are subjected to the mapping operation before being fused with the feature maps in the coding and decoding models. PWC-Net is a compact and efficient convolutional neural network model for optical flow estimation that follows simple and well-established principles: image pyramids, warping, and the use of a cost volume.
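The mapping of context features by motion information can be realized with a standard flow-based warping operation. The sketch below assumes the flow field comes from an external optical-flow network such as PWC-Net (not reproduced here) and uses torch.nn.functional.grid_sample; the function name and the pixel-unit flow convention are assumptions:

import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a feature map `feat` (B, C, H, W) with a dense flow field
    `flow` (B, 2, H, W) given in pixels, as a stand-in for the motion
    compensation mapping described above."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalise sampling positions to [-1, 1] as grid_sample expects.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(feat, grid, align_corners=True)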
After the video to be processed is acquired, it is first decomposed into a plurality of video segments; the I frame in each video segment is then coded through the first coding model based on deep learning, and the R frames interpolated in each layer in each video segment are coded through a layered interpolation algorithm based on deep learning. Since no manual participation is needed when the I frame and the R frames in each video segment are coded, the video coding process is simplified and the video coding effect is improved.
Referring to fig. 4, which is a schematic flow chart of an implementation of a video processing method provided in the second embodiment of the present application, as shown in the figure, the video processing method may include the following steps:
step S401, acquiring a video to be processed.
The step is the same as step S101, and reference may be made to the related description of step S101, which is not repeated herein.
Step S402, decomposing the video to be processed into a plurality of video segments.
Each video clip comprises an I frame and a plurality of R frames, wherein the I frame is a key frame in each video clip, and the R frames are frames in each video clip except the I frame.
The step is the same as step S102, and reference may be made to the related description of step S102, which is not repeated herein.
And step S403, encoding the I frame in each video segment through a first coding model based on deep learning.
The step is the same as step S103, and reference may be made to the related description of step S103, which is not described herein again.
Step S404, obtaining the interpolated R frame of each layer in each video segment and the context frame of the interpolated R frame of each layer in each video segment.
The step is the same as step S104, and reference may be made to the related description of step S104, which is not repeated herein.
Step S405, coding the R frame interpolated in each layer of each video segment according to the context frame of the R frame interpolated in each layer of each video segment and the second coding model based on the deep learning.
The step is the same as step S105, and reference may be made to the related description of step S105, which is not repeated herein.
Optionally, after the I frame and the R frame interpolated in each layer in each video segment are encoded, the method further includes:
performing quantization operation on the I frame after being coded and the R frame after being coded and interpolated in each layer in each video clip;
and entropy coding and decoding the quantized I frame and the quantized R frame interpolated in each layer in each video segment.
In the embodiment of the present application, the quantization operation may be quantization coding based on deep learning: using a 1×1 convolutional network and a sign function, the information produced by the first coding model and the second coding model is quantized to -1 and 1, so as to reduce information redundancy. Entropy coding may be entropy coding based on deep learning, which mainly compresses the information of the quantized I frames and R frames; entropy coding is lossless compression, in which probability estimation is first performed on the quantized binary numbers by using a PixelCNN network, and coding is then carried out by arithmetic coding. For the loss function, a combination of the L1 loss function and the MS-SSIM loss function may be used, i.e., loss = loss_L1 × α + loss_MS-SSIM × (1 - α), where loss is the loss function, loss_L1 is the L1 loss function, loss_MS-SSIM is the MS-SSIM loss function, and α is a fusion coefficient taking a value between 0 and 1, for example α = 0.2.
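A minimal sketch of the quantizer and loss described above could look as follows; the straight-through gradient handling needed to train through the sign function is omitted, the MS-SSIM implementation is assumed to be supplied externally (for example by the pytorch_msssim package), and loss_MS-SSIM is taken here as 1 - MS-SSIM so that it behaves as a loss term:

import torch
import torch.nn as nn

class SignQuantizer(nn.Module):
    """Map encoder features to {-1, +1} with a 1x1 convolution followed
    by a sign function, as described above (straight-through gradient
    handling, needed for training, is omitted from this sketch)."""
    def __init__(self, channels, code_channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, code_channels, kernel_size=1)

    def forward(self, x):
        return torch.sign(self.proj(x))

def reconstruction_loss(pred, target, ms_ssim_fn, alpha=0.2):
    """loss = alpha * L1 + (1 - alpha) * (1 - MS-SSIM); `ms_ssim_fn` is an
    assumed external similarity function returning values in [0, 1]."""
    l1 = torch.mean(torch.abs(pred - target))
    return alpha * l1 + (1.0 - alpha) * (1.0 - ms_ssim_fn(pred, target))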
Step S406, decoding the I frame encoded in each video segment through the first decoding model based on deep learning.
In the embodiment of the present application, the first decoding model may be any video decoding technology, such as a multi-scale neural network based on a convolutional neural network.
Step S407, decoding the interpolated R frame of each layer encoded in each video segment according to the context frame of the interpolated R frame of each layer in each video segment and the second decoding model based on depth learning.
Wherein, the second decoding model based on deep learning includes, but is not limited to, a convolution long-time and short-time memory network model.
In the embodiment of the present application, the encoded R frames interpolated per layer in each video segment are input to the second decoding model, and the context features of the context frames interpolated per layer are fused with the specified network layer of the second decoding model, so as to decode the encoded R frames interpolated per layer. Wherein, in the decoding stage, different reconstruction times can be selected for decoding.
In order to add the information of the context frame to the decoder (i.e. the second decoding model), the features extracted by the context model can be fused into the network layer of the second decoding model, and the difference in multi-layer interpolation decoding is mainly that the second decoding model fuses the context features in different network layers, i.e. the R frames of different layers have different assigned network layers in their corresponding second decoding models.
It should be noted that, in order to facilitate the second coding model and the second decoding model to compare the interpolated R frame of each layer with the context frame of the R frame, the R frame and the context frame of the R frame may be fused at a channel level to form a multi-channel (e.g. 9-channel) image, and the multi-channel (e.g. 9-channel) image is fed into the second coding model and the second decoding model.
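A sketch of this channel-level fusion is given below; each frame is assumed to be an RGB tensor of shape (B, 3, H, W), so concatenation yields the 9-channel input mentioned above, and the ordering (previous, current, next) is an assumption:

import torch

def fuse_with_context(r_frame, prev_frame, next_frame):
    """Concatenate the R frame with its context frames channel-wise.

    Each input is a (B, 3, H, W) RGB tensor, so the result has
    3 + 3 + 3 = 9 channels, matching the 9-channel example above."""
    return torch.cat([prev_frame, r_frame, next_frame], dim=1)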
Taking the R frame of the first-layer interpolation as an example, table 1 shows the second coding model and the quantization model, and table 2 shows the second decoding model. B, C, H and W in tables 1 and 2 represent the batch size, the number of channels, the height and the width, respectively, and tables 1 and 2 give the operations for the first-layer interpolation. For the second-layer interpolation, output2 in fig. 3 is fused at the rnn2 layer in the encoding stage, and output2 is fused at the rnn3 layer in the decoding stage; for the third-layer interpolation, output1 is fused at the rnn3 layer in the encoding stage and at the rnn2 layer in the decoding stage, the rest of the operations being the same as for the first-layer interpolation.
TABLE 1 second coding model and quantization model
(Table 1 is presented as an image in the original publication.)
TABLE 2 second decoding model
(Table 2 is presented as an image in the original publication.)
Optionally, after the video to be processed is decomposed into a plurality of video segments, the method further includes:
dividing each frame in each video segment into a plurality of image blocks;
after decoding the I frame and the interpolated R frame of each layer in each video segment, further comprising:
and merging the image blocks of the decoded I frame and the decoded R frame interpolated in each layer in each video clip.
In the embodiment of the present application, in order to reduce computational resources occupied by encoding and decoding a picture (i.e., a frame), each frame may be divided into a plurality of image blocks of a preset size, and meanwhile, in order to weaken a boundary influence caused by a padding operation in a subsequent convolution, a boundary of each image block is extended by N pixels, and the image blocks are merged after decoding. Where N is an integer greater than 1, such as 16. The user may set the preset size according to actual needs, which is not limited herein, for example, the preset size is 352 × 288.
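The block splitting and merging can be sketched as follows; the block size of 352 × 288 and the border of 16 pixels follow the examples in this paragraph, while the reflection padding at image boundaries and the function names are assumptions:

import torch
import torch.nn.functional as F

def split_into_blocks(frame, block_h=288, block_w=352, border=16):
    """Cut a (C, H, W) frame into blocks of block_h x block_w, each
    extended by `border` pixels on every side (reflection padding at the
    image boundary is an assumption of this sketch)."""
    c, h, w = frame.shape
    padded = F.pad(frame.unsqueeze(0), (border, border, border, border), mode="reflect")[0]
    blocks, positions = [], []
    for top in range(0, h, block_h):
        for left in range(0, w, block_w):
            blocks.append(padded[:, top:top + block_h + 2 * border,
                                    left:left + block_w + 2 * border])
            positions.append((top, left))
    return blocks, positions

def merge_blocks(blocks, positions, h, w, border=16):
    """Re-assemble decoded blocks into a (C, h, w) frame, dropping the
    extended borders of each block."""
    c = blocks[0].shape[0]
    out = torch.zeros(c, h, w, dtype=blocks[0].dtype)
    for block, (top, left) in zip(blocks, positions):
        inner = block[:, border:block.shape[1] - border, border:block.shape[2] - border]
        out[:, top:top + inner.shape[1], left:left + inner.shape[2]] = inner
    return out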
When the encoding and decoding operation is performed on the R frame of each layer of interpolation in each video segment, the embodiment of the application is based on the deep learning network, and can perform joint training on each deep learning network without excessive artificial design and optimization, thereby simplifying the video encoding and decoding process, improving the video encoding and decoding effect and saving the video transmission bandwidth.
Fig. 5 is a schematic diagram of a video processing apparatus provided in the third embodiment of the present application, and for convenience of description, only the relevant portions of the third embodiment of the present application are shown.
The video processing apparatus includes:
a video obtaining module 51, configured to obtain a video to be processed;
a video decomposition module 52, configured to decompose the video to be processed into a plurality of video segments, where each video segment includes an I frame and a plurality of R frames, the I frame is a key frame in each video segment, and the R frame is a frame other than the I frame in each video segment;
an I-frame encoding module 53, configured to encode an I-frame in each video segment through a first coding model based on deep learning;
an interpolated frame obtaining module 54, configured to obtain an R frame interpolated in each layer in each video segment and a context frame of the R frame interpolated in each layer in each video segment;
and an R frame coding module 55, configured to code the R frame interpolated in each layer in each video segment according to the context frame of the R frame interpolated in each layer in each video segment and the second coding model based on deep learning.
Optionally, each video segment includes N layers of interpolation, where N is an integer greater than 1, the N layers of interpolation include first layer interpolation and non-first layer interpolation, and the interpolated frame obtaining module 54 includes:
a first obtaining unit, configured to obtain, for the first-layer interpolation, an R frame of the first-layer interpolation in a current video segment according to an I frame of the current video segment and an I frame of a next video segment, where the I frame of the current video segment and the I frame of the next video segment are context frames of the R frame of the first-layer interpolation in the current video segment;
a second obtaining unit, configured to, for an mth layer interpolation in the non-top layer interpolation, where the mth layer interpolation is any one layer interpolation in the non-top layer interpolation, obtain, according to an R frame corresponding to a previous M-1 layer interpolation in the current video segment, an I frame of the current video segment, and an I frame of the next video segment, an R frame of the mth layer interpolation in the current video segment, where the R frame of the previous M-1 layer interpolation in the current video segment, the I frame of the current video segment, and the I frame of the next video segment are context frames of the R frame of the mth layer interpolation in the current video segment, and M is an integer greater than 1 and less than or equal to N.
Optionally, the R frame encoding module 55 includes:
a feature output unit, configured to input the context frame of the R frame interpolated per layer in each video segment to a context model based on deep learning, and output the context feature of the context frame of the R frame interpolated per layer in each video segment;
a processing unit, configured to input the R frames interpolated per layer in each video segment to the second coding model, and fuse the context features of the context frames of the R frames interpolated per layer in each video segment with a specified network layer of the second coding model, so as to encode the R frames interpolated per layer in each video segment.
Optionally, the video processing apparatus further includes:
an information obtaining module, configured to obtain motion estimation information of an R frame interpolated in each layer in each video segment and a context frame of the R frame interpolated in each layer in each video segment;
a feature mapping module, configured to map, according to motion estimation information of the R frame interpolated in each layer in each video segment and the context frame of the R frame interpolated in each layer in each video segment, context features of the context frame of the R frame interpolated in each layer in each video segment;
correspondingly, the processing unit is specifically configured to:
and fusing the context characteristics of the interpolated R frame of each layer in each mapped video segment with the specified network layer of the second coding model.
Optionally, the video processing apparatus further includes:
an I frame decoding module, configured to decode the I frame encoded in each video segment through a first decoding model based on deep learning;
and the R frame decoding module is used for decoding the R frames interpolated in each layer coded in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and a second decoding model based on deep learning.
Optionally, the video processing apparatus further includes:
an image block dividing module for dividing each frame in each video segment into a plurality of image blocks;
and the image block merging module is used for merging the image blocks of the decoded I frame and the decoded R frame of each layer of interpolation in each video segment.
Optionally, the video processing apparatus further includes:
the quantization module is used for performing quantization operation on the I frame after being coded and the R frame after being coded and interpolated in each layer in each video segment;
and the entropy coding and decoding module is used for carrying out entropy coding and decoding on the quantized I frame and the quantized R frame interpolated in each layer in each video segment.
The apparatus provided in the embodiment of the present application may be applied to the first method embodiment and the second method embodiment, and for details, reference is made to the description of the first method embodiment and the second method embodiment, and details are not repeated here.
Fig. 6 is a schematic diagram of a terminal device according to a fourth embodiment of the present application. As shown in fig. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61 and a computer program 62 stored in said memory 61 and executable on said processor 60. The processor 60, when executing the computer program 62, implements the steps in the various video processing method embodiments described above. Alternatively, the processor 60 implements the functions of the modules/units in the above-described device embodiments when executing the computer program 62.
Illustratively, the computer program 62 may be partitioned into one or more modules/units that are stored in the memory 61 and executed by the processor 60 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 62 in the terminal device 6. For example, the computer program 62 may be divided into a video acquisition module, a video decomposition module, an I-frame encoding module, an interpolation frame acquisition module, an R-frame encoding module, an information acquisition module, a feature mapping module, an I-frame decoding module, an R-frame decoding module, an image block division module, an image block merging module, a quantization module, and an entropy encoding and decoding module, each module having the following specific functions:
the video acquisition module is used for acquiring a video to be processed;
the video decomposition module is used for decomposing the video to be processed into a plurality of video clips, wherein each video clip comprises an I frame and a plurality of R frames, the I frame is a key frame in each video clip, and the R frames are frames in each video clip except the I frame;
an I frame coding module, configured to code an I frame in each video segment through a first coding model based on deep learning;
an interpolated frame obtaining module, configured to obtain an R frame interpolated in each layer in each video segment and a context frame of the R frame interpolated in each layer in each video segment;
and the R frame coding module is used for coding the R frames interpolated in each layer in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and the second coding model based on deep learning.
Optionally, each video segment includes N layers of interpolation, where N is an integer greater than 1, the N layers of interpolation include first layer interpolation and non-first layer interpolation, and the interpolation frame obtaining module includes:
a first obtaining unit, configured to obtain, for the first-layer interpolation, an R frame of the first-layer interpolation in a current video segment according to an I frame of the current video segment and an I frame of a next video segment, where the I frame of the current video segment and the I frame of the next video segment are context frames of the R frame of the first-layer interpolation in the current video segment;
a second obtaining unit, configured to, for an mth layer interpolation in the non-top layer interpolation, where the mth layer interpolation is any one layer interpolation in the non-top layer interpolation, obtain, according to an R frame corresponding to a previous M-1 layer interpolation in the current video segment, an I frame of the current video segment, and an I frame of the next video segment, an R frame of the mth layer interpolation in the current video segment, where the R frame of the previous M-1 layer interpolation in the current video segment, the I frame of the current video segment, and the I frame of the next video segment are context frames of the R frame of the mth layer interpolation in the current video segment, and M is an integer greater than 1 and less than or equal to N.
Optionally, the R frame encoding module includes:
a feature output unit, configured to input the context frame of the R frame interpolated per layer in each video segment to a context model based on deep learning, and output the context feature of the context frame of the R frame interpolated per layer in each video segment;
a processing unit, configured to input the R frames interpolated per layer in each video segment to the second coding model, and fuse the context features of the context frames of the R frames interpolated per layer in each video segment with a specified network layer of the second coding model, so as to encode the R frames interpolated per layer in each video segment.
Optionally, the information obtaining module is configured to obtain motion estimation information of an R frame interpolated in each layer in each video segment and a context frame of the R frame interpolated in each layer in each video segment;
a feature mapping module, configured to map, according to motion estimation information of the R frame interpolated in each layer in each video segment and the context frame of the R frame interpolated in each layer in each video segment, context features of the context frame of the R frame interpolated in each layer in each video segment;
correspondingly, the processing unit is specifically configured to:
and fusing the context characteristics of the interpolated R frame of each layer in each mapped video segment with the specified network layer of the second coding model.
Optionally, the I-frame decoding module is configured to decode, through a first decoding model based on deep learning, an I-frame encoded in each video segment;
and the R frame decoding module is used for decoding the R frames interpolated in each layer coded in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and a second decoding model based on deep learning.
Optionally, the image block dividing module is configured to divide each frame in each video segment into a plurality of image blocks;
and the image block merging module is used for merging the image blocks of the decoded I frame and the decoded R frame of each layer of interpolation in each video segment.
Optionally, the quantization module is configured to perform quantization operation on the I frame after encoding and the R frame after encoding for each layer of interpolation in each video segment;
and the entropy coding and decoding module is used for carrying out entropy coding and decoding on the quantized I frame and the quantized R frame interpolated in each layer in each video segment.
The terminal device 6 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 60, a memory 61. Those skilled in the art will appreciate that fig. 6 is merely an example of a terminal device 6 and does not constitute a limitation of terminal device 6 and may include more or less components than those shown, or some components in combination, or different components, for example, the terminal device may also include input output devices, network access devices, buses, etc.
The Processor 60 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used for storing the computer program and other programs and data required by the terminal device. The memory 61 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A video processing method, characterized in that the video processing method comprises:
acquiring a video to be processed;
decomposing the video to be processed into a plurality of video clips, wherein each video clip comprises an I frame and a plurality of R frames, the I frame is a key frame in each video clip, and the R frames are frames in each video clip except the I frame;
coding the I frame in each video segment through a first coding model based on deep learning;
acquiring an R frame interpolated in each layer in each video clip and a context frame of the R frame interpolated in each layer in each video clip;
and coding the R frames interpolated in each layer in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and a second coding model based on deep learning.
2. The video processing method according to claim 1, wherein each video segment includes N layers of interpolation, N being an integer greater than 1, the N layers of interpolation including a first-layer interpolation and non-first-layer interpolations, and the acquiring the R frame interpolated in each layer in each video segment and the context frame of the R frame interpolated in each layer in each video segment comprises, for the first-layer interpolation:
acquiring the R frame of the first-layer interpolation in the current video segment according to the I frame of the current video segment and the I frame of the next video segment, wherein the I frame of the current video segment and the I frame of the next video segment are the context frames of the R frame of the first-layer interpolation in the current video segment;
and, for an M-th layer interpolation among the non-first-layer interpolations, where the M-th layer interpolation is any one of the non-first-layer interpolations, the acquiring the R frame interpolated in each layer in each video segment and the context frame of the R frame interpolated in each layer in each video segment comprises:
acquiring the R frames of the M-th layer interpolation in the current video segment according to the R frames of the previous M-1 layers of interpolation in the current video segment, the I frame of the current video segment and the I frame of the next video segment, wherein the R frames of the previous M-1 layers of interpolation in the current video segment, the I frame of the current video segment and the I frame of the next video segment are the context frames of the R frames of the M-th layer interpolation in the current video segment, and M is an integer greater than 1 and less than or equal to N.
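A minimal sketch of the layer-by-layer interpolation order described in claim 2, assuming the segment length is a power of two and that each layer interpolates the midpoints of the gaps left by the previous layers (a hierarchical-B style order); the function name and the returned dictionary layout are illustrative assumptions.

```python
def interpolation_schedule(segment_length):
    """Return one entry per interpolation layer: the R-frame positions
    interpolated at that layer, and the positions available as context,
    i.e. both I frames plus the R frames of all earlier layers.  Position 0
    is the I frame of the current segment and position `segment_length` is
    the I frame of the next segment."""
    schedule = []
    context = [0, segment_length]        # the two I frames
    intervals = [(0, segment_length)]    # gaps still to be filled
    while intervals:
        layer, next_intervals = [], []
        for left, right in intervals:
            if right - left <= 1:        # no frame left between the endpoints
                continue
            mid = (left + right) // 2
            layer.append(mid)
            next_intervals += [(left, mid), (mid, right)]
        if not layer:
            break
        schedule.append({"interpolated": sorted(layer), "context": sorted(context)})
        context += layer
        intervals = next_intervals
    return schedule


# For an 8-frame segment: layer 1 -> frame 4, layer 2 -> frames 2 and 6,
# layer 3 -> frames 1, 3, 5 and 7, with the context growing at every layer.
for depth, layer in enumerate(interpolation_schedule(8), start=1):
    print(depth, layer)
```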
3. The video processing method according to claim 1, wherein the coding the R frames interpolated in each layer in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and the second coding model based on deep learning comprises:
inputting the context frames of the R frames interpolated in each layer in each video segment into a context model based on deep learning, and outputting the context features of the context frames of the R frames interpolated in each layer in each video segment; and
inputting the R frames interpolated in each layer in each video segment into the second coding model, and fusing the context features of the context frames of the R frames interpolated in each layer in each video segment with a specified network layer of the second coding model so as to code the R frames interpolated in each layer in each video segment.
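A toy PyTorch sketch of the fusion step in claim 3: a small context model turns the stacked context frames into a feature map, which is then concatenated into one designated intermediate layer of a small R-frame encoder. The layer sizes, the channel counts and the use of concatenation as the fusion operation are assumptions for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn

class ContextModel(nn.Module):
    """Toy context model: maps the two stacked RGB context frames to features."""
    def __init__(self, in_channels=6, features=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, features, 3, padding=1), nn.ReLU(),
            nn.Conv2d(features, features, 3, padding=1), nn.ReLU(),
        )

    def forward(self, context_frames):
        return self.net(context_frames)

class RFrameEncoder(nn.Module):
    """Toy second coding model: context features are fused (concatenated)
    at a designated intermediate layer before the latent is produced."""
    def __init__(self, features=32, latent=16):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, features, 3, padding=1), nn.ReLU())
        self.tail = nn.Sequential(
            nn.Conv2d(features * 2, features, 3, padding=1), nn.ReLU(),
            nn.Conv2d(features, latent, 3, stride=2, padding=1),
        )

    def forward(self, r_frame, context_features):
        x = self.head(r_frame)
        x = torch.cat([x, context_features], dim=1)  # fusion at the designated layer
        return self.tail(x)

# Toy usage: one 64x64 R frame with its two context frames stacked channel-wise.
ctx = torch.rand(1, 6, 64, 64)
r_frame = torch.rand(1, 3, 64, 64)
latent = RFrameEncoder()(r_frame, ContextModel()(ctx))
print(latent.shape)  # torch.Size([1, 16, 32, 32])
```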
4. The video processing method of claim 3, wherein the video processing method further comprises:
acquiring motion estimation information between the R frame interpolated in each layer in each video segment and the context frame of the R frame interpolated in each layer in each video segment;
mapping the context features of the context frame of the R frame interpolated in each layer in each video segment according to the motion estimation information between the R frame interpolated in each layer in each video segment and the context frame of the R frame interpolated in each layer in each video segment;
correspondingly, the fusing the context features of the context frame of the R frame interpolated in each layer in each video segment with the specified network layer of the second coding model comprises:
fusing the mapped context features of the context frame of the R frame interpolated in each layer in each video segment with the specified network layer of the second coding model.
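A sketch of the mapping step in claim 4, assuming the motion estimation information is a dense per-pixel displacement field and the mapping is bilinear warping of the context features; both choices are illustrative (requires PyTorch 1.10 or later for the `indexing` argument of `torch.meshgrid`).

```python
import torch
import torch.nn.functional as F

def warp_features(features, flow):
    """Map (warp) context features toward the R frame using a dense motion
    field.  `features` is (N, C, H, W); `flow` is (N, 2, H, W) holding the
    per-pixel displacements (dx, dy) in pixels."""
    n, _, h, w = features.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    coords = base + flow
    # normalize sampling coordinates to [-1, 1] as required by grid_sample
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(features, grid, align_corners=True)

# Sanity check: zero motion leaves the context features unchanged.
feat = torch.rand(1, 32, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
assert torch.allclose(warp_features(feat, flow), feat, atol=1e-5)
```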
5. The video processing method of claim 1, further comprising, after coding the I frame and the R frames interpolated in each layer in each video segment:
decoding the coded I frame in each video segment through a first decoding model based on deep learning;
and decoding the coded R frames interpolated in each layer in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and a second decoding model based on deep learning.
6. The video processing method of claim 5, further comprising, after decomposing the video to be processed into a plurality of video segments:
dividing each frame in each video segment into a plurality of image blocks;
and further comprising, after decoding the I frame and the R frames interpolated in each layer in each video segment:
merging the image blocks of the decoded I frame and of the decoded R frames interpolated in each layer in each video segment.
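A sketch of the block division and merging of claim 6, assuming square blocks and edge padding so that every frame tiles exactly; the block size and padding mode are illustrative choices.

```python
import numpy as np

def split_into_blocks(frame, block=64):
    """Divide one (H, W, C) frame into `block`-sized image blocks, padding
    the bottom/right edges so the frame tiles exactly."""
    h, w, _ = frame.shape
    pad_h, pad_w = (-h) % block, (-w) % block
    padded = np.pad(frame, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")
    rows, cols = padded.shape[0] // block, padded.shape[1] // block
    blocks = [padded[r * block:(r + 1) * block, c * block:(c + 1) * block]
              for r in range(rows) for c in range(cols)]
    return blocks, (rows, cols), (h, w)

def merge_blocks(blocks, grid, original_size, block=64):
    """Reassemble decoded blocks into a full frame and crop away the padding."""
    rows, cols = grid
    h, w = original_size
    out = np.zeros((rows * block, cols * block, blocks[0].shape[2]), blocks[0].dtype)
    for i, b in enumerate(blocks):
        r, c = divmod(i, cols)
        out[r * block:(r + 1) * block, c * block:(c + 1) * block] = b
    return out[:h, :w]

# Round trip on a toy frame whose size is not a multiple of the block size.
frame = np.random.randint(0, 256, (135, 240, 3), dtype=np.uint8)
blocks, grid, size = split_into_blocks(frame)
assert np.array_equal(merge_blocks(blocks, grid, size), frame)
```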
7. The video processing method of claim 1, further comprising, after coding the I frame and the R frames interpolated in each layer in each video segment:
performing a quantization operation on the coded I frame and the coded R frames interpolated in each layer in each video segment; and
entropy encoding and decoding the quantized I frame and the quantized R frames interpolated in each layer in each video segment.
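A sketch of the quantization and entropy coding step of claim 7, assuming uniform scalar quantization of the coded latents and using zlib purely as a stand-in entropy coder; a learned codec would typically use arithmetic coding driven by a learned probability model instead.

```python
import numpy as np
import zlib

def quantize(latent, step=0.1):
    """Uniform scalar quantization of an encoder output."""
    return np.round(latent / step).astype(np.int32)

def entropy_encode(symbols):
    """Losslessly pack the quantized symbols into a bitstream (zlib stand-in)."""
    return zlib.compress(symbols.tobytes())

def entropy_decode(bitstream, shape, dtype=np.int32):
    """Recover the quantized symbols from the bitstream."""
    return np.frombuffer(zlib.decompress(bitstream), dtype=dtype).reshape(shape)

# Round trip on a toy latent tensor.
latent = np.random.randn(16, 32, 32).astype(np.float32)
q = quantize(latent)
bits = entropy_encode(q)
assert np.array_equal(entropy_decode(bits, q.shape), q)
print(len(bits), "bytes after quantization and entropy coding")
```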
8. A video processing apparatus, characterized in that the video processing apparatus comprises:
the video acquisition module is used for acquiring a video to be processed;
the video decomposition module is used for decomposing the video to be processed into a plurality of video segments, wherein each video segment comprises an I frame and a plurality of R frames, the I frame being a key frame in the video segment and the R frames being the frames in the video segment other than the I frame;
an I frame coding module, configured to code an I frame in each video segment through a first coding model based on deep learning;
an interpolated frame obtaining module, configured to obtain an R frame interpolated in each layer in each video segment and a context frame of the R frame interpolated in each layer in each video segment;
and the R frame coding module is used for coding the R frames interpolated in each layer in each video segment according to the context frames of the R frames interpolated in each layer in each video segment and a second coding model based on deep learning.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the video processing method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the video processing method according to any one of claims 1 to 7.
CN201910407602.1A 2019-05-16 2019-05-16 Video processing method, video processing device and terminal equipment Active CN111953971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910407602.1A CN111953971B (en) 2019-05-16 2019-05-16 Video processing method, video processing device and terminal equipment

Publications (2)

Publication Number Publication Date
CN111953971A true CN111953971A (en) 2020-11-17
CN111953971B CN111953971B (en) 2023-03-14

Family

ID=73335761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910407602.1A Active CN111953971B (en) 2019-05-16 2019-05-16 Video processing method, video processing device and terminal equipment

Country Status (1)

Country Link
CN (1) CN111953971B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106658019A (en) * 2015-10-31 2017-05-10 华为技术有限公司 Coding and decoding method and device for reference frame
CN107820085A * 2017-10-31 2018-03-20 杭州电子科技大学 Method for improving video compression coding efficiency based on deep learning
CN109151476A * 2018-09-21 2019-01-04 北京大学 Reference frame generation method and device based on bi-directionally predicted B frame images
US20190132591A1 (en) * 2017-10-26 2019-05-02 Intel Corporation Deep learning based quantization parameter estimation for video encoding

Also Published As

Publication number Publication date
CN111953971B (en) 2023-03-14

Similar Documents

Publication Publication Date Title
US20220292730A1 (en) Method and apparatus for haar-based point cloud coding
US20050196056A1 (en) Image coding and decoding method
CN104956671A (en) Video frame reconstruction
US20230306701A1 (en) Parallel approach to dynamic mesh alignment
CN111953971B (en) Video processing method, video processing device and terminal equipment
US20240078713A1 (en) Texture coordinate prediction in mesh compression
US20230308669A1 (en) Predictive coding of boundary uv information for mesh compression
US20230334714A1 (en) Coding of boundary uv2xyz index for mesh compression
US20230334713A1 (en) On coding of boundary uv2xyz index for mesh compression
US11727536B2 (en) Method and apparatus for geometric smoothing
US20240104783A1 (en) Multiple attribute maps merging
US20230306649A1 (en) Predictive coding of boundary uv2xyz index for mesh compression
US20230306647A1 (en) Geometry filtering for mesh compression
WO2024086314A2 (en) Texture coordinate compression using chart partition
CN113965756A (en) Image coding method, storage medium and terminal equipment
CN116320395A (en) Image processing method, device, electronic equipment and readable storage medium
JP2023533423A (en) Non-Binary Occupancy Maps for Video-Based Point Cloud Coding
CN116848553A (en) Method for dynamic grid compression based on two-dimensional UV atlas sampling
CN117396923A (en) Triangularization method using boundary information for dynamic mesh compression
CN114449285A (en) Video coding and decoding method and related equipment
WO2024081393A1 (en) Adaptive geometry filtering for mesh compression
CN113965755A (en) Image coding method, storage medium and terminal equipment
CN117611953A (en) Graphic code generation method, graphic code generation device, computer equipment and storage medium
CN116250009A (en) Fast block generation for video-based point cloud coding
CN116980603A (en) Video data processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant