Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that embodiments of the application are capable of operation in sequences other than those illustrated or described herein, and that the terms "first," "second," etc. are generally used in a generic sense and do not limit the number of terms; for example, a first term can be one or more than one. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally means that the former and latter related objects are in an "or" relationship.
The following describes the video processing method, apparatus, device and storage medium provided by the present application in detail by specific embodiments with reference to the accompanying drawings.
Fig. 1 is a flowchart of a video quantization encoding and decoding method according to an embodiment of the present invention, as shown in fig. 1, including:
step 110, inputting N video frames of an original video into a coding module of a video processing model, and outputting quantization feature codes of M video frames, wherein M and N are positive integers, and M is less than or equal to N;
specifically, the original video described in the embodiment of the present application may be a black and white video, or may be a color video. The color original video can be represented by X, X belongs to R N×H×W×3 Where N denotes that the original video contains N video frames, H and W denote that the spatial resolution of each video frame is H × W, and 3 denotes three color channels R, G, and B.
The video processing model described in the embodiment of the application is a model trained in advance, and the video processing model comprises a coding module, can carry out quantization coding on an original video and outputs quantization characteristic codes of M video frames. The quantization feature codes of the M video frames are the quantization feature codes corresponding to the original video, and can be represented by VQ (E).
Specifically, downsampling processing is performed during quantization encoding of an original video, and the number of video frames output by an encoding module of the video processing model may be less than the number of original video frames, that is, M is less than or equal to N.
Furthermore, after the quantization feature code corresponding to the original video is obtained, the user can directly perform required operation on the quantization feature code without processing the complex original video.
Step 120, inputting the quantization feature codes of the M video frames into a decoding module of a video processing model, and mapping the quantization feature codes of the M video frames into M first video frame features;
specifically, in this embodiment of the present application, the quantization feature codes of the M video frames input to the decoding module may be identical quantization feature codes output by the encoding module, or may be quantization feature codes processed by the user.
The video processing model described in the embodiment of the present application further includes a decoding module. After the quantization feature codes of the M video frames are input to the decoding module of the video processing model, the quantization feature codes are first mapped into M first video frame features by a fully connected layer and a residual convolutional network, where the M first video frame features may be represented by z.
Step 130, reconstructing a first reference frame feature corresponding to each first video frame feature through a time axis attention mechanism according to the M first video frame features to obtain M first reference frame features;
specifically, reconstructing the first reference frame feature corresponding to the first video frame feature at the time t is to use the first video frame feature at the time t as a reference frame, and when reconstructing the reference frame, use a time axis attention mechanism to fuse features of the same pixel position in all other frames (video frames at all times).
Similarly, the M first reference frame features can be obtained by reconstructing the first reference frame feature corresponding to each first video frame feature. The dimensions of the M first reference frame features are the same as the dimensions of the M first video frame features.
More specifically, the M first reference frame features may be represented as z_r, z_r = RFC(z).
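The internal structure of RFC is not specified here; the following minimal numpy sketch illustrates the idea of time axis attention at each pixel position. The scaled dot-product form and the function name are assumptions for illustration, not details from the source.

```python
import numpy as np

def reconstruct_reference_frames(z):
    """Time-axis attention sketch: for every spatial position (i, j), each
    frame attends to the same position in all M frames and takes a
    softmax-weighted sum of their D-dimensional features.
    z: (M, H, W, D) -> z_r: (M, H, W, D)"""
    M, H, W, D = z.shape
    z_r = np.empty_like(z)
    for i in range(H):
        for j in range(W):
            f = z[:, i, j, :]                            # (M, D) one pixel over time
            scores = f @ f.T / np.sqrt(D)                # (M, M) frame-to-frame similarity
            scores -= scores.max(axis=1, keepdims=True)  # numerical stability
            w = np.exp(scores)
            w /= w.sum(axis=1, keepdims=True)            # softmax over the time axis
            z_r[:, i, j, :] = w @ f                      # fuse features of all frames
    return z_r
```

When all frames are identical the attention weights are uniform and the reconstruction returns the input unchanged, which matches the intuition that the reference frame is a temporal fusion of the other frames.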
Step 140, outputting a reconstructed video based on the M first video frame characteristics and the M first reference frame characteristics.
Specifically, the subsequent processing is continued in a decoding module of the video processing model, and the reconstructed video is output based on the M first video frame features and the M first reference frame features.
In this embodiment, first, a video processing model obtained through pre-training is used to encode an original video to obtain a quantization feature code, so that the data size can be reduced, and compared with the case that a user processes the original video, the processing of the quantization feature code is more convenient. Secondly, when the quantized feature codes are decoded by using the video processing model obtained by pre-training, a time axis attention mechanism is adopted to reconstruct the reference frames, so that accurate and effective reference frames can be obtained, and the reconstructed video is output based on the M first video frame features and the M first reference frame features, so that the reconstructed video has richer details, and the high-quality reconstructed video is obtained.
Optionally, the inputting N video frames of the original video into a coding module of the video processing model and outputting quantized feature codes of M video frames includes:
inputting N video frames of an original video into a coding module of a video processing model, and coding the N video frames of the original video to obtain first feature codes of M video frames;
determining the corresponding feature code of each video frame in the codebook based on the Euclidean distance between the first feature code of each video frame and each feature code in the codebook, wherein the codebook comprises a plurality of discrete hidden layer feature codes;
and outputting the quantized feature codes of the M video frames based on the corresponding feature codes of each video frame in the codebook.
Specifically, the original video, expressed as X ∈ R^(N×H×W×3), is input into the video processing model. First, the original video X is encoded by a 3D convolutional network and a residual network to obtain continuous video feature vectors, i.e., the first feature codes of the M video frames. The expression for encoding the original video is E = f(X).
The downsampling is completed during the encoding process, with downsampling ratio (s, f, f), where s corresponds to the video frame (time) dimension and f to the spatial resolution. The resulting first feature codes of the M video frames can be expressed as E ∈ R^(M×H/f×W/f×D), i.e., E ∈ R^(N/s×H/f×W/f×D), where D is the number of hidden layer nodes.
The codebook e ∈ R^(T×D) described in the embodiment of this application contains T discrete hidden layer feature codes and is used for determining the quantization feature code of each video frame.
Specifically, determining the feature code corresponding to the video frame at time t in the codebook means calculating the Euclidean distances between the first feature code E_(t,i,j) at time t and the T hidden layer feature codes in the codebook, and selecting the hidden layer feature code in the codebook nearest to E_(t,i,j) as the quantization feature code of the video frame at time t. The formula is expressed as follows:

VQ(E)_(t,i,j) = e_k, where k = argmin_n ‖E_(t,i,j) − e_n‖_2
similarly, a quantization encoding for each video frame may be determined. Furthermore, the quantization coding of each video frame is determined, that is, the quantization feature coding of M video frames can be obtained, and is marked as VQ (E) epsilon R N/s×H/f×W/f×D And outputting the quantization feature codes of the M video frames.
In this embodiment, the original video is encoded first, and the quantization feature codes of the M video frames are then determined based on the distance to each hidden layer feature code in the codebook, so that an accurate quantization feature code of the original video can be obtained. In the prior art, the quantization coding of a video is usually processed as the concatenation of the quantization codes of multiple image frames, which makes the video quantization feature code too long and inconvenient for users to process. Here the original video is compressed in both the temporal and spatial dimensions, which reduces the length of the video quantization code and makes it convenient for users to process.
Optionally, the outputting a reconstructed video based on the M first video frame features and the M first reference frame features includes:
respectively aligning each first video frame feature with the corresponding first reference frame feature to obtain M aligned first video frame features;
fusing the M aligned first video frame features and the M first video frame features through a time and space attention mechanism to obtain M fused first video frame features;
up-sampling the M fused first video frame features to obtain X reconstructed video frame features;
and outputting a reconstructed video based on the X reconstructed video frame characteristics, wherein X is a positive integer.
Specifically, in the embodiment of the present application, the expression of the M first video frame features is z ∈ R^(N/s×H/f×W/f×D), i.e., z ∈ R^(M×H/f×W/f×D).
Fig. 2 is a second flowchart of the video quantization encoding and decoding method according to the embodiment of the present invention. As shown in fig. 2, feature extraction is first performed on the original video to be processed by a 3D convolutional neural network and a residual network. The Euclidean distances between the first feature code of the video frame at each time and the feature codes in the codebook are then calculated, and the feature code in the codebook closest to the first feature code is looked up as the quantization feature code of that video frame. The quantization feature codes are then input into the fully connected layer and the residual network to obtain the video frame features. Finally, according to the video frame feature at each time, the reference frame feature is reconstructed through the time axis attention mechanism.
Image frames are aligned at the feature level using pyramid cascading and deformable convolution. Specifically, because the reconstructed first reference frame feature z_r,t and the first video frame feature z_t may be misaligned, pyramid cascading and deformable convolution (PCD) is adopted as the video frame alignment module to align the image frames at the feature level, i.e., the first video frame feature at time t is aligned with the first reference frame feature at time t. Similarly, each first video frame feature is aligned with the corresponding first reference frame feature. The expression for obtaining the M aligned first video frame features is:

z_a = PCD(z_r ∣ z), z_a ∈ R^(N/s×H/f×W/f×D)
the dimensions of the M aligned first video frame features are the same as the dimensions of the M first video frame features.
Further, a temporal and spatial attention fusion module is adopted to fuse the video frame features. Specifically, due to lens jitter, target motion and other causes, different video frames in the same video are blurred to different degrees, and therefore different video frames contribute differently to the recovery/enhancement of the reference frame. Conventional methods generally treat them as equally important, which is not the case. Therefore, an attention mechanism is introduced to give different weights to different feature maps in the two dimensions of space and time; that is, a Temporal and Spatial Attention (TSA) module is adopted as the video frame fusion module to fuse the M first video frame features z with the M aligned first video frame features z_a, obtaining the M fused first video frame features z′. The formula is expressed as follows:

z′ = TSA(z, z_a), z′ ∈ R^(N/s×H/f×W/f×D)
the dimensions of the M aligned first video frame features are the same as the dimensions of the M first video frame features.
3D convolution is then used to upsample the fused video frame features to obtain the reconstructed video. Specifically, the video features z′ are upsampled with 3D convolution at the ratio (b, c, c) to obtain the X reconstructed video frame features z^(×n). The formula is expressed as:

z^(×n) = Upsample_(b×c×c)(z′), z^(×n) ∈ R^((b×N/s)×(c×H/f)×(c×W/f)×D)
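The shape effect of the (b, c, c) upsampling can be illustrated with a nearest-neighbour repeat standing in for the learned 3D convolution (an assumption made only to keep the sketch self-contained):

```python
import numpy as np

def upsample_3d(z, b, c):
    """Nearest-neighbour stand-in for (b, c, c) upsampling along the frame
    axis and the two spatial axes; a learned transposed 3D convolution
    would be used in practice.
    z: (M, h, w, D) -> (b*M, c*h, c*w, D)"""
    z = np.repeat(z, b, axis=0)   # time axis: M -> b*M
    z = np.repeat(z, c, axis=1)   # height:   h -> c*h
    z = np.repeat(z, c, axis=2)   # width:    w -> c*w
    return z
```

Each input feature is simply copied into a b×c×c block of the output, so the output dimensions match the expression z^(×n) ∈ R^((b×M)×(c×h)×(c×w)×D) above.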
optionally, the X reconstructed video frame features are used as X second video frame features, and according to the X second video frame features, a second reference frame feature corresponding to each second video frame feature is reconstructed through a time axis attention mechanism to obtain X second reference frame features; respectively aligning each second video frame feature with the corresponding second reference frame feature to obtain X aligned second video frame features; fusing the X aligned second video frame features and the X second video frame features through a time and space attention mechanism to obtain X fused second video frame features; up-sampling the X fused second video frame features to obtain Y reconstructed video frame target features; performing up-sampling on the Y reconstructed video frame target characteristics to obtain target video characteristics; and outputting a reconstructed video based on the target video characteristics, wherein Y is a positive integer.
Specifically, after the X reconstructed video frame features z^(×n) are obtained, they are used as the X second video frame features, and second reference frame reconstruction, image frame alignment and image frame fusion are performed in the same way as in the foregoing embodiments, which is not repeated here.
Furthermore, the dimensions of the X second reference frame features, the X aligned second video frame features, the X fused second video frame features, and the X second video frame features are the same, i.e., the dimensions are (b × N/s) × (c × H/f) × (c × W/f) × D.
3D convolution is used to upsample the X fused second video frame features at the ratio (b, c, c) to obtain the Y reconstructed video frame target features.
3D convolution is then used to upsample the Y reconstructed video frame target features to obtain the target video features, i.e., the reconstructed video X_rec.
More specifically, the respective upsampling and downsampling ratios may be set as needed; for example, the downsampling ratio (s, f, f) may take the value (4, 8, 8), and the upsampling ratio (b, c, c) may take the value (2, 2, 2). As another example, to make the reconstructed video a closer restoration, the number of video frames and the resolution of the reconstructed video may be set to be the same as those of the original video. The upsampling and downsampling ratios are not specifically limited here.
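A worked shape check for these example ratios (the concrete video size used below is illustrative, not from the source):

```python
# Encoder downsampling ratio (s, f, f) = (4, 8, 8); decoder upsampling
# ratio (b, c, c) = (2, 2, 2) applied at each decoder stage, as in the
# example values above.
N, H, W = 32, 256, 256                      # original video: N frames at H x W
s, f, b, c = 4, 8, 2, 2

M, h, w = N // s, H // f, W // f            # latent size after the encoder
assert (M, h, w) == (8, 32, 32)

X_, hx, wx = b * M, c * h, c * w            # X features after decoder stage one
assert (X_, hx, wx) == (16, 64, 64)

Y_, hy, wy = b * X_, c * hx, c * wx         # Y target features after stage two
assert (Y_, hy, wy) == (32, 128, 128)
# a final upsampling stage with ratio (1, 2, 2) would then restore
# (32, 256, 256), matching the original size as the text suggests
```

The product of the per-stage upsampling factors must equal the downsampling ratio (s, f, f) for the reconstructed video to have the same frame count and resolution as the original.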
In this embodiment, when the quantized feature code is decoded, a reconstructed video with a higher reduction degree can be obtained through reference frame construction, video frame alignment, video frame fusion and 3D upsampling convolution.
Optionally, before the inputting N video frames of the original video into the encoding module of the video processing model, the method further includes: for any video sample, inputting the video sample into the video processing model, and outputting a prediction reconstruction video corresponding to the video sample; calculating a loss value according to the prediction reconstruction video corresponding to the video sample and the video sample by using a preset loss function; and if the loss value is smaller than a preset threshold value, finishing the training of the video processing model.
Specifically, before the video processing model is used for encoding and decoding, the video processing model also needs to be trained, and the specific training process is as follows:
after obtaining a plurality of video samples, for any one video sample, inputting the video sample into a video processing model, and outputting a prediction reconstructed video corresponding to the video sample. On the basis, a preset loss function is utilized to calculate a loss value according to the video sample and the prediction reconstruction video. The preset loss function can be set according to actual requirements, and the times are not specifically limited. After the loss value is obtained through calculation, the training process is finished, model parameters in the video processing model are updated, and then the next training is carried out. In the training process, if the loss value obtained by calculation for a certain video sample is smaller than a preset threshold value, the training of the video processing model is completed.
Further, in the embodiment of the present invention, the MSE, PSNR and SSIM indices of the video processing model of the present invention and of a conventional video coding and decoding model are measured after 150,000 and 200,000 training iterations, respectively. The results are shown in table 1 below:
table 1 conventional video coding and decoding model and video processing model detection results of the present invention:
as can be seen from the table, the accuracy of the reconstructed video obtained by decoding with the conventional single upsampled convolutional layer is significantly lower than that of the video reconstructed by using the decoding module of the present application. The difference increases with the increase of the total iteration times, after 20 ten thousand iterations, the difference of MSE loss reaches 0.007, the difference of PSNR index reaches 0.75, and the difference of SSIM index reaches 0.02. And the method and the device can accurately reconstruct the video with the resolution as high as 256 multiplied by 256.
In addition, updating the model parameters during training also includes updating the codebook, i.e., updating the discrete hidden layer feature codes in the codebook. Because argmin is not differentiable, the discrete hidden layer feature codes in the codebook are updated with an exponential moving average (momentum) update.
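The exponential-moving-average codebook update can be sketched as follows; the decay and eps values follow common VQ-VAE practice and are assumptions, not values from the source.

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, ema_sum, E_flat, idx,
                        decay=0.99, eps=1e-5):
    """EMA codebook update sketch, used in place of a gradient step
    because argmin is not differentiable.
    codebook: (T, D); cluster_size: (T,); ema_sum: (T, D);
    E_flat: (n, D) encoder outputs; idx: (n,) their nearest-code indices."""
    T, D = codebook.shape
    one_hot = np.eye(T)[idx]                       # (n, T) hard assignments
    n_t = one_hot.sum(axis=0)                      # hits per code this batch
    sum_t = one_hot.T @ E_flat                     # (T, D) summed assigned features
    # exponential moving averages of counts and feature sums
    cluster_size = decay * cluster_size + (1 - decay) * n_t
    ema_sum = decay * ema_sum + (1 - decay) * sum_t
    # each code moves toward the running mean of the features assigned to it
    codebook = ema_sum / (cluster_size[:, None] + eps)
    return codebook, cluster_size, ema_sum
```

With decay = 0 each code jumps straight to the mean of the features assigned to it in the current batch; a decay near 1 makes the codebook drift slowly, which stabilizes training.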
In this embodiment, the loss value of the video processing model is favorably controlled within a preset range by training the video processing model, so that the reduction degree of the video processing model for video reconstruction is favorably improved.
Optionally, the preset loss function is:

L = ‖X − X_rec‖² + ‖sg[E] − VQ(E)‖² + β‖E − sg[VQ(E)]‖²

wherein X is the original video, X_rec is the reconstructed video, ‖X − X_rec‖² is the mean square error loss, sg is the stop-gradient operation, E is the video feature output by the video coding module, VQ is the feature quantization operation, and β is a hyper-parameter of model training.
Specifically, the loss value calculated by the preset loss function described in the embodiment of the present application includes a video reconstruction loss value and a quantization coding loss value.
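The two loss components can be computed as in the minimal sketch below. Since the stop-gradient operation sg does not change the numerical value of the loss (it only blocks backpropagation), it is omitted here, and β = 0.25 is an assumed example value, not one from the source.

```python
import numpy as np

def vq_loss(X, X_rec, E, VQ_E, beta=0.25):
    """Numerical sketch of the preset loss: reconstruction MSE plus the
    codebook and commitment terms. In an autodiff framework the two
    quantization terms differ only in where sg stops the gradient; their
    numerical values are identical."""
    rec = np.mean((X - X_rec) ** 2)        # video reconstruction loss
    codebook = np.mean((E - VQ_E) ** 2)    # the ||sg[E] - VQ(E)||^2 term
    commit = np.mean((E - VQ_E) ** 2)      # the ||E - sg[VQ(E)]||^2 term
    return rec + codebook + beta * commit
```

The first term is the video reconstruction loss value and the remaining two are the quantization coding loss value, matching the two components described above.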
In the embodiment, the loss value is calculated while considering the video reconstruction loss value and the quantization coding loss value, so that the model can be better trained, and the video reconstruction restoration degree of the video processing model can be improved.
Fig. 3 is a detailed flowchart of a video quantization encoding and decoding method according to an embodiment of the present invention, as shown in fig. 3, including:
the method comprises the steps of inputting an original video into a video processing model, coding the original video by utilizing a 3D convolution downsampling network and a residual error network to obtain a first feature code, calculating Euclidean distance based on the first feature code and a discrete hidden layer feature code in a codebook (codebook), determining a quantization feature code, and outputting the quantization feature code. And the full connection layer and the residual error network map the quantized feature codes into first video frame features, and perform reference frame reconstruction, image frame alignment, fusion and 3D convolution upsampling on the first video frame features by the time and space attention fusion module to obtain reconstructed video frame features, and perform reference frame reconstruction, image frame alignment, fusion and 3D convolution upsampling on the reconstructed video frame features as second video frame features by the time and space attention fusion module again to obtain reconstructed video frame target features. And performing 3D convolution upsampling on the target characteristics of the reconstructed video frame to obtain target video characteristics, and outputting a reconstructed video.
The following describes the video quantization coding and decoding device provided by the present invention, and the video quantization coding and decoding device described below and the video quantization coding and decoding method described above can be referred to correspondingly.
Fig. 4 is a schematic structural diagram of a video quantization encoding and decoding device according to an embodiment of the present invention, as shown in fig. 4, including: a first output module 410, a mapping module 420, a reconstruction module 430, and a second output module 440; the first output module 410 is configured to input N video frames of an original video to a coding module of a video processing model, and output quantization feature codes of M video frames, where M and N are positive integers, and M is less than or equal to N; the mapping module 420 is configured to input the quantization feature codes of the M video frames into a decoding module of a video processing model, and map the quantization feature codes of the M video frames into M first video frame features; the reconstruction module 430 is configured to reconstruct, according to the M first video frame features, a first reference frame feature corresponding to each first video frame feature through a time axis attention mechanism, to obtain M first reference frame features; the second output module 440 is configured to output a reconstructed video based on the M first video frame characteristics and the M first reference frame characteristics.
Optionally, the first output module is specifically configured to input N video frames of an original video to a coding module of a video processing model, and perform coding processing on the N video frames of the original video to obtain first feature codes of M video frames; determining the corresponding feature code of each video frame in the codebook based on the Euclidean distance between the first feature code of each video frame and each feature code in the codebook, wherein the codebook comprises a plurality of discrete hidden layer feature codes; and outputting the quantized feature codes of the M video frames based on the corresponding feature codes of each video frame in the codebook.
Optionally, the second output module is specifically configured to align each first video frame feature with the corresponding first reference frame feature, respectively, to obtain M aligned first video frame features; fuse the M aligned first video frame features and the M first video frame features through a temporal and spatial attention mechanism to obtain M fused first video frame features; up-sample the M fused first video frame features to obtain X reconstructed video frame features; and output a reconstructed video based on the X reconstructed video frame features, wherein X is a positive integer.
Optionally, the second output module is specifically configured to use the X reconstructed video frame features as X second video frame features, and reconstruct, according to the X second video frame features, a second reference frame feature corresponding to each second video frame feature through a time axis attention mechanism, so as to obtain X second reference frame features; respectively aligning each second video frame feature with the corresponding second reference frame feature to obtain X aligned second video frame features; fusing the X aligned second video frame features and the X second video frame features through a time and space attention mechanism to obtain X fused second video frame features; up-sampling the X fused second video frame features to obtain Y reconstructed video frame target features; performing up-sampling on the Y reconstructed video frame target characteristics to obtain target video characteristics; and outputting a reconstructed video based on the target video characteristics, wherein Y is a positive integer.
Optionally, the apparatus further comprises:
the training module is used for inputting the video sample to the video processing model for any video sample and outputting a prediction reconstruction video corresponding to the video sample; calculating a loss value according to the prediction reconstruction video corresponding to the video sample and the video sample by using a preset loss function; and if the loss value is smaller than a preset threshold value, finishing the training of the video processing model.
Optionally, the preset loss function is:

L = ‖X − X_rec‖² + ‖sg[E] − VQ(E)‖² + β‖E − sg[VQ(E)]‖²

wherein X is the original video, X_rec is the reconstructed video, ‖X − X_rec‖² is the mean square error loss, sg is the stop-gradient operation, E is the video feature output by the video coding module, VQ is the feature quantization operation, and β is a hyper-parameter of model training.
In this embodiment, first, a video processing model obtained through pre-training is used to encode an original video to obtain a quantization feature code, so that the data size can be reduced, and compared with the case that a user processes the original video, the processing of the quantization feature code is more convenient. Secondly, when the quantized feature codes are decoded by using the video processing model obtained by pre-training, a time axis attention mechanism is adopted to reconstruct the reference frames, so that accurate and effective reference frames can be obtained, and the reconstructed video is output based on the M first video frame features and the M first reference frame features, so that the reconstructed video has richer details, and further the high-quality reconstructed video is obtained.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 5, the electronic device may include: a processor (processor) 510, a communication Interface (Communications Interface) 520, a memory (memory) 530, and a communication bus 540, wherein the processor 510, the communication Interface 520, and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a video quantization codec method comprising: inputting N video frames of an original video into a coding module of a video processing model, and outputting quantization characteristic codes of M video frames, wherein M and N are positive integers, and M is less than or equal to N; inputting the quantization feature codes of the M video frames into a decoding module of a video processing model, and mapping the quantization feature codes of the M video frames into M first video frame features; according to the M first video frame characteristics, reconstructing first reference frame characteristics corresponding to each first video frame characteristic through a time axis attention mechanism to obtain M first reference frame characteristics; outputting a reconstructed video based on the M first video frame features and the M first reference frame features.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, a computer can execute the video quantization coding and decoding method provided by the above methods, where the method includes: inputting N video frames of an original video into a coding module of a video processing model, and outputting quantization feature codes of M video frames, wherein M and N are positive integers, and M is less than or equal to N; inputting the quantization feature codes of the M video frames into a decoding module of a video processing model, and mapping the quantization feature codes of the M video frames into M first video frame features; according to the M first video frame characteristics, reconstructing first reference frame characteristics corresponding to each first video frame characteristic through a time axis attention mechanism to obtain M first reference frame characteristics; outputting a reconstructed video based on the M first video frame features and the M first reference frame features.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the video quantization coding and decoding method provided by the above methods, the method including: inputting N video frames of an original video into a coding module of a video processing model, and outputting quantization feature codes of M video frames, wherein M and N are positive integers, and M is less than or equal to N; inputting the quantization feature codes of the M video frames into a decoding module of a video processing model, and mapping the quantization feature codes of the M video frames into M first video frame features; according to the M first video frame characteristics, reconstructing first reference frame characteristics corresponding to each first video frame characteristic through a time axis attention mechanism to obtain M first reference frame characteristics; outputting a reconstructed video based on the M first video frame features and the M first reference frame features.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.