CN113487564B - Double-flow time sequence self-adaptive selection video quality evaluation method for original video of user

Double-flow time sequence self-adaptive selection video quality evaluation method for original video of user

Info

Publication number
CN113487564B
CN113487564B
Authority
CN
China
Prior art keywords
representing
perception
frame
intra
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110753105.4A
Other languages
Chinese (zh)
Other versions
CN113487564A (en)
Inventor
刘银豪
张威
殷海兵
陈勇
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110753105.4A
Publication of CN113487564A
Application granted
Publication of CN113487564B
Legal status: Active
Anticipated expiration


Classifications

    • G06T 7/0002: Physics; Computing; Image data processing or generation, in general; Image analysis; Inspection of images, e.g. flaw detection
    • G06N 3/048: Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Activation functions
    • G06N 3/084: Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
    • H04N 17/00: Electricity; Electric communication technique; Pictorial communication, e.g. television; Diagnosis, testing or measuring for television systems or their details
    • G06T 2207/10016: Indexing scheme for image analysis or image enhancement; Image acquisition modality; Video; Image sequence
    • G06T 2207/20081: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Training; Learning
    • G06T 2207/20084: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/30168: Indexing scheme for image analysis or image enhancement; Subject of image; Context of image processing; Image quality inspection


Abstract

The invention belongs to the technical field of user-generated video content processing, and discloses a double-flow time sequence self-adaptive selection video quality evaluation method for user original video, comprising the following steps. Step 1: an intra-frame quality perception module based on content weight allocation; step 2: double-flow global time domain modeling; step 3: dual-stream deeper-level loss function weight allocation. The method extracts video quality features along the two dimensions of time domain and space domain: in the space domain, multi-scale feature maps are extracted and their weights are re-allocated in combination with human visual saliency perception; in the time domain, a double-flow deeper RNN structure is introduced, and forward and backward time sequence information is iterated to extract deeper bidirectional time sequence information. Finally, a deep supervision module allocates the loss functions of the different perception levels and sequences, and the final score is obtained by regression. On four UGC-VQA databases, the method achieves a further performance improvement over the current best deep learning methods.

Description

Double-flow time sequence self-adaptive selection video quality evaluation method for original video of user
Technical Field
The invention belongs to the technical field of video processing of original user content, and particularly relates to a double-flow time sequence self-adaptive selection video quality evaluation method for original user video.
Background
With the development of mobile multimedia devices and the popularization of video social media platforms, ordinary users have ever more opportunities to create content themselves. However, due to the limitations of shooting environments and devices, user-generated content (UGC) video is usually accompanied by various capture distortions such as defocus, motion blur, camera shake, under-/over-exposure, sensor noise and poor shooting conditions. These severely degrade the viewer's visual experience. In practical applications it is also necessary to assess video quality objectively (video quality assessment, VQA). For example, such an assessment can determine whether video content was produced by a professional photographer, which a video distribution platform can use for recommendation ranking. In addition, a UGC video quality model can serve as an optimization criterion for video enhancement techniques; quantifying video quality therefore also helps video producers capture better UGC video.
Over the past few years, many no-reference image quality assessment (NR-IQA) and no-reference video quality assessment (NR-VQA) methods have emerged. The key difference between NR-IQA and NR-VQA is that NR-VQA must combine spatial and temporal correlations, which makes accurate assessment of video quality more difficult. Most existing NR-VQA methods were designed to evaluate compression distortion and transmission artifacts, and are unsatisfactory for UGC video, where multiple mixed distortions coexist. Recently, with the explosive growth of video social media platforms and video conferencing systems, the increasing volume of UGC video poses new challenges for the NR-VQA task.
Reviewing the development of NR-VQA: as machine learning advanced rapidly, some researchers attempted to build NR-VQA models on machine learning theory. Classical NR-VQA methods were proposed based on natural scene statistics (NSS) analysis, such as NIQE, BRISQUE, FRIQUEE and HIGRADE. When applied to video, NSS-based methods obtain a quality score for the whole video by measuring the deviation of each frame from natural scene statistics and then averaging the statistics over all frames; they are clearly insufficient for modeling temporal features. V-BLIINDS is an extension of image-based evaluation that combines temporal frequency characteristics with temporal motion information. V-CORNIA obtains a codebook through unsupervised learning and support vector machine learning and applies it frame by frame; the final video quality is then obtained by temporal memory-effect pooling when aggregating the frame-level quality scores.
Recently, deep learning techniques have also been used for NR-VQA. Varga et al. applied LSTM to temporal modeling in NR-VQA. Wu et al. estimate video quality via a similarity map between video frames. Liu et al. proposed the V-MEON model, a multi-task CNN framework that uses a 3D-CNN for feature extraction and fully connected layers to predict video quality. Zhang et al. developed a general NR-VQA model and resampling strategy based on weakly supervised learning theory, using a transfer learning method.
In view of the increasing interest in UGC quality assessment, four relevant datasets have been collected and annotated: CVD2014, KoNViD-1k, LIVE-Qualcomm and LIVE-VQC. These databases are very challenging, and previous NR-VQA methods, validated on synthetically distorted video datasets, produce results that do not agree with human visual perception. For this reason, several methods that capture spatial and temporal distortions have been proposed and have produced exciting results.
TLVQM first computes low-complexity features over the complete sequence, then extracts high-complexity features from a subset of representative video frames, and finally predicts quality with a support vector machine. A further improvement by the same authors, CNN-TLVQM, combines the hand-crafted features of TLVQM with spatial features extracted by a 2D-CNN trained for image quality prediction. RIR-Net comprises two parts: quality degradation learning and motion effect modeling. The first part is built on ResNet-50 and extracts distortion-aware features from single frames; the second part is an RNN-based skip-level temporal model that obtains three different temporal frequencies through temporal downsampling and motion-information aggregation. Recently, Li et al. proposed a new NR-VQA framework (MDTVSFA) that uses a mixed-dataset training strategy for UGC videos, with the previously proposed VSFA as its backbone; two losses, a monotonicity-induced loss and a linearity-induced loss, are then designed for training the backbone on the mixed data. This method greatly improves cross-database UGC-VQA performance and compensates for the current lack of a single huge database. Furthermore, Chen et al. proposed the NR-VQA frameworks TRR-QoE and MS-TRR, which introduce an attention mechanism to measure multi-scale temporal relationship information at the corresponding temporal resolutions. Zheng et al. selected 60 representative features from 763 existing statistical features; the resulting model, VIDEVAL, employs a novel selection strategy to select and aggregate features from existing NR-VQA methods.
In summary, UGC-VQA is receiving more and more attention, and the field urgently needs a model that conforms to the human visual perception mechanism. However, recently proposed UGC-VQA methods suffer degraded prediction performance when facing videos with complex scene changes. The likely reason is the under-utilization of spatio-temporal information, which is precisely the key clue for VQA. In particular, the temporal information in a video can be represented by inter-frame correlation, and this important evaluation cue has not yet been taken into account by existing methods.
1. Traditional NR-VQA models only apply ordinary multi-scale, coarse-grained extraction to intra-frame features, and do not consider how human visual attention is distributed over the intra-frame features of different frames.
2. For quality perception of time sequence information, the human visual system tends to perform a feedback process after shallow perception followed by deeper processing in the brain. Current models, however, consider only forward RNN modules, so the results obtained are inaccurate.
3. For the forward and backward time-domain information of different perception levels, the human visual system has different processing capacities, yet current algorithms do not consider the weight allocation among different time sequence perception levels.
Disclosure of Invention
The invention aims to provide a double-flow time sequence self-adaptive selection video quality evaluation method for original videos of users, so as to solve the technical problems.
In order to solve the technical problems, the specific technical scheme of the double-flow time sequence self-adaptive selection video quality evaluation method for original videos of users is as follows:
a double-flow time sequence self-adaptive selection video quality evaluation method for original video of a user comprises the following steps:
step 1: an intra-frame quality perception module based on content weight allocation;
step 2: double-flow global time domain modeling;
step 3: dual stream deeper level loss function weight allocation.
Further, the step 1 comprises the following specific steps:
in content-aware feature extraction, a feature extraction module with multiple perception levels is adopted; for intra-frame features, hierarchical low-level texture/color features and high-level semantic features are taken into account simultaneously, expressed as follows:
ResNet-50 is used as the backbone network, and the resulting intra-frame feature maps of different perception levels describe the frame from different angles; non-uniform sampling based on saliency perception is then introduced into the intra-frame feature maps, expressed as follows:
the result denotes the perception feature map after the saliency-aware characteristic is introduced, and Saliency denotes the salient-point window feature extraction proposed to balance feature scale against intra-frame perceptual fine granularity; the saliency map is extracted by the FT method, whose formula is:

$S(x,y) = \lVert I_\mu - I_{\omega_{hc}}(x,y) \rVert$

where $I_{\omega_{hc}}$ denotes the Lab color-space map of the original image after Gaussian filtering, $I_\mu$ denotes the per-channel mean in the Lab color space, and $\lVert\cdot\rVert$ denotes the Euclidean norm;
non-uniform extraction is carried out on the intra-frame feature maps by adopting two modes, mean aggregation and standard-deviation aggregation, taking window-type features around the most salient point:
where AP denotes mean aggregation, SD denotes standard-deviation aggregation, and $[\cdot]_\lambda$ denotes extraction over a window of size $\lambda$ centered on the most salient point; the three-scale depth feature maps are combined through mean pooling and standard-deviation aggregation to generate two initial multi-scale depth features;
By exploring the interdependence between the content attention allocation mechanisms with a dual-stream framework, the following weight allocation approach is adopted:
where σ(·), δ(·) and α(·) denote the Sigmoid function, the ReLU function and inter-frame mean aggregation, respectively, and W2 (W4) denotes a 1×1 convolution layer with compression ratio r = 0.5; a nonlinearity is then introduced through the ReLU function, and another 1×1 convolution layer, whose parameters are denoted W1 (W3), expands by the same ratio r to restore the original dimension; finally, a more accurate intra-frame feature characterization is obtained through weight redistribution by the Sigmoid function;
Finally, the two re-weighted initial features are aggregated by a concatenation function to obtain the representation of the intra-frame features, where $S_t$ is the spatial feature of the current frame.
Further, the step 2 specifically includes the following steps:
adopting a GRU as the RNN module to introduce time sequence information, and first introducing bidirectional perception modeling; the current hidden state $h_t$ is calculated from the current input $x_t$ and the previous hidden state $h_{t-1}$ as follows:

$h_t^f = \mathrm{GRU}(x_t^f, h_{t-1}^f), \qquad h_t^b = \mathrm{GRU}(x_t^b, h_{t-1}^b)$
where $x^f$ and $x^b$ denote the forward-serial and backward-serial inputs, respectively, and $h_0$ denotes the hidden state initialized to 0; intra-frame features containing time sequence information are obtained through the temporal modeling of the RNN, yielding the forward timing information and the backward timing information together with the hidden state information of the last frame;
reverse deeper modeling is proposed, namely:
where the first term denotes the reversal operation in the time sequence, and the results are the intra-frame features containing deep backward time-domain information and the last hidden state after the deep backward time-domain information is processed;
the Dual Deep GRU (DDGRU) structure is used:
where the results are the intra-frame features containing deep forward time-domain information and the last hidden state after the deep forward time-domain information is processed;
the four resulting streams respectively represent the forward shallow features, the backward shallow features, the forward deep features and the backward deep features, and the overall video score Q is finally calculated through a single fully connected layer:
where $W_{hQ}$ and $b_{hQ}$ denote the weight coefficient and the bias coefficient.
Further, the step 3 specifically includes the following steps:
given an input distorted video Y, the deep supervision loss is described as a weighted sum of several bypass output losses, and the total loss of the prediction is calculated as:

$L = \sum_{m=1}^{N} \alpha_m L_m + \beta\, L_{out}$
where $L_m$ is the loss function between the m-th output and the label, N = 3, $\alpha_m$ denotes the weight coefficients of the different time sequence structures, $\beta$ is a hyper-parameter obtained by learning to balance the two loss terms, and $L_{out}$ is the loss function of the final output layer.
The double-flow time sequence self-adaptive selection video quality evaluation method for the original video of the user has the following advantages:
the method extracts video quality features from two dimensions of a time domain and a space domain, extracts a multi-scale feature map in the space domain, and performs weight redistribution on the feature map by combining human eye visual saliency perception. In the aspect of time domain, a double-flow deeper RNN structure is introduced, and the forward and backward time sequence information is iterated to extract the deeper double-time sequence information. Finally, the last score is regressed after the distribution of the loss functions of different perception levels and sequences by the depth supervision module. On four UGC-VQA databases, further performance improvement is achieved compared with the current best deep learning method.
Drawings
FIG. 1 is a flow chart of the overall architecture network of the present invention;
FIG. 2 is a feature map based on saliency awareness;
FIG. 3 is a schematic diagram of a saliency map extraction method;
FIG. 4 is a diagram of a feature attention weighting distribution framework within a frame;
FIG. 5 is a diagram of a dual-stream deeper RNN architecture.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, the method for evaluating the quality of the video by the dual-stream time sequence adaptive selection for the original video of the user is described in further detail below with reference to the accompanying drawings.
1. Intra-frame quality perception module based on content weight distribution
In content-aware feature extraction, a multi-perception-level feature extraction module is adopted. For intra-frame features, hierarchical low-level texture/color features and high-level semantic features are taken into account simultaneously.
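As an illustrative sketch only, the multi-level intra-frame feature extraction can be organized as follows in PyTorch; the choice of the layer2/layer3/layer4 outputs as the three perception levels, as well as the class and variable names, are assumptions, since the text only specifies ResNet-50 feature maps of different depths.

```python
import torch
import torchvision

class MultiLevelFeatureExtractor(torch.nn.Module):
    """Extract intra-frame feature maps at several perception levels from ResNet-50.

    Using the layer2/layer3/layer4 outputs as the three levels is an assumption;
    the text only specifies ResNet-50 feature maps of different depths.
    """
    def __init__(self):
        super().__init__()
        weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V1
        backbone = torchvision.models.resnet50(weights=weights)
        self.stem = torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                        backbone.relu, backbone.maxpool,
                                        backbone.layer1)
        self.layer2 = backbone.layer2
        self.layer3 = backbone.layer3
        self.layer4 = backbone.layer4

    def forward(self, frame):              # frame: (B, 3, H, W)
        x = self.stem(frame)
        f2 = self.layer2(x)                # lower-level texture/color cues
        f3 = self.layer3(f2)               # mid-level structure
        f4 = self.layer4(f3)               # high-level semantics
        return [f2, f3, f4]                # one map per perception level
```

The three returned maps would play the role of the multi-perception-level intra-frame features referred to above.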
Meanwhile, the method considers the saliency perception that influences visual attention and introduces saliency-aware characteristics into the feature maps. Salient-point window feature extraction is proposed to balance feature scale against intra-frame perceptual fine granularity.
The main features of the salient-point window are then extracted in two modes: mean aggregation and standard-deviation aggregation. The three-scale depth feature maps are combined through mean pooling and standard-deviation aggregation to generate two initial multi-scale depth features.
Inspired by the content attention mechanism in visual perception, this work further improves the weight allocation among the aggregated features of different frames. Since mean pooling and standard-deviation aggregation characterize the feature map differently, content weights are used to balance the two branches across frames and to better represent the intra-frame content; the interdependence between the content attention allocation mechanisms is explored with a dual-stream framework.
The method splits mean pooling and standard-deviation aggregation into two paths, extracts inter-frame weights for each path through an improved inter-frame attention weight allocation mechanism, and finally obtains a more accurate intra-frame feature characterization through weight redistribution by a Sigmoid function.
Finally, the two initial deep features are aggregated by a concatenation function to obtain the representation of the intra-frame features, where $S_t$ is the spatial feature of the current frame.
2. Double-flow global time domain modeling
The method considers the bidirectional deep perception mechanism that affects temporal quality perception of video, and proposes a double-flow, deeper global time-domain modeling module, as follows:
firstly, the method adopts GRU to carry out modeling of global time domain. Firstly, the spatial domain characteristics of the current frame of each frame are input in a forward serial mode, and visual time domain information based on forward perception is obtained. Meanwhile, considering the content characteristics of different videos, the reverse serial input of the spatial domain characteristics of the current frame of each frame is tried to obtain the visual time domain information based on backward perception.
Visual perception is a complex process involving hierarchical levels of processing that correspond to internal physiological structures. Classically, the visual system can be seen as a hierarchy of cortical regions and cell types. Neurons in the low-level regions (V1, V2) receive visual input and represent simple features such as lines or edges of a particular orientation and position. Their outputs are integrated and processed by successive cortical areas (V3, V4, medial temporal area MT), which progressively generalize spatial parameters and represent increasingly global features. Finally, further areas (temporal regions, frontal lobe regions, etc.) integrate these outputs to represent abstract forms, objects and categories. However, according to current studies, the function of the reverse link in the feedback process remains unknown.
Reverse Hierarchy Theory (RHT) is a typical perceptual framework in physiological vision. The theory proposes that the forward hierarchy plays an implicit transfer role, while explicit perception starts from the higher layers. First, after the receptors accept low-order input, the visual mechanism represents the gist of the scene using a first-order approximation. Explicit perception then returns to the lower regions through feedback connections, scrutinizing the detailed information available at the input to integrate it into conscious vision. Thus, the initial perception is based on a large receptive field, guessed details, and binding errors; subsequent vision fuses the details and overcomes this blindness. The theory holds that the feed-forward process is spontaneous and implicit, while the conscious perception process begins at the end of the feed-forward process and gradually loops back as needed. This feedback process mainly serves the need for fine discrimination.
Based on RHT, the method proposes a deeper bidirectional perception system in which backward deep perception information is built on the forward shallow perception.
Meanwhile, since different videos place different emphasis on forward and backward perception, a double-flow deeper perception system is introduced to balance the multi-level perception information.
3. Double-flow deeper loss function weight allocation
Considering the unavoidable breaks in gradient propagation between different timing relationships, deep supervision is introduced to provide integrated, direct supervision of each side output, rather than the standard approach of supervising only the final output. With the help of this structure, back-propagation of the gradient through the network becomes easier: by adding supervision layers that preserve early gradients and alleviate the vanishing-gradient problem, the optimizer can better handle the optimization of gradient conduction across different layers.
The proposed double-flow time sequence self-adaptive selection video quality evaluation framework is suitable for evaluating the user-generated content videos that are currently so popular. To better match the processing of the human cerebral cortex, the method starts by extracting intra-frame features, then divides the time-domain information modeling network into four modules (forward, backward, forward-deep and backward-deep), and finally performs optimized gradient transfer of the different time sequence information through an adaptive weight allocation module. Fig. 1 illustrates the main structure of the framework. Fig. 2 shows the MFE module, which extracts intra-frame features of different perception levels; the non-uniform feature-map window sampling added in Fig. 3 reduces the features while retaining the most quality-relevant original information, and the inter-frame attention module of Fig. 4 then allocates the content features more effectively, which is used for content weight allocation. In the GTM module of Fig. 5, the extracted features undergo forward and reverse time sequence information extraction in the time dimension, and deep time sequence information is introduced at the same time. Finally, a DS module is introduced, and the score loss-function regression allocation is performed on the information of the different time domains through the deep supervision module. The specific implementation steps are as follows:
1. intra-frame quality perception module based on content weight distribution
In content-aware feature extraction, a multi-perception-level feature extraction module is adopted. For intra-frame features, hierarchical low-level texture/color features and high-level semantic features are taken into account simultaneously. Specifically, the expression is as follows:
in this formula, resNet-50 is used herein as the backbone network. Wherein the method comprises the steps ofThe intra-frame feature map representing different sensing layers describes the frame from different angles, specifically, the images are convolved by different convolution cores to obtain corresponding kernels as features of the images. Meanwhile, in order to fit the uneven perception of visual perception, the method considers the significance perception influencing visual attention, and uneven sampling based on the significance perception is introduced into the feature map. Its expressionThe formula is as follows:
in the formula (i),and (3) representing a perception feature map after the significant perception characteristic is introduced, wherein Saliency is represented as a feature extraction mode of weighting a feature scale and intra-frame perception granularity and providing a significant point window line feature extraction mode. In the method, the FT method is adopted to extract the saliency map, the computing framework is shown in fig. 3, and the specific formula is as follows:
where $I_{\omega_{hc}}$ denotes the Lab color-space map of the original image after Gaussian filtering, $I_\mu$ denotes the per-channel mean in the Lab color space, and $\lVert\cdot\rVert$ denotes the Euclidean norm.
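For illustration, the FT saliency map above can be computed with OpenCV roughly as follows; the Gaussian kernel size and the final normalization to [0, 1] are assumptions not fixed by the text.

```python
import cv2
import numpy as np

def ft_saliency(bgr_image):
    """Frequency-tuned (FT) saliency: S(x, y) = || I_mu - I_whc(x, y) ||, where
    I_whc is the Gaussian-filtered image in Lab space and I_mu is the per-channel
    Lab mean of the whole frame.  Kernel size and normalization are assumptions."""
    blurred = cv2.GaussianBlur(bgr_image, (5, 5), 0)             # suppress high-frequency noise
    lab = cv2.cvtColor(blurred, cv2.COLOR_BGR2LAB).astype(np.float32)
    mean_lab = lab.reshape(-1, 3).mean(axis=0)                   # I_mu
    saliency = np.linalg.norm(lab - mean_lab, axis=2)            # Euclidean distance per pixel
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
```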
Non-uniform extraction is then carried out on the intra-frame feature maps using the two modes of mean aggregation and standard-deviation aggregation, taking window-type features around the most salient point:
where AP denotes mean aggregation, SD denotes standard-deviation aggregation, and $[\cdot]_\lambda$ denotes extraction over a window of size $\lambda$ centered on the most salient point; the three-scale depth feature maps are combined through mean pooling and standard-deviation aggregation to generate the two initial multi-scale depth features.
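A minimal sketch of the salient-window mean/standard-deviation aggregation is given below, assuming the saliency map is resized to the feature-map resolution and treating the window size lambda as a free parameter; the function name and tensor layout are illustrative.

```python
import torch
import torch.nn.functional as F

def salient_window_pool(feat, saliency, window=7):
    """Aggregate a feature map over a window centered on the most salient point.

    feat:     (B, C, H, W) intra-frame feature map
    saliency: (B, 1, Hs, Ws) saliency map, e.g. from ft_saliency
    window:   assumed window size lambda around the maximum-saliency location
    Returns the channel-wise mean (AP) and standard deviation (SD) of the window.
    """
    B, C, H, W = feat.shape
    sal = F.interpolate(saliency, size=(H, W), mode="bilinear", align_corners=False)
    ap, sd = [], []
    for b in range(B):
        idx = sal[b, 0].flatten().argmax().item()
        cy, cx = divmod(idx, W)                       # most salient location
        half = window // 2
        y0, y1 = max(0, cy - half), min(H, cy + half + 1)
        x0, x1 = max(0, cx - half), min(W, cx + half + 1)
        patch = feat[b, :, y0:y1, x0:x1].reshape(C, -1)
        ap.append(patch.mean(dim=1))                  # AP: mean aggregation
        sd.append(patch.std(dim=1))                   # SD: standard-deviation aggregation
    return torch.stack(ap), torch.stack(sd)           # each (B, C)
```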
Inspired by the content attention mechanism in visual perception, this work further improves the weight allocation among the aggregated features of different frames. Since mean pooling and standard-deviation aggregation characterize the feature map differently, content weights are used to balance the two branches across frames and to better represent the intra-frame content. The interdependence between the content attention allocation mechanisms is explored with a dual-stream framework (the specific network construction is shown in Fig. 3), and the following weight allocation method is adopted:
where σ(·), δ(·) and α(·) denote the Sigmoid function, the ReLU function and inter-frame mean aggregation, respectively; W2 (W4) denotes a 1×1 convolution layer with compression ratio r = 0.5. A nonlinearity is then introduced through the ReLU function, and another 1×1 convolution layer, whose parameters are denoted W1 (W3), expands by the same ratio r to restore the original dimension. Then, through weight redistribution by the Sigmoid function, a more accurate intra-frame feature characterization is obtained.
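The weight re-allocation just described behaves like a squeeze-and-excitation block; the sketch below assumes per-frame feature vectors and implements the two 1×1 convolutions as linear layers, which is an interpretation of the text rather than the exact patented layout.

```python
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    """SE-style weight re-allocation over aggregated intra-frame features.

    One branch of this module is used per aggregation type (mean pooling or
    standard deviation).  Implementing the 1x1 convolutions W2 -> ReLU -> W1 ->
    Sigmoid as linear layers on per-frame feature vectors is an assumption
    about the exact tensor layout; r = 0.5 follows the compression ratio above.
    """
    def __init__(self, channels, r=0.5):
        super().__init__()
        hidden = max(1, int(channels * r))
        self.squeeze = nn.Linear(channels, hidden)    # W2 (W4): compress by r
        self.excite = nn.Linear(hidden, channels)     # W1 (W3): restore dimension
        self.act = nn.ReLU(inplace=True)
        self.gate = nn.Sigmoid()

    def forward(self, frame_feats):                   # (T, C): one row per frame
        context = frame_feats.mean(dim=0)             # alpha(.): inter-frame mean aggregation
        weights = self.gate(self.excite(self.act(self.squeeze(context))))
        return frame_feats * weights                  # re-weighted intra-frame features
```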
Finally, the two initial deep features are aggregated by a concatenation function to obtain the representation of the intra-frame features, where $S_t$ is the spatial feature of the current frame.
2. Double-flow global time domain modeling
The method adopts a GRU as the RNN module to introduce time sequence information, and bidirectional perception modeling is introduced first. The current hidden state $h_t$ is calculated from the current input $x_t$ and the previous hidden state $h_{t-1}$ as follows:

$h_t^f = \mathrm{GRU}(x_t^f, h_{t-1}^f), \qquad h_t^b = \mathrm{GRU}(x_t^b, h_{t-1}^b)$
where $x^f$ and $x^b$ denote the forward-serial and backward-serial inputs, respectively, and $h_0$ denotes the hidden state initialized to 0. Intra-frame features containing time sequence information are obtained through the temporal modeling of the RNN, yielding the forward timing information and the backward timing information together with the hidden state information of the last frame.
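A minimal sketch of the shallow forward/backward GRU modeling, assuming the spatial features arrive as a (batch, time, feature) tensor and treating the hidden size as a free hyper-parameter:

```python
import torch
import torch.nn as nn

class BidirectionalTemporalModel(nn.Module):
    """Shallow forward/backward temporal modeling with two GRUs: the forward GRU
    consumes the spatial features in natural order, the backward GRU consumes the
    time-reversed sequence (both start from a zero hidden state)."""
    def __init__(self, feat_dim, hidden_dim=32):
        super().__init__()
        self.gru_fwd = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.gru_bwd = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, spatial_feats):                  # (B, T, feat_dim)
        h_f, last_f = self.gru_fwd(spatial_feats)                        # forward serial input
        h_b, last_b = self.gru_bwd(torch.flip(spatial_feats, dims=[1]))  # backward serial input
        return h_f, h_b, last_f, last_b
```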
Then, considering RHT theory, reverse deeper modeling was proposed, namely:
where the first term denotes the reversal operation in the time sequence, and the results are the intra-frame features containing deep backward time-domain information and the last hidden state after the deep backward time-domain information is processed.
Finally, in order to fully exploit the time-domain correlation of the Deep GRU (DGRU) structure, a Dual Deep GRU (DDGRU) structure is adopted to increase the amount of information interaction:
where the results are the intra-frame features containing deep forward time-domain information and the final hidden state after the deep forward time-domain information is processed.
After the above processing, the four streams respectively represent the forward shallow features, the backward shallow features, the forward deep features and the backward deep features, and the overall video score Q is finally calculated through a single fully connected layer:
where $W_{hQ}$ and $b_{hQ}$ denote the weight coefficient and the bias coefficient.
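The DDGRU wiring and the final fully connected regression might be sketched as follows; feeding the time-reversed shallow outputs into the two deep GRUs and mean-pooling each stream over time before the fully connected layer are assumptions based on the textual description.

```python
import torch
import torch.nn as nn

class DualDeepGRU(nn.Module):
    """Dual-stream deeper GRU (DDGRU) sketch with final score regression.

    Shallow forward/backward GRUs process the spatial features; their outputs
    are time-reversed and fed into a second pair of GRUs to obtain the deep
    backward and deep forward streams.  The exact wiring, the temporal mean
    pooling and the hidden size are assumptions.
    """
    def __init__(self, feat_dim, hidden_dim=32):
        super().__init__()
        self.shallow_fwd = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.shallow_bwd = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.deep_bwd = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.deep_fwd = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(4 * hidden_dim, 1)            # W_hQ, b_hQ

    def forward(self, spatial_feats):                     # (B, T, feat_dim)
        h_f, _ = self.shallow_fwd(spatial_feats)                          # forward shallow
        h_b, _ = self.shallow_bwd(torch.flip(spatial_feats, dims=[1]))    # backward shallow
        h_db, _ = self.deep_bwd(torch.flip(h_f, dims=[1]))                # deep backward
        h_df, _ = self.deep_fwd(torch.flip(h_b, dims=[1]))                # deep forward
        pooled = torch.cat([h_f.mean(dim=1), h_b.mean(dim=1),
                            h_df.mean(dim=1), h_db.mean(dim=1)], dim=1)
        q = self.fc(pooled).squeeze(-1)                   # overall video score Q
        return q, (h_f, h_b, h_df, h_db)
```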
3. Double-flow deeper loss function weight allocation
Thanks to the designed structure, multiple time sequence structures (the forward, backward, forward-deep and backward-deep streams) can be fused, and deep supervision becomes straightforward. Given an input distorted video Y, the deep supervision loss is described as a weighted sum of several bypass output losses, and the total loss of the prediction is calculated as:

$L = \sum_{m=1}^{N} \alpha_m L_m + \beta\, L_{out}$
where $L_m$ is the loss function between the m-th output and the label, N = 3 in this method, $\alpha_m$ denotes the weight coefficients of the different time sequence structures, $\beta$ is a hyper-parameter obtained by learning to balance the two loss terms, and $L_{out}$ is the loss function of the final output layer.
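A sketch of the deep-supervision total loss follows; using an L1 base loss for each branch is an assumption, since the text does not state the per-branch loss. Here `side_scores` would be the three branch predictions and `final_score` the output of the fully connected layer above.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(side_scores, final_score, target, alphas, beta):
    """Weighted sum of the N bypass (side-output) losses plus the final loss:
        L = sum_m alpha_m * L_m + beta * L_out,  with N = 3 in the text.
    An L1 base loss per branch is an assumption."""
    assert len(side_scores) == len(alphas)
    side_loss = sum(a * F.l1_loss(s, target) for a, s in zip(alphas, side_scores))
    return side_loss + beta * F.l1_loss(final_score, target)
```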
The method extracts video quality features along the two dimensions of time domain and space domain. In the space domain, multi-scale feature maps are extracted and their weights are re-allocated in combination with human visual saliency perception. In the time domain, a double-flow deeper RNN structure is introduced, and forward and backward time sequence information is iterated to extract deeper bidirectional time sequence information. Finally, a deep supervision module allocates the loss functions of the different perception levels and sequences, and the final score is obtained by regression. On four UGC-VQA databases, the method achieves a further performance improvement over the current best deep learning methods. Table 1 shows the performance comparison of the different methods on the four UGC-VQA databases.
Table 1. Performance comparison of the different methods on the four UGC-VQA databases (the best and second-best results in each column are bolded and underlined, respectively)
It will be understood that the invention has been described in terms of several embodiments, and that various changes and equivalents may be made to these features and embodiments by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (3)

1. A double-flow time sequence self-adaptive selection video quality evaluation method for original videos of users is characterized by comprising the following steps:
step 1: an intra-frame quality perception module based on content weight allocation;
step 2: double-flow global time domain modeling;
adopting a GRU as the RNN module to introduce time sequence information, and first introducing bidirectional perception modeling; the current hidden state $h_t$ is calculated from the current input $x_t$ and the previous hidden state $h_{t-1}$ as follows:

$h_t^f = \mathrm{GRU}(x_t^f, h_{t-1}^f), \qquad h_t^b = \mathrm{GRU}(x_t^b, h_{t-1}^b)$
where $x^f$ and $x^b$ denote the forward-serial and backward-serial inputs, respectively, and $h_0$ denotes the hidden state initialized to 0; intra-frame features containing time sequence information are obtained through the temporal modeling of the RNN, yielding the forward timing information and the backward timing information together with the hidden state information of the last frame;
reverse deeper modeling is proposed, namely:
where the first term denotes the reversal operation in the time sequence, and the results are the intra-frame features containing deep backward time-domain information and the last hidden state after the deep backward time-domain information is processed;
using the Dual deep GRU structure:
where the results are the intra-frame features containing deep forward time-domain information and the last hidden state after the deep forward time-domain information is processed;
the four resulting streams respectively represent the forward shallow features, the backward shallow features, the forward deep features and the backward deep features, and the overall video score Q is finally calculated through a single fully connected layer:
where $W_{hQ}$ and $b_{hQ}$ denote the weight coefficient and the bias coefficient;
step 3: dual stream deeper level loss function weight allocation.
2. The dual stream timing adaptive selection video quality assessment method for user originated video according to claim 1, wherein step 1 comprises the specific steps of:
in content-aware feature extraction, a feature extraction module with multiple perception levels is adopted; for intra-frame features, hierarchical low-level texture/color features and high-level semantic features are taken into account simultaneously, expressed as follows:
ResNet-50 is used as the backbone network, and the resulting intra-frame feature maps of different perception levels describe the frame from different angles; non-uniform sampling based on saliency perception is then introduced into the intra-frame feature maps, expressed as follows:
the result denotes the perception feature map after the saliency-aware characteristic is introduced, and Saliency denotes the salient-point window feature extraction proposed to balance feature scale against intra-frame perceptual fine granularity; the saliency map is extracted by the FT method, whose formula is:

$S(x,y) = \lVert I_\mu - I_{\omega_{hc}}(x,y) \rVert$

where $I_{\omega_{hc}}$ denotes the Lab color-space map of the original image after Gaussian filtering, $I_\mu$ denotes the per-channel mean in the Lab color space, and $\lVert\cdot\rVert$ denotes the Euclidean norm;
non-uniform extraction is carried out on the intra-frame feature maps by adopting two modes, mean aggregation and standard-deviation aggregation, taking window-type features around the most salient point:
where AP denotes mean aggregation, SD denotes standard-deviation aggregation, and $[\cdot]_\lambda$ denotes extraction over a window of size $\lambda$ centered on the most salient point; the three-scale depth feature maps are combined through mean pooling and standard-deviation aggregation to generate two initial multi-scale depth features;
By exploring the interdependence between the content attention allocation mechanisms with a dual-stream framework, the following weight allocation approach is adopted:
where σ(·), δ(·) and α(·) denote the Sigmoid function, the ReLU function and inter-frame mean aggregation, respectively, and W2 (W4) denotes a 1×1 convolution layer with compression ratio r = 0.5; a nonlinearity is then introduced through the ReLU function, and another 1×1 convolution layer, whose parameters are denoted W1 (W3), expands by the same ratio r to restore the original dimension; finally, a more accurate intra-frame feature characterization is obtained through weight redistribution by the Sigmoid function;
Finally, the two re-weighted initial features are aggregated by a concatenation function to obtain the representation of the intra-frame features, where $S_t$ is the spatial feature of the current frame.
3. The method for dual stream timing adaptive selection video quality assessment of user originated video according to claim 2, wherein step 3 specifically comprises the steps of:
given an input distorted video Y, the deep supervision loss is described as a weighted sum of several bypass output losses, and the total loss of the prediction is calculated as:

$L = \sum_{m=1}^{N} \alpha_m L_m + \beta\, L_{out}$
where $L_m$ is the loss function between the m-th output and the label, N = 3, $\alpha_m$ denotes the weight coefficients of the different time sequence structures, $\beta$ is a hyper-parameter obtained by learning to balance the two loss terms, and $L_{out}$ is the loss function of the final output layer.
CN202110753105.4A 2021-07-02 2021-07-02 Double-flow time sequence self-adaptive selection video quality evaluation method for original video of user Active CN113487564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110753105.4A CN113487564B (en) 2021-07-02 2021-07-02 Double-flow time sequence self-adaptive selection video quality evaluation method for original video of user

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110753105.4A CN113487564B (en) 2021-07-02 2021-07-02 Double-flow time sequence self-adaptive selection video quality evaluation method for original video of user

Publications (2)

Publication Number Publication Date
CN113487564A CN113487564A (en) 2021-10-08
CN113487564B (en) 2024-04-05

Family

ID=77939709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110753105.4A Active CN113487564B (en) 2021-07-02 2021-07-02 Double-flow time sequence self-adaptive selection video quality evaluation method for original video of user

Country Status (1)

Country Link
CN (1) CN113487564B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023155032A1 (en) * 2022-02-15 2023-08-24 华为技术有限公司 Image processing method and image processing apparatus
CN114943286B (en) * 2022-05-20 2023-04-07 电子科技大学 Unknown target discrimination method based on fusion of time domain features and space domain features
CN115278303B (en) * 2022-07-29 2024-04-19 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103763552A (en) * 2014-02-17 2014-04-30 福州大学 Stereoscopic image non-reference quality evaluation method based on visual perception characteristics
CN112784698A (en) * 2020-12-31 2021-05-11 杭州电子科技大学 No-reference video quality evaluation method based on deep spatiotemporal information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2925819A1 (en) * 2007-12-21 2009-06-26 Thomson Licensing Sas DOUBLE-BY-MACROBLOC PASTE CODING METHOD

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103763552A (en) * 2014-02-17 2014-04-30 福州大学 Stereoscopic image non-reference quality evaluation method based on visual perception characteristics
CN112784698A (en) * 2020-12-31 2021-05-11 杭州电子科技大学 No-reference video quality evaluation method based on deep spatiotemporal information

Also Published As

Publication number Publication date
CN113487564A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN113487564B (en) Double-flow time sequence self-adaptive selection video quality evaluation method for original video of user
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN109800710B (en) Pedestrian re-identification system and method
CN112954312B (en) Non-reference video quality assessment method integrating space-time characteristics
CN110807757B (en) Image quality evaluation method and device based on artificial intelligence and computer equipment
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN111026914A (en) Training method of video abstract model, video abstract generation method and device
Fu et al. Twice mixing: a rank learning based quality assessment approach for underwater image enhancement
CN113222855B (en) Image recovery method, device and equipment
CN111127331A (en) Image denoising method based on pixel-level global noise estimation coding and decoding network
CN112434608A (en) Human behavior identification method and system based on double-current combined network
Sun et al. Learning local quality-aware structures of salient regions for stereoscopic images via deep neural networks
Hou et al. No-reference video quality evaluation by a deep transfer CNN architecture
CN111275638A (en) Face restoration method for generating confrontation network based on multi-channel attention selection
CN112597824A (en) Behavior recognition method and device, electronic equipment and storage medium
CN112950480A (en) Super-resolution reconstruction method integrating multiple receptive fields and dense residual attention
CN113658091A (en) Image evaluation method, storage medium and terminal equipment
CN115546162A (en) Virtual reality image quality evaluation method and system
CN111242068A (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN112818958B (en) Action recognition method, device and storage medium
Shen et al. MBA-RainGAN: A multi-branch attention generative adversarial network for mixture of rain removal
CN117611467A (en) Low-light image enhancement method capable of balancing details and brightness of different areas simultaneously
Tang et al. MPCFusion: Multi-scale parallel cross fusion for infrared and visible images via convolution and vision Transformer
Xu et al. Blind image quality assessment by pairwise ranking image series
CN117633526A (en) Cross-modal multi-quality evaluation data set joint training method and system

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant