CN116071691B - Video quality evaluation method based on content perception fusion characteristics - Google Patents
- Publication number
- CN116071691B (application CN202310343979.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention provides a video quality evaluation method based on content perception fusion characteristics, which comprises the following steps: step 1, constructing a multi-directional differential second-order differential Gaussian filter feature extraction module for extracting features of an input image; step 2, building a residual feature extraction network model based on the multi-directional differential second-order differential Gaussian filter feature extraction module and a deep convolutional neural network, and feeding the video frame by frame into the residual feature extraction network model to obtain the content perception features of each frame image; step 3, reducing the dimension of the content perception features, inputting them into a gated recurrent neural network GRU, and modeling the long-term dependency to obtain the quality elements and weights of the video at different moments; and step 4, determining the final quality score of the video based on the quality elements and weights at different moments. By extracting the content perception features of the video, the proposed video quality evaluation method achieves a more accurate video quality evaluation effect.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a video quality evaluation method based on content perception fusion characteristics.
Background
In recent years, with the widespread use of intelligent devices in production and daily life, a huge amount of video material is generated every day. Owing to the limitations of real-world environments and hardware performance, however, video quality inevitably degrades to different degrees, which can make the video unusable in practical application scenarios. Quality evaluation of a video is therefore necessary before it is applied in a real scene.
Commonly used video quality evaluation methods fall into two categories: subjective quality evaluation and objective quality evaluation. Subjective quality evaluation relies on people to score videos of different quality. This approach is direct and simple, but it is constrained by limited manpower and time, and the subjective bias of different viewers toward the same video segment means there is no unified scoring standard, so it cannot be applied on a large scale in practice.
Objective video quality assessment can be divided into three categories, namely full-reference, reduced-reference and no-reference, according to whether the original lossless video is available. Since lossless reference video is rarely available in real application scenarios, no-reference video quality evaluation has become the focus of current research. With the continuous progress of deep learning, the technology is increasingly applied in practice. Although some no-reference quality evaluation methods already exist, several obstacles remain: human visual characteristics are not fully considered, traditional methods require extracting a large number of hand-crafted features, which is time-consuming and labor-intensive, and the various kinds of feature information in an image are not fully exploited.
Disclosure of Invention
Aiming at the problems existing in the prior art, a video quality evaluation method based on content perception fusion characteristics is provided. The content perception features of the video frames are obtained through a multi-directional differential second-order differential Gaussian filter feature extraction module and a deep convolutional neural network; a gated recurrent neural network GRU then models the long-term dependency to obtain quality scores, and the video quality is determined by combining these scores with weights.
The technical solution adopted by the invention is as follows. A video quality evaluation method based on content perception fusion characteristics comprises the following steps:
step 1, constructing a multi-directional differential second-order differential Gaussian filter feature extraction module for extracting features of an input image;
step 2, building a residual feature extraction network model based on the multi-directional differential second-order differential Gaussian filter feature extraction module and a deep convolutional neural network, and feeding the video frame by frame into the residual feature extraction network model to obtain the content perception features of each frame image;
step 3, reducing the dimension of the content perception features, inputting them into a gated recurrent neural network GRU, and modeling the long-term dependency to obtain the quality elements and weights of the video at different moments;
and step 4, determining the final quality score of the video based on the quality elements and weights at different moments.
Further, the substeps of the step 1 are as follows:
step 1.1, constructing a multidirectional differential second-order differential Gaussian kernel and a directional derivative thereof; in construction, the number of directions is preferably 8;
and 1.2, performing convolution operation on the input image and the multi-direction second-order differential Gaussian directional derivative to finish characteristic information extraction.
Further, the substep of the step 2 is as follows:
step 2.1, frame-by-frame splitting is carried out on an input video to obtain T RGB three-channel color images;
step 2.2, uniformly scaling the obtained image to 224 pixels by 224 pixels;
step 2.3, passing the image obtained in step 2.2 through a 2D convolution layer to obtain image features with dimension 112×112×64;
step 2.4, inputting the image obtained in step 2.2 into the multi-directional differential second-order differential Gaussian filter feature extraction module for feature extraction, fusing the extracted features with the features output in step 2.3 to obtain fused features of dimension 112×112×72, and restoring the channel number to 64 by a convolution operation on the fused features;
step 2.5, sending the 64-channel fused features to a maximum pooling layer, the output features having dimension 56×56×64;
step 2.6, establishing a Bottleneck convolution structure, and inputting the output features of step 2.5 into the Bottleneck convolution structure to obtain output features W_t, where W_t comprises a plurality of feature maps and t ranges from 1 to T;
step 2.7, subjecting each feature map in W_t to spatial global pooling, namely the joint operation of spatial global average pooling and spatial global standard deviation pooling, to obtain the content perception features of the feature map.
Further, in the step 2.5, the kernel size of the maximum pooling layer is 3×3, the stride is 2, and the padding is 1.
Further, the specific process of establishing the Bottleneck convolution structure in the step 2.6 is as follows:
step 2.6.1, setting a 2D convolution layer Conv_2D_2 with C1 convolution kernels of size 1×1, stride 1 and padding 0;
step 2.6.2, setting a 2D convolution layer Conv_2D_3 with C1 convolution kernels of size 7×7, stride 1 and padding 1;
step 2.6.3, setting a 2D convolution layer Conv_2D_4 with C2 convolution kernels of size 1×1, stride 1 and padding 0;
step 2.6.4, sequentially connecting the 2D convolution layers Conv_2D_2, Conv_2D_3 and Conv_2D_4 to obtain a convolution module named the Bottleneck-A structure;
step 2.6.5, setting the numbers of convolution kernels of the three 2D convolution layers in the Bottleneck-A structure to 2C1, 2C1 and 2C2 to obtain the Bottleneck-B structure; similarly, setting the numbers of convolution kernels to 4C1, 4C1, 4C2 and to 8C1, 8C1, 8C2 to obtain the Bottleneck-C and Bottleneck-D structures;
step 2.6.6, connecting 3 Bottleneck-A structures, 4 Bottleneck-B structures, 6 Bottleneck-C structures and 3 Bottleneck-D structures in sequence to obtain the Bottleneck convolution structure.
Further, the substep of the step 3 is as follows:
step 3.1, performing dimension reduction on the content perception features through a fully connected layer FC_1 to obtain dimension-reduced features;
step 3.2, sending the dimension-reduced features into a gated recurrent neural network GRU, which can integrate the features and learn the long-term dependency;
step 3.3, taking the hidden-layer state of the GRU network as the integrated feature and computing the hidden-layer state at time t to obtain the integrated feature at time t;
step 3.4, inputting the integrated feature into a fully connected layer FC_2 to obtain the quality score at time t;
step 3.5, taking the lowest quality score among the previous frames as the memory quality element at time t;
step 3.6, constructing the current quality element at frame t by weighting the quality scores of the following frames, assigning larger weights to frames with low quality scores.
Further, in the step 3.5, the memory quality element is computed as
l_t = min_{k∈Ω_t} q_k,
where l_t denotes the memory quality element, Ω_t denotes the index set of the previous moments considered at time t, q_t and q_k denote the quality scores at time t and time k, and s is a hyperparameter associated with time t that determines the extent of Ω_t.
Further, in the step 3.6, the current quality element is computed as
m_t = Σ_{k∈Ω'_t} w_k · q_k, with w_k = e^{−q_k} / Σ_{j∈Ω'_t} e^{−q_j},
where m_t is the current quality element, the weights w_k are defined using a softmin function, Ω'_t denotes the index set of the relevant moments, and e denotes the natural constant.
Further, the substep of the step 4 is as follows:
step 4.1, linearly combining the memory quality element with the current quality element to obtain the approximate quality score at the subjective frame moment;
and 4.2, carrying out temporal global average pooling on the approximate quality scores to obtain the final video quality score.
Further, in the step 4.1, the approximate quality score is computed as
q'_t = r · l_t + (1 − r) · m_t,
where q'_t denotes the approximate quality score, l_t denotes the memory quality element, m_t denotes the current quality element, and r is a hyperparameter that balances the contributions of the memory quality element and the current quality element.
Compared with the prior art, the beneficial effects of adopting the technical scheme include:
1. the constructed multi-direction differential second-order differential Gaussian filter characteristic extraction module can extract rich edge characteristic information in the image.
2. The feature extraction network model obtained by combining the constructed feature extraction module with the deep convolutional neural network has the capability of identifying different content information.
3. The recurrent neural network GRU can effectively model long-term dependency relationship of quality elements at different moments in video.
Therefore, the video quality evaluation method provided by the invention can realize more accurate video quality evaluation effect.
Drawings
Fig. 1 is a flowchart of a video quality evaluation method according to the present invention.
Fig. 2 is a schematic diagram of extracting content-aware features according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of modeling long-term dependency and evaluating video quality according to an embodiment of the present invention.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar modules or modules having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application. On the contrary, the embodiments of the present application include all alternatives, modifications, and equivalents as may be included within the spirit and scope of the appended claims.
Example 1
Aiming at the defects of the prior art, namely that human visual characteristics are not fully considered, that traditional methods require extracting a large number of hand-crafted features, which is time-consuming and labor-intensive, and that the various kinds of feature information in an image are not fully exploited, and referring to fig. 1, this embodiment provides a video quality evaluation method based on content perception fusion characteristics, which comprises the following steps:
step 1, constructing a multi-directional differential second-order differential Gaussian filter feature extraction module for extracting features of an input image;
step 2, building a residual feature extraction network model based on the multi-directional differential second-order differential Gaussian filter feature extraction module and a deep convolutional neural network, and feeding the video frame by frame into the residual feature extraction network model to obtain the content perception features of each frame image;
step 3, reducing the dimension of the content perception features, inputting them into a gated recurrent neural network GRU, and modeling the long-term dependency to obtain the quality elements and weights of the video at different moments;
and step 4, determining the final quality score of the video based on the quality elements and weights at different moments.
In step 1 of this embodiment, the number of directions is selected to be 8, and gradient information of different angles of the image is obtained through a multi-direction differential second-order differential gaussian filtering characteristic extraction module.
In the embodiment, gradient information is extracted through the multi-directional differential second-order differential Gaussian filter characteristic extraction module established in the step 1, and content perception characteristics are extracted through cooperation with the deep convolutional neural network, wherein the multi-directional differential second-order differential Gaussian filter characteristic extraction module and the deep convolutional neural network can form a residual characteristic extraction network model.
The gated recurrent neural network GRU in step 3 can integrate the features and learn the long-term dependency.
Example 2
On the basis of embodiment 1, this embodiment further describes a multi-directional differential second order differential gaussian filter feature extraction module and a feature extraction method in step 1, which are specifically as follows:
construction of a multidirectional differentiated second order differential Gaussian kernelAnd its directional derivative>The method is specifically as follows:
wherein,,and->Respectively representing the abscissa and the ordinate of pixels in the image; />Representing the differentiation factor; />;/>,/>The selected angle value is represented, and the calculation formula is as follows:the value range of m is +.>,/>The number of the selected directions is represented, and the value range of M is any positive integer.
In this embodiment, the number of directions M = 8 is selected to obtain gradient information of the image at different angles. In the feature extraction, the input image I(x, y) is convolved with the multi-directional second-order differential Gaussian directional derivatives to extract the feature information.
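For illustration only, the following Python sketch (NumPy/PyTorch) shows one way such a bank of multi-directional second-order differential Gaussian kernels could be built and applied by convolution. The exact kernel formula, the scale σ, the kernel size, the angle spacing θ_m = mπ/M, the stride and the grayscale input are assumptions, since the corresponding formulas of the embodiment are not reproduced here.

```python
import numpy as np
import torch
import torch.nn.functional as F

def second_order_dog_kernels(num_dirs=8, sigma=1.5, ksize=7):
    """Build num_dirs second-order derivative-of-Gaussian kernels.

    Assumption: the m-th angle is theta_m = m*pi/num_dirs, and each kernel is
    the second derivative of a 2-D Gaussian taken along that direction.
    """
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    kernels = []
    for m in range(num_dirs):
        theta = m * np.pi / num_dirs
        u = x * np.cos(theta) + y * np.sin(theta)          # coordinate along theta
        g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
        d2 = (u ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * g  # d^2/du^2 of the Gaussian
        d2 -= d2.mean()                                    # zero-mean, edge-type filter
        kernels.append(d2)
    k = torch.tensor(np.stack(kernels), dtype=torch.float32)
    return k.unsqueeze(1)                                  # (num_dirs, 1, ksize, ksize)

def extract_gradient_features(img, kernels, stride=2):
    """Convolve a grayscale image (B, 1, H, W) with the directional kernels.

    Stride 2 is an assumption chosen so that a 224x224 input yields
    112x112xnum_dirs maps, matching the fusion with Conv_2D_1 in step 2.4.
    """
    pad = kernels.shape[-1] // 2
    return F.conv2d(img, kernels, stride=stride, padding=pad)

# usage: 8 directions, as preferred in step 1.1
kernels = second_order_dog_kernels(num_dirs=8)
frame = torch.rand(1, 1, 224, 224)          # stand-in for one video frame
feat = extract_gradient_features(frame, kernels)
print(feat.shape)                           # torch.Size([1, 8, 112, 112])
```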
Example 3
On the basis of embodiment 1 or 2, as shown in fig. 2, the specific process of extracting the content perception feature in step 2 is further described, and it should be noted that the feature extraction module in fig. 2 refers to a multi-direction differential second order differential gaussian filter feature extraction module:
step 2.1, for an input video material, carrying out frame-by-frame splitting on the video material to obtain T RGB three-channel color images;
step 2.2, uniformly scaling each obtained image I_t, where t ranges from 1 to T, to 224 pixels by 224 pixels through an image resizing operation;
step 2.3, setting a 2D convolution layer Conv_2D_1 with 64 convolution kernels of size 7×7, stride 2 and padding 3; after passing through Conv_2D_1, the image obtained in step 2.2 yields an output of dimension 112×112×64;
step 2.4, inputting the image obtained in step 2.2 into the multi-directional differential second-order differential Gaussian filter feature extraction module for feature extraction, the output features having dimension 112×112×8; performing a concat feature fusion operation with the features output in step 2.3 to obtain fused features of dimension 112×112×72; and then sending the fused features into a 1×1×64 convolution to restore the channel number to 64;
step 2.5, sending the 64-channel fused features into a maximum pooling layer with kernel size 3×3, stride 2 and padding 1, the output features having dimension 56×56×64;
step 2.6, establishing a Bottleneck convolution structure, and inputting the output features of step 2.5 into the Bottleneck convolution structure to obtain output features W_t, where W_t comprises a plurality of feature maps and t ranges from 1 to T;
step 2.7, subjecting each feature map in W_t to spatial global pooling (SpatialGP), namely the joint operation of spatial global average pooling (GP_mean) and spatial global standard deviation pooling (GP_std), to obtain the feature F_t of the frame.
The feature F_t, obtained by fusing the multi-directional differential second-order differential Gaussian filter feature extraction module with the deep convolutional neural network, has the ability to distinguish information of different content, and the feature therefore has content perception properties.
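For illustration only, the following sketch shows how the content perception feature F_t of step 2.7 could be computed from the feature maps W_t of one frame. Concatenating the two pooled statistics as the "joint operation", and the stand-in channel and spatial sizes, are assumptions.

```python
import torch

def content_aware_feature(w_t: torch.Tensor) -> torch.Tensor:
    """Spatial global average + standard-deviation pooling (step 2.7).

    w_t: feature maps of one frame with shape (C, H, W).
    Returns a 2C-dimensional vector; concatenating GP_mean and GP_std is an
    assumed reading of the joint operation described above.
    """
    mean = w_t.mean(dim=(1, 2))            # GP_mean: one value per channel
    std = w_t.std(dim=(1, 2))              # GP_std:  one value per channel
    return torch.cat([mean, std], dim=0)   # F_t, the content perception feature

# usage: a stand-in for the Bottleneck output of one frame
w_t = torch.rand(2048, 7, 7)               # hypothetical channel/spatial sizes
f_t = content_aware_feature(w_t)
print(f_t.shape)                           # torch.Size([4096])
```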
Example 4
On the basis of embodiment 3, this embodiment presents the specific construction process of the Bottleneck convolution structure, as follows:
step 2.6.1, setting a 2D convolution layer Conv_2D_2 with C1 convolution kernels of size 1×1, stride 1 and padding 0;
step 2.6.2, setting a 2D convolution layer Conv_2D_3 with C1 convolution kernels of size 7×7, stride 1 and padding 1;
step 2.6.3, setting a 2D convolution layer Conv_2D_4 with C2 convolution kernels of size 1×1, stride 1 and padding 0;
step 2.6.4, sequentially connecting the 2D convolution layers Conv_2D_2, Conv_2D_3 and Conv_2D_4 to obtain a convolution module named the Bottleneck-A structure;
step 2.6.5, setting the numbers of convolution kernels of the three 2D convolution layers in the Bottleneck-A structure to 2C1, 2C1 and 2C2 to obtain the Bottleneck-B structure; similarly, setting the numbers of convolution kernels to 4C1, 4C1, 4C2 and to 8C1, 8C1, 8C2 to obtain the Bottleneck-C and Bottleneck-D structures;
step 2.6.6, connecting 3 Bottleneck-A structures, 4 Bottleneck-B structures, 6 Bottleneck-C structures and 3 Bottleneck-D structures in sequence to obtain the Bottleneck convolution structure.
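For illustration only, the following PyTorch sketch builds the described stack of 3 Bottleneck-A, 4 Bottleneck-B, 6 Bottleneck-C and 3 Bottleneck-D structures. The values of C1 and C2 are placeholders, the 7×7 convolution is padded with 3 here (rather than the stated 1) so that the spatial size is preserved through the 16-block stack, and normalisation, activations and any residual shortcuts of the full network are omitted.

```python
import torch
import torch.nn as nn

def bottleneck_block(in_ch, mid_ch, out_ch):
    """One Bottleneck block as in steps 2.6.1-2.6.4: 1x1 -> 7x7 -> 1x1 conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=1, padding=0),   # Conv_2D_2
        nn.Conv2d(mid_ch, mid_ch, kernel_size=7, stride=1, padding=3),  # Conv_2D_3
        nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1, padding=0),  # Conv_2D_4
    )

def build_bottleneck_stack(c1=16, c2=64, in_ch=64):
    """3 x Bottleneck-A, 4 x B, 6 x C, 3 x D (step 2.6.6).

    C1 and C2 are free parameters in the description; the defaults here are
    placeholders chosen only to keep the sketch lightweight.
    """
    stages = [(3, c1, c2), (4, 2 * c1, 2 * c2),
              (6, 4 * c1, 4 * c2), (3, 8 * c1, 8 * c2)]
    blocks, ch = [], in_ch
    for count, mid, out in stages:
        for _ in range(count):
            blocks.append(bottleneck_block(ch, mid, out))
            ch = out
    return nn.Sequential(*blocks)

# usage: the 56x56x64 max-pooled fusion feature from step 2.5
net = build_bottleneck_stack()
w_t = net(torch.rand(1, 64, 56, 56))
print(w_t.shape)      # torch.Size([1, 512, 56, 56]) with the placeholder C2
```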
Example 5
On the basis of embodiment 3 or 4, as shown in fig. 3, this embodiment further describes the specific procedure of modeling the long-term dependency and acquiring the quality elements using the gated recurrent neural network. Specifically:
Step 3.1, the content perception feature F_t is reduced in dimension through a fully connected layer FC_1 to obtain the dimension-reduced feature X_t:
X_t = W_f1 · F_t + b_f1,
where W_f1 and b_f1 are the two parameters of the fully connected layer FC_1, representing the scaling (weight) term and the bias term, respectively.
Step 3.2, the dimension reduction characteristics are sent into a gate control recurrent neural network GRU which can integrate and adjust and learn long-term dependency;
step 3.3, calculating the hidden layer state at the time t by taking the hidden layer state of the GRU network as the comprehensive characteristic to obtain the integrated characteristic;
in this embodiment, the hidden layer has an initial value ofh 0 Hidden layer integration feature at time th t From input features X at time t t And the hidden layer at the previous momenth t-1 And (3) calculating to obtain:
Step 3.4, the integrated feature h_t is input into another fully connected layer FC_2 to obtain the quality score q_t at time t:
q_t = W_f2 · h_t + b_f2,
where W_f2 and b_f2 are the two parameters of the fully connected layer FC_2, representing the scaling (weight) term and the bias term, respectively.
Step 3.5, taking the lowest quality fraction in the previous frames as a memory quality element at the time t;
Wherein,,representing memory quality element->Index set representing all moments +.>、/>The quality scores at time t and time k are indicated, s being a super parameter associated with time t.
Step 3.6, in order to simulate the phenomenon that human beings have deep memory for video quality degradation and have weak perceptibility for video quality enhancement, in this embodiment, the current quality element is constructed in the t-th frameAnd weighting the quality score in the next few frames (may be made of +.>Determined) a greater weight is assigned to frames with low quality scores by:
wherein,,for the current quality element->For the weights, a softmin function definition is used,an index set indicating the relevant time, e indicating a natural constant.
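For illustration only, the following sketch covers steps 3.1 to 3.6: FC_1 dimension reduction, the GRU, FC_2 frame scores, and the memory and current quality elements. The feature dimension, hidden size, the window length s, and the exact index sets used for the minimum and the softmin weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalQualityHead(nn.Module):
    """Sketch of steps 3.1-3.4: FC_1 dimension reduction, GRU, FC_2 frame score."""

    def __init__(self, feat_dim=4096, reduced_dim=128, hidden_dim=32):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, reduced_dim)   # FC_1: dimension reduction
        self.gru = nn.GRU(reduced_dim, hidden_dim, batch_first=True)
        self.fc2 = nn.Linear(hidden_dim, 1)           # FC_2: per-frame quality score

    def forward(self, frame_feats):                   # (B, T, feat_dim)
        x = self.fc1(frame_feats)                     # X_t
        h, _ = self.gru(x)                            # integrated features h_t
        return self.fc2(h).squeeze(-1)                # q_t, shape (B, T)

def frame_quality_elements(q, s=12):
    """Steps 3.5-3.6 for one video, q: (T,) per-frame scores.

    Memory element l_t = minimum of the previous scores in a window of length s;
    current element m_t = softmin-weighted sum of the following scores, so low
    scores receive larger weights. The window length s and its symmetric use
    for both directions are assumptions.
    """
    T = q.shape[0]
    l = torch.empty(T)
    m = torch.empty(T)
    for t in range(T):
        prev = q[max(0, t - s):t + 1]                 # previous frames (incl. t)
        l[t] = prev.min()
        nxt = q[t:min(T, t + s + 1)]                  # current and following frames
        w = F.softmin(nxt, dim=0)                     # larger weight for low scores
        m[t] = (w * nxt).sum()
    return l, m

# usage with stand-in frame features of a 240-frame video
head = TemporalQualityHead()
q = head(torch.rand(1, 240, 4096))[0]
l, m = frame_quality_elements(q)
```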
Example 6
On the basis of embodiment 5, this embodiment further describes the method for obtaining the final quality score of the video in step 4, specifically:
Step 4.1, the approximate quality score at the subjective frame moment is obtained by linearly combining the memory quality element and the current quality element:
q'_t = r · l_t + (1 − r) · m_t,
where r is a hyperparameter that balances the contributions of the memory quality element and the current quality element.
Step 4.2, the approximate quality scores are subjected to temporal global average pooling to obtain the final video quality score Q.
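For illustration only, the per-frame elements can then be combined and pooled over time as in step 4; the value of r used below is a placeholder hyperparameter.

```python
import torch

def video_quality_score(l, m, r=0.5):
    """Step 4: combine memory (l_t) and current (m_t) quality elements, then
    apply temporal global average pooling. r = 0.5 is only a placeholder."""
    q_approx = r * l + (1.0 - r) * m       # approximate per-frame quality q'_t
    return q_approx.mean()                 # temporal global average pooling -> Q

# usage with stand-in per-frame elements of a 240-frame video
l, m = torch.rand(240), torch.rand(240)
print(float(video_quality_score(l, m)))
```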
Based on any of embodiments 1 to 6, the invention can be well implemented, and the quality score of a video segment can be obtained accurately.
It should be noted that, in the description of the embodiments of the present invention, unless explicitly specified and limited otherwise, the terms "disposed," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; may be directly connected or indirectly connected through an intermediate medium. The specific meaning of the above terms in the present invention will be understood in detail by those skilled in the art; the accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.
Claims (9)
1. A video quality assessment method based on content-aware fusion features, comprising:
step 1, constructing a multi-directional differential second-order differential Gaussian filter feature extraction module for extracting features of an input image;
step 2, building a residual feature extraction network model based on the multi-directional differential second-order differential Gaussian filter feature extraction module and a deep convolutional neural network, and feeding the video frame by frame into the residual feature extraction network model to obtain the content perception features of each frame image;
step 3, reducing the dimension of the content perception features, inputting them into a gated recurrent neural network GRU, and modeling the long-term dependency to obtain the quality elements and weights of the video at different moments;
step 4, determining the final quality score of the video based on quality elements and weights at different moments;
the substep of the step 3 is as follows:
step 3.1, performing dimension reduction on the content perception features through a fully connected layer FC_1 to obtain dimension-reduced features;
step 3.2, sending the dimension-reduced features into a gated recurrent neural network GRU, which can integrate the features and learn the long-term dependency;
step 3.3, taking the hidden-layer state of the GRU network as the integrated feature and computing the hidden-layer state at time t to obtain the integrated feature at time t;
step 3.4, inputting the integrated feature into a fully connected layer FC_2 to obtain the quality score at time t;
step 3.5, taking the lowest quality score among the previous frames as the memory quality element at time t;
step 3.6, constructing the current quality element at frame t by weighting the quality scores of the following frames, assigning larger weights to frames with low quality scores.
2. The method for evaluating video quality based on content aware fusion feature according to claim 1, wherein the substeps of step 1 are:
step 1.1, constructing a multidirectional differential second-order differential Gaussian kernel and a directional derivative thereof;
and 1.2, performing convolution operation on the input image and the multi-direction second-order differential Gaussian directional derivative to finish characteristic information extraction.
3. The video quality evaluation method based on content aware fusion feature according to claim 1 or 2, wherein the sub-steps of step 2 are:
step 2.1, frame-by-frame splitting is carried out on an input video to obtain T RGB three-channel color images;
step 2.2, uniformly scaling the obtained image to 224 pixels by 224 pixels;
step 2.3, passing the image obtained in step 2.2 through a 2D convolution layer to obtain image features with dimension 112×112×64;
step 2.4, inputting the image obtained in step 2.2 into the multi-directional differential second-order differential Gaussian filter feature extraction module for feature extraction, fusing the extracted features with the features output in step 2.3 to obtain fused features of dimension 112×112×72, and restoring the channel number to 64 by a convolution operation on the fused features;
step 2.5, sending the 64-channel fused features to a maximum pooling layer, the output features having dimension 56×56×64;
step 2.6, establishing a Bottleneck convolution structure, and inputting the output features of step 2.5 into the Bottleneck convolution structure to obtain output features W_t, where W_t comprises a plurality of feature maps and t ranges from 1 to T;
step 2.7, subjecting each feature map in W_t to spatial global pooling, namely the joint operation of spatial global average pooling and spatial global standard deviation pooling, to obtain the content perception features of the feature map.
4. The video quality evaluation method based on content aware fusion feature according to claim 3, wherein in the step 2.5, the kernel size of the maximum pooling layer is 3×3, the stride is 2, and the padding is 1.
5. The video quality evaluation method based on content aware fusion feature according to claim 3, wherein in the step 2.6, the specific process of establishing the Bottleneck convolution structure is as follows:
step 2.6.1, setting a 2D convolution layer Conv_2D_2 with C1 convolution kernels of size 1×1, stride 1 and padding 0;
step 2.6.2, setting a 2D convolution layer Conv_2D_3 with C1 convolution kernels of size 7×7, stride 1 and padding 1;
step 2.6.3, setting a 2D convolution layer Conv_2D_4 with C2 convolution kernels of size 1×1, stride 1 and padding 0;
step 2.6.4, sequentially connecting the 2D convolution layers Conv_2D_2, Conv_2D_3 and Conv_2D_4 to obtain a convolution module named the Bottleneck-A structure;
step 2.6.5, setting the numbers of convolution kernels of the three 2D convolution layers in the Bottleneck-A structure to 2C1, 2C1 and 2C2 to obtain the Bottleneck-B structure; similarly, setting the numbers of convolution kernels to 4C1, 4C1, 4C2 and to 8C1, 8C1, 8C2 to obtain the Bottleneck-C and Bottleneck-D structures;
step 2.6.6, connecting 3 Bottleneck-A structures, 4 Bottleneck-B structures, 6 Bottleneck-C structures and 3 Bottleneck-D structures in sequence to obtain the Bottleneck convolution structure.
6. The method for evaluating video quality based on content aware fusion feature according to claim 1, wherein in the step 3.5, the memory quality element is the minimum of the quality scores over the previous frames: l_t = min_{k∈Ω_t} q_k, where Ω_t denotes the index set of the previous moments considered at time t, q_k denotes the quality score at time k, and s is a hyperparameter associated with time t.
7. The method for evaluating video quality based on content aware fusion feature according to claim 6, wherein in step 3.6, the current quality element is m_t = Σ_{k∈Ω'_t} w_k · q_k, with softmin weights w_k = e^{−q_k} / Σ_{j∈Ω'_t} e^{−q_j}, where Ω'_t denotes the index set of the relevant moments and e denotes the natural constant.
8. The method for evaluating video quality based on content aware fusion feature according to claim 1, wherein the sub-step of step 4 is:
step 4.1, linearly combining the memory quality element with the current quality element to obtain the approximate quality score at the subjective frame moment;
and 4.2, carrying out temporal global average pooling on the approximate quality scores to obtain the final video quality score.
9. The method for evaluating video quality based on content aware fusion feature according to claim 8, wherein in the step 4.1, the approximate quality score is computed as q'_t = r · l_t + (1 − r) · m_t, where r is a hyperparameter balancing the contributions of the memory quality element and the current quality element.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310343979.1A CN116071691B (en) | 2023-04-03 | 2023-04-03 | Video quality evaluation method based on content perception fusion characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310343979.1A CN116071691B (en) | 2023-04-03 | 2023-04-03 | Video quality evaluation method based on content perception fusion characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116071691A (en) | 2023-05-05
CN116071691B (en) | 2023-06-23
Family
ID=86171795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310343979.1A Active CN116071691B (en) | 2023-04-03 | 2023-04-03 | Video quality evaluation method based on content perception fusion characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116071691B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115511858A (en) * | 2022-10-08 | 2022-12-23 | Hangzhou Dianzi University | Video quality evaluation method based on novel time sequence characteristic relation mapping
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140044197A1 (en) * | 2012-08-10 | 2014-02-13 | Yiting Liao | Method and system for content-aware multimedia streaming |
CN111833246B (en) * | 2020-06-02 | 2022-07-08 | 天津大学 | Single-frame image super-resolution method based on attention cascade network |
US11335033B2 (en) * | 2020-09-25 | 2022-05-17 | Adobe Inc. | Compressing digital images utilizing deep learning-based perceptual similarity |
CN112784698B (en) * | 2020-12-31 | 2024-07-02 | 杭州电子科技大学 | No-reference video quality evaluation method based on deep space-time information |
CN113554599B (en) * | 2021-06-28 | 2023-08-18 | 杭州电子科技大学 | Video quality evaluation method based on human visual effect |
-
2023
- 2023-04-03 CN CN202310343979.1A patent/CN116071691B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115511858A (en) * | 2022-10-08 | 2022-12-23 | Hangzhou Dianzi University | Video quality evaluation method based on novel time sequence characteristic relation mapping
Also Published As
Publication number | Publication date |
---|---|
CN116071691A (en) | 2023-05-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |