CN112954312B - Non-reference video quality assessment method integrating space-time characteristics - Google Patents

Non-reference video quality assessment method integrating space-time characteristics

Info

Publication number
CN112954312B
CN112954312B (application CN202110176125.XA)
Authority
CN
China
Prior art keywords
video
network
sub
feature extraction
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110176125.XA
Other languages
Chinese (zh)
Other versions
CN112954312A (en)
Inventor
牛玉贞
钟梦真
陈俊豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202110176125.XA priority Critical patent/CN112954312B/en
Publication of CN112954312A publication Critical patent/CN112954312A/en
Application granted granted Critical
Publication of CN112954312B publication Critical patent/CN112954312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00Diagnosis, testing or measuring for television systems or their details
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30168Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a no-reference video quality assessment method fusing spatio-temporal features, which comprises the following steps: S1, acquiring a video data set as a training set; S2, constructing a spatial-domain feature extraction sub-network and training it on a frame set downsampled from the training set; S3, constructing a temporal-domain feature extraction sub-network and training it on residual image sequences of the training set; S4, constructing a video quality assessment network from the trained spatial and temporal feature extraction sub-networks, adaptively adjusting the influence of temporal and spatial features on perceived video quality through an attention mechanism, and training it to obtain a video quality assessment model; and S5, extracting the temporal and spatial features of the video under test with the obtained model and computing its quality score. The invention can significantly improve the performance of no-reference video quality assessment.

Description

Non-reference video quality assessment method integrating space-time characteristics
Technical Field
The invention relates to the fields of image and video processing and computer vision, in particular to a non-reference video quality assessment method integrating space-time characteristics.
Background
With the growth of social media applications and the popularity of consumer capture devices, people can record their daily lives anywhere and at any time by shooting video on portable mobile devices and can share it through various media platforms. This has led to a proliferation of user-generated content (UGC) videos shared and streamed over the Internet, so accurate video quality assessment (VQA) models for consumer video are urgently needed to monitor, control and optimize this vast amount of content. In addition, because most users lack professional imaging training, their videos often contain distortions caused by camera shake, sensor noise, defocus and the like. Part of the original data is also inevitably lost when the video is encoded, decoded, stored, transmitted and processed, introducing further distortions such as noise, deformation and missing content. Distortion removes, to varying degrees, information contained in the original video, degrading viewers' perception of the video and hindering their ability to obtain information from it. For organizations that provide user-centric video services, it is critical to ensure that videos emerging from the production and distribution chain meet the quality requirements of the receiving end. A video quality assessment model can evaluate video quality according to the degree of distortion and thus provide a basis for subsequent video processing. Video quality assessment is one of the key technologies in the video processing field and is important in areas such as medicine, aviation, education and entertainment.
Video quality assessment can be divided into subjective quality assessment and objective quality assessment. Subjective quality assessment, which relies on manual scoring, is the most accurate and reasonable, but the time and manpower it consumes limit its widespread use in the real world. Researchers have therefore proposed objective quality assessment methods that automatically predict the visual quality of distorted video. According to the availability of reference information, objective quality assessment methods are divided into full-reference, reduced-reference and no-reference methods. In practical applications many videos have no reference video: for user-generated content, a perfectly distortion-free video cannot be captured in the first place, and transmitting the additional reference information would also occupy considerable bandwidth. No-reference quality assessment methods, which require no access to the original video, therefore have the widest practical application value.
Most existing no-reference video quality assessment models are mainly designed for synthetic distortions (for example, compression artifacts). Authentically distorted video differs greatly from synthetically distorted video: it may suffer from complex mixtures of real-world distortions, and the distortion within a single video may vary over time. Recent studies also show that several state-of-the-art video quality assessment methods validated on synthetically distorted data sets perform poorly on authentically distorted video data sets. In recent years, authentically distorted video quality assessment data sets have been published and real-world applications urgently demand suitable methods. The invention therefore proposes a no-reference video quality assessment method that fuses spatio-temporal features: the temporal features of a video are computed by feeding a sequence of video residual images into a 3D convolutional network, and an attention mechanism adaptively adjusts the influence of temporal and spatial distortions on perceived video quality. The model can significantly improve the performance of no-reference video quality assessment.
Disclosure of Invention
Therefore, the invention aims to provide a non-reference video quality assessment method integrating space-time characteristics, which effectively improves the efficiency and performance of non-reference video quality assessment.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A no-reference video quality assessment method fusing spatio-temporal features comprises the following steps:
Step S1: acquiring a video data set as a training set;
Step S2: constructing a spatial-domain feature extraction sub-network and training it on a frame set downsampled from the training set;
Step S3: constructing a temporal-domain feature extraction sub-network and training it on residual image sequences of the training set;
Step S4: constructing a video quality assessment network from the trained spatial and temporal feature extraction sub-networks, adaptively adjusting the influence of temporal and spatial features on perceived video quality through an attention mechanism, and training it to obtain a video quality assessment model;
Step S5: extracting the temporal and spatial features of the video under test with the obtained video quality assessment model and computing its quality score.
Further, step S2 specifically comprises:
Step S21: uniformly downsampling each video of the training set, taking one frame every f frames, and assigning the video's quality score to each sampled frame to obtain a training frame set;
Step S22: constructing the spatial feature extraction sub-network on top of an image classification network used as the backbone, and pre-training the backbone;
Step S23: with the pre-trained backbone parameters fixed, training the spatial feature extraction sub-network on the training frame set, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of all frames in the training frame set, thereby completing the training of the spatial feature extraction sub-network.
Further, the spatial feature extraction sub-network is specifically constructed as follows: VGG16, ResNet50 or DenseNet is used as the backbone network, and everything after the last convolutional layer of the backbone is replaced with the following: first, a 1×1 convolutional layer with C channels produces the spatial feature map F_s of the video frame; global average pooling and global standard-deviation pooling are then applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network serves as the spatial feature extraction sub-network of the video.
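As a concrete illustration of this sub-network, the following PyTorch sketch builds the spatial head on a ResNet-50 backbone with C=128 channels (the value used in the embodiment below); the class name, layer widths and the choice of ResNet-50 rather than VGG16 or DenseNet are illustrative assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SpatialSubNetwork(nn.Module):
    """Spatial feature extraction sub-network: pretrained backbone, 1x1 conv to C
    channels, global average + standard-deviation pooling, FC regression head."""
    def __init__(self, channels=128):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        # keep everything up to (and including) the last convolutional stage
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.conv1x1 = nn.Conv2d(2048, channels, kernel_size=1)
        self.fc = nn.Linear(2 * channels, 1)            # concat(avg, std) -> frame score

    def forward(self, frame):                           # frame: (B, 3, H, W)
        f_s = self.conv1x1(self.backbone(frame))        # spatial feature map F_s: (B, C, h, w)
        avg = f_s.mean(dim=(2, 3))                      # global average pooling
        std = f_s.std(dim=(2, 3))                       # global standard-deviation pooling
        score = self.fc(torch.cat([avg, std], dim=1))   # predicted frame quality score
        return score, f_s
```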
Further, step S3 specifically comprises:
Step S31: constructing a neural network composed of several 3D convolutional layers as the video temporal feature extraction sub-network;
Step S32: dividing each training-set video into several sub-videos and taking the sub-videos obtained from all training videos as the sub-video set, where the ground-truth quality score of each sub-video is that of the corresponding video;
Step S33: training the temporal feature extraction sub-network on the sub-video set in batches, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the temporal feature extraction sub-network.
Further, the temporal feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module. Specifically, the 3D convolution module has six 3D convolutional layers: the kernel size of the first five layers is 3×3 and that of the last layer is 1×m; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels. The output of the 3D convolution module is the temporal feature map F_t of the input sub-video. The pooling module consists of a global max pooling layer that converts F_t into a feature vector, and the regression module consists of a fully connected layer that maps the feature vector to a quality score.
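A minimal PyTorch sketch of such a temporal sub-network is given below. The patent does not specify per-layer channel widths, and the kernel sizes "3×3" and "1×m" are interpreted here as 3×3×3 kernels and a kernel that collapses the temporal axis of the m−1 residual frames; those choices, like all names, are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSubNetwork(nn.Module):
    """Temporal feature extraction sub-network: six 3D conv layers (ReLU after
    each), global max pooling, and a fully connected regression layer."""
    def __init__(self, channels=128, m=16):
        super().__init__()
        widths = [3, 32, 64, 64, 128, 128]              # illustrative channel widths
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        # last layer: "1 x m" kernel read as collapsing the temporal axis (m-1 residual frames)
        layers += [nn.Conv3d(widths[-1], channels, kernel_size=(m - 1, 1, 1)),
                   nn.ReLU(inplace=True)]
        self.conv3d = nn.Sequential(*layers)
        self.fc = nn.Linear(channels, 1)

    def forward(self, residual_clip):                   # (B, 3, m-1, H, W) residual sequence
        f_t = self.conv3d(residual_clip).squeeze(2)     # temporal feature map F_t: (B, C, H, W)
        v = torch.amax(f_t, dim=(2, 3))                 # global max pooling -> (B, C)
        return self.fc(v), f_t                          # predicted sub-video quality score
```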
Further, step S32 specifically comprises: dividing each training-set video into several equal-length sub-videos, each containing m consecutive frames, and computing the corresponding residual image sequence of each sub-video as

RF_{i~j} = F_{i+1~j} - F_{i~j-1}

where F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from the i-th to the j-th frame, and RF_{i~j} denotes the residual image sequence of that sub-video, i.e. the sequence of differences between consecutive frames.

The residual image sequence of each sub-video is fed into the network designed in step S31; the 3D convolution module produces a C×H×W temporal feature map F_t, where C, H and W are the number of channels, the height and the width of the feature map, respectively; the pooling module then yields a C×1 vector, and the regression module maps it to the quality score of the sub-video.
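The sub-video splitting and the residual computation can be expressed compactly; the sketch below assumes the frames are held as a NumPy array and is illustrative only.

```python
import numpy as np

def split_into_subvideos(video: np.ndarray, m: int) -> list:
    """Split a video of shape (T, H, W, 3) into consecutive, non-overlapping
    sub-videos of m frames each (any trailing remainder is dropped)."""
    return [video[k * m:(k + 1) * m] for k in range(video.shape[0] // m)]

def residual_sequence(sub_video: np.ndarray) -> np.ndarray:
    """Residual image sequence RF_{i~j} = F_{i+1~j} - F_{i~j-1} of an m-frame
    sub-video, i.e. the m-1 differences between consecutive frames. Static
    objects and background largely cancel, leaving motion-specific information."""
    frames = sub_video.astype(np.float32)
    return frames[1:] - frames[:-1]
```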
Further, the video quality assessment network comprises a spatial feature extraction module, a temporal feature extraction module, an attention module, and subsequent pooling and fully connected layers. The spatial feature extraction module is the backbone network plus the 1×1 convolutional layer of the trained spatial feature extraction sub-network, and the temporal feature extraction module is the 3D convolution module of the trained temporal feature extraction sub-network.
Further, the video quality assessment network is constructed and trained as follows:
The m spatial feature maps of a sub-video (one per frame) are averaged to obtain the spatial feature map F_s of that sub-video; F_t and F_s are then concatenated into a preliminary spatio-temporal feature map F_st.
The attention module combines fused attention and spatial attention. First, a fused attention map is computed from the spatio-temporal feature map F_st: average pooling and max pooling are used to aggregate the spatial information of each feature map of F_st, yielding F_st^avg and F_st^max; both are passed through a shared multi-layer perceptron, the two results are added, and a sigmoid function yields the fused attention map A_f.
Next, the spatial attention map of the spatio-temporal features is computed: A_f is broadcast to A'_f with the same size as F_st, A'_f is multiplied element-wise with the original feature map F_st to obtain a new feature map F'_st, and F'_st is used to generate the spatial attention map A_s.
Specifically, average pooling and max pooling are applied to F'_st along the channel dimension to obtain F'_st^avg and F'_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s.
The spatial attention map A_s is multiplied element-wise with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion.
F_fusion is converted into a C-dimensional vector F_v by a global pooling scheme, and F_v is finally regressed to the sub-video quality score by a fully connected layer.
The parameters of the corresponding parts of the trained spatial feature extraction sub-network are used as the parameters of the spatial feature extraction module, and the parameters of the corresponding parts of the trained temporal feature extraction sub-network are used as the parameters of the temporal feature extraction module.
With the parameters of the spatial and temporal feature extraction modules fixed, the video quality assessment network is trained on the sub-video set.
The optimal model parameters are learned by minimizing the mean squared error between the predicted and ground-truth quality scores of all sub-videos, completing the training of the video quality assessment network.
Further, A_f is computed as

A_f = σ( MLP(F_st^avg) + MLP(F_st^max) )

where σ denotes the sigmoid function and MLP is the shared multi-layer perceptron, each layer of which is followed by a ReLU activation function.

A_s is computed as

F'_st = A'_f ⊗ F_st,  A_s = σ( Conv( [AvgPool(F'_st); MaxPool(F'_st)] ) )

where ⊗ denotes element-wise multiplication, [;] denotes concatenation along the channel dimension, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
Further, the global pooling scheme that converts the spatio-temporal feature map F_fusion into the C-dimensional vector F_v is as follows: average pooling and standard-deviation pooling are applied to the first C/2 feature maps of F_fusion to obtain two vectors, which are reduced to C/2 dimensions by a fully connected layer to maintain feature balance and denoted F_sv; the last C/2 feature maps of F_fusion are max-pooled to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v.
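The attention module and the global pooling head described above can be sketched in PyTorch as follows. The concatenated map is assumed to have 2C channels and is split in half for the dual-branch pooling, the spatial-attention convolution uses a 7×7 kernel, and the MLP reduction ratio is 8; these details, like all names, are assumptions where the patent is silent.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses the sub-video spatial map F_s and temporal map F_t, re-weights the
    result with a fused attention map A_f and a spatial attention map A_s, then
    pools the fused map to a sub-video quality score."""
    def __init__(self, channels=128, reduction=8):
        super().__init__()
        c = 2 * channels                                    # F_st = concat(F_s, F_t)
        self.mlp = nn.Sequential(                           # shared MLP for A_f
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c))
        self.conv_spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # for A_s
        self.fc_balance = nn.Linear(c, c // 2)              # reduce the avg+std branch
        self.fc_score = nn.Linear(c, 1)

    def forward(self, f_s, f_t):                            # both (B, C, H, W)
        f_st = torch.cat([f_s, f_t], dim=1)                 # (B, 2C, H, W)
        # fused attention A_f: spatial avg / max pooling per map -> shared MLP -> sigmoid
        a_f = torch.sigmoid(self.mlp(f_st.mean(dim=(2, 3))) +
                            self.mlp(f_st.amax(dim=(2, 3))))
        f_st = f_st * a_f[:, :, None, None]                 # F'_st = A'_f (x) F_st
        # spatial attention A_s: channel-wise avg / max pooling -> conv -> sigmoid
        pooled = torch.cat([f_st.mean(dim=1, keepdim=True),
                            f_st.amax(dim=1, keepdim=True)], dim=1)
        a_s = torch.sigmoid(self.conv_spatial(pooled))
        f_fusion = f_st * a_s                                # final spatio-temporal map
        # dual-branch global pooling: first half avg+std pooled, second half max pooled
        half = f_fusion.shape[1] // 2
        sv = self.fc_balance(torch.cat([f_fusion[:, :half].mean(dim=(2, 3)),
                                        f_fusion[:, :half].std(dim=(2, 3))], dim=1))
        tv = f_fusion[:, half:].amax(dim=(2, 3))
        return self.fc_score(torch.cat([sv, tv], dim=1))     # sub-video quality score
```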
Compared with the prior art, the invention has the following beneficial effects:
1. The spatial feature extraction module extracts deep semantic features to address the content dependence of predicted video quality. The temporal feature extraction module uses video residual images instead of RGB frames, removing static objects and background information so as to capture more motion-specific information. The attention module fuses the spatio-temporal features and adaptively adjusts the influence of spatial and temporal distortions on perceived video quality, which significantly improves the performance of no-reference video quality assessment.
2. The model applies well to videos suffering from complex mixtures of real-world distortions and therefore has broad practical application value.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a non-reference video quality assessment model incorporating spatio-temporal features in an embodiment of the present invention;
fig. 3 is a block diagram of a time domain feature extraction sub-network in an example of the invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
Referring to fig. 1, the present invention provides a no-reference video quality assessment method fusing spatio-temporal features, comprising the following steps:
Step S1: acquiring a video data set and randomly dividing it into a training set (80%) and a test set (20%) according to a preset proportion;
s2, constructing a space domain feature extraction sub-network, and training a frame set based on downsampling of a training set;
step S21, uniformly downsampling each video of a training set, wherein the sampling frequency is that one frame is taken for each f frames, and the quality fraction of the video is taken as the quality fraction of each frame to obtain a training frame set;
s22, constructing an airspace feature extraction sub-network according to the image classification network as a main network, and pre-training;
step S23, training the space domain feature extraction sub-network according to the training frame set by fixing the pre-trained parameters in the main network, and learning the optimal parameters of the model by minimizing the mean square error loss between the predicted quality scores and the real quality scores of all frames in the training frame set, so as to complete the training process of the space domain feature extraction sub-network.
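A possible training loop for step S23 is sketched below, assuming the SpatialSubNetwork sketch given earlier (with a `backbone` attribute) and a dataset that yields (frame, score) pairs; the optimizer, learning rate, batch size and epoch count are illustrative, since the patent only specifies frozen backbone parameters and a mean-squared-error loss.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_spatial_subnetwork(model, frame_dataset, epochs=10, lr=1e-4, device="cuda"):
    """Train the spatial sub-network on the downsampled frame set with MSE loss,
    keeping the pretrained backbone frozen (only the 1x1 conv and FC head learn)."""
    model = model.to(device)
    for p in model.backbone.parameters():          # fix the pre-trained backbone parameters
        p.requires_grad = False
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    criterion = nn.MSELoss()
    loader = DataLoader(frame_dataset, batch_size=32, shuffle=True)
    for _ in range(epochs):
        for frames, scores in loader:              # each frame inherits its video's quality score
            frames, scores = frames.to(device), scores.to(device).float()
            pred, _ = model(frames)                # SpatialSubNetwork returns (score, F_s)
            loss = criterion(pred.squeeze(1), scores)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```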
Step S3: constructing a temporal-domain feature extraction sub-network and training it on residual image sequences of the training set, specifically:
Step S31: constructing a neural network composed of several 3D convolutional layers as the video temporal feature extraction sub-network;
Step S32: dividing each training-set video into several sub-videos and taking the sub-videos obtained from all training videos as the sub-video set, where the ground-truth quality score of each sub-video is that of the corresponding video.
Preferably, each training-set video is divided into several equal-length sub-videos, each containing m consecutive frames, and the corresponding residual image sequence of each sub-video is computed as

RF_{i~j} = F_{i+1~j} - F_{i~j-1}

where F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from the i-th to the j-th frame, and RF_{i~j} denotes the residual image sequence of that sub-video.

The residual image sequence of each sub-video is fed into the network designed in step S31; the 3D convolution module produces a C×H×W temporal feature map F_t, where C, H and W are the number of channels, the height and the width of the feature map, respectively; the pooling module then yields a C×1 vector, and the regression module maps it to the quality score of the sub-video.
Step S33: training the temporal feature extraction sub-network on the sub-video set in batches, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the temporal feature extraction sub-network.
Step S4: constructing a video quality assessment network from the trained spatial and temporal feature extraction sub-networks, adaptively adjusting the influence of temporal and spatial features on perceived video quality through an attention mechanism, and training it to obtain a video quality assessment model;
Step S5: extracting the temporal and spatial features of the video under test with the obtained video quality assessment model and computing its quality score.
Preferably, in this embodiment, the spatial feature extraction sub-network is constructed as follows: VGG16, ResNet50 or DenseNet is used as the backbone network, and everything after the last convolutional layer of the backbone is replaced with the following: first, a 1×1 convolutional layer with C channels (C=128) produces the spatial feature map F_s of the video frame; global average pooling and global standard-deviation pooling are then applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame. The modified network serves as the spatial feature extraction sub-network of the video.
Preferably, in this embodiment, the temporal feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module. Specifically, the 3D convolution module has six 3D convolutional layers: the kernel size of the first five layers is 3×3 and that of the last layer is 1×m; each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels. The output of the 3D convolution module is the temporal feature map F_t of the input sub-video. The pooling module consists of a global max pooling layer that converts F_t into a feature vector, and the regression module consists of a fully connected layer that maps the feature vector to a quality score.
Preferably, in this embodiment, the video quality assessment network comprises a spatial feature extraction module, a temporal feature extraction module, an attention module, and subsequent pooling and fully connected layers. The spatial feature extraction module is the backbone network plus the 1×1 convolutional layer of the trained spatial feature extraction sub-network, and the temporal feature extraction module is the 3D convolution module of the trained temporal feature extraction sub-network.
The video quality assessment network is constructed and trained as follows:
The m spatial feature maps of a sub-video (one per frame) are averaged to obtain the spatial feature map F_s of that sub-video; F_t and F_s are then concatenated into a preliminary spatio-temporal feature map F_st.
The attention module combines fused attention and spatial attention. First, a fused attention map is computed from the spatio-temporal feature map F_st: average pooling and max pooling are used to aggregate the spatial information of each feature map of F_st, yielding F_st^avg and F_st^max; both are passed through a shared multi-layer perceptron, the two results are added, and a sigmoid function yields the fused attention map A_f:

A_f = σ( MLP(F_st^avg) + MLP(F_st^max) )

where σ denotes the sigmoid function and MLP is the shared multi-layer perceptron, each layer of which is followed by a ReLU activation function.
Next, the spatial attention map of the spatio-temporal features is computed: A_f is broadcast to A'_f with the same size as F_st, A'_f is multiplied element-wise with the original feature map F_st to obtain a new feature map F'_st, and F'_st is used to generate the spatial attention map A_s. Average pooling and max pooling are applied to F'_st along the channel dimension to obtain F'_st^avg and F'_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function:

F'_st = A'_f ⊗ F_st,  A_s = σ( Conv( [AvgPool(F'_st); MaxPool(F'_st)] ) )

where ⊗ denotes element-wise multiplication, [;] denotes concatenation along the channel dimension, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
The spatial attention map A_s is multiplied element-wise with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion.
F_fusion is converted into a C-dimensional vector F_v by a global pooling scheme: average pooling and standard-deviation pooling are applied to the first C/2 feature maps of F_fusion to obtain two vectors, which are reduced to C/2 dimensions by a fully connected layer to maintain feature balance and denoted F_sv; the last C/2 feature maps of F_fusion are max-pooled to obtain a C/2-dimensional vector denoted F_tv; F_sv and F_tv are then concatenated to obtain the C-dimensional vector F_v, and F_v is finally regressed to the sub-video quality score by a fully connected layer.
The parameters of the corresponding parts of the trained spatial feature extraction sub-network are used as the parameters of the spatial feature extraction module, and the parameters of the corresponding parts of the trained temporal feature extraction sub-network are used as the parameters of the temporal feature extraction module.
With the parameters of the spatial and temporal feature extraction modules fixed, the video quality assessment network is trained on the sub-video set.
The optimal model parameters are learned by minimizing the mean squared error between the predicted and ground-truth quality scores of all sub-videos, completing the training of the video quality assessment network.
In this embodiment, step S5 specifically comprises:
Step S51: dividing each video under test into several sub-videos by the method of step S32, each containing m consecutive frames.
Step S52: first inputting the frames of the sub-video to the spatial feature extraction module, then inputting the sub-video to the temporal feature extraction module, and finally predicting the quality score of the sub-video with the video quality assessment network.
Step S53: taking the mean of the predicted quality scores of all sub-videos of a video as the predicted quality score of that video.
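Steps S51 to S53 can be summarized in an inference routine such as the one below, which reuses the earlier sketches (SpatialSubNetwork, TemporalSubNetwork, AttentionFusion); resizing F_t to the spatial size of F_s before fusion is an added assumption, as the patent does not state how the two maps are aligned.

```python
import numpy as np
import torch

def predict_video_quality(video, spatial_net, temporal_net, fusion_net, m=16, device="cuda"):
    """Split the test video into m-frame sub-videos, run the spatial, temporal and
    fusion modules on each sub-video, and average the sub-video scores.
    `video` is a (T, H, W, 3) uint8 array; modules are assumed to be in eval mode."""
    scores = []
    with torch.no_grad():
        for clip in (video[k * m:(k + 1) * m] for k in range(video.shape[0] // m)):
            clip = clip.astype(np.float32) / 255.0
            frames = torch.from_numpy(clip).permute(0, 3, 1, 2).to(device)       # (m, 3, H, W)
            _, f_s_frames = spatial_net(frames)                                   # per-frame F_s maps
            f_s = f_s_frames.mean(dim=0, keepdim=True)                            # sub-video F_s: (1, C, h, w)
            residual = torch.from_numpy(clip[1:] - clip[:-1])                     # residual image sequence
            residual = residual.permute(3, 0, 1, 2).unsqueeze(0).to(device)       # (1, 3, m-1, H, W)
            _, f_t = temporal_net(residual)                                       # sub-video F_t: (1, C, H, W)
            f_t = torch.nn.functional.interpolate(f_t, size=f_s.shape[-2:])       # align spatial sizes (assumption)
            scores.append(fusion_net(f_s, f_t).item())                            # sub-video quality score
    return float(np.mean(scores))                                                 # video score = mean of sub-video scores
```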
The foregoing description is only of the preferred embodiments of the invention, and all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (4)

1. A no-reference video quality assessment method fusing spatio-temporal features, characterized by comprising the following steps:
Step S1: acquiring a video data set as a training set;
Step S2: constructing a spatial-domain feature extraction sub-network and training it on a frame set downsampled from the training set;
Step S3: constructing a temporal-domain feature extraction sub-network and training it on residual image sequences of the training set;
Step S4: constructing a video quality assessment network from the trained spatial and temporal feature extraction sub-networks, adaptively adjusting the influence of temporal and spatial features on perceived video quality through an attention mechanism, and training it to obtain a video quality assessment model;
Step S5: extracting the temporal and spatial features of the video under test with the obtained video quality assessment model and computing its quality score;
wherein step S2 specifically comprises:
Step S21: uniformly downsampling each video of the training set, taking one frame every f frames, and assigning the video's quality score to each sampled frame to obtain a training frame set;
Step S22: constructing the spatial feature extraction sub-network on top of an image classification network used as the backbone, and pre-training the backbone;
Step S23: with the pre-trained backbone parameters fixed, training the spatial feature extraction sub-network on the training frame set, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of all frames in the training frame set, thereby completing the training of the spatial feature extraction sub-network;
wherein the spatial feature extraction sub-network is specifically constructed as follows: VGG16, ResNet50 or DenseNet is used as the backbone network, and everything after the last convolutional layer of the backbone is replaced with the following: first, a 1×1 convolutional layer with C channels produces the spatial feature map F_s of the video frame; global average pooling and global standard-deviation pooling are then applied to F_s, the two pooled vectors are concatenated, and a fully connected layer maps the concatenated vector to the quality score of the video frame, the modified network serving as the spatial feature extraction sub-network of the video;
wherein step S3 specifically comprises:
Step S31: constructing a neural network composed of several 3D convolutional layers as the video temporal feature extraction sub-network;
Step S32: dividing each training-set video into several sub-videos and taking the sub-videos obtained from all training videos as a sub-video set, wherein the ground-truth quality score of each sub-video is that of the corresponding video;
Step S33: training the temporal feature extraction sub-network on the sub-video set in batches, and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of the sub-videos, thereby completing the training of the temporal feature extraction sub-network;
wherein the temporal feature extraction sub-network consists, in order, of a 3D convolution module, a pooling module and a regression module; specifically, the 3D convolution module has six 3D convolutional layers, the kernel size of the first five layers is 3×3 and that of the last layer is 1×m, each convolutional layer is followed by a ReLU activation function, and the last 3D convolutional layer has C channels; the output of the 3D convolution module is the temporal feature map F_t of the input sub-video; the pooling module consists of a global max pooling layer that converts F_t into a feature vector; and the regression module consists of a fully connected layer that maps the feature vector to a quality score;
wherein step S32 specifically comprises: dividing each training-set video into several equal-length sub-videos, each containing m consecutive frames, and computing the corresponding residual image sequence of each sub-video as

RF_{i~j} = F_{i+1~j} - F_{i~j-1}

where F_i denotes the i-th frame of the video, F_{i~j} denotes the sub-video from the i-th to the j-th frame, and RF_{i~j} denotes the residual image sequence of that sub-video;

the residual image sequence of each sub-video is input into the network designed in step S31; the 3D convolution module produces a C×H×W temporal feature map F_t, wherein C, H and W are the number of channels, the height and the width of the feature map, respectively; the pooling module then yields a C×1 vector, and the regression module maps it to the quality score of the sub-video;
wherein the video quality assessment network is constructed and trained as follows:
the m spatial feature maps of a sub-video are averaged to obtain the spatial feature map F_s of that sub-video, and F_t and F_s are then concatenated into a preliminary spatio-temporal feature map F_st;
the attention module combines fused attention and spatial attention: first, a fused attention map is computed from the spatio-temporal feature map F_st, wherein average pooling and max pooling are used to aggregate the spatial information of each feature map of F_st to obtain F_st^avg and F_st^max, both are passed through a shared multi-layer perceptron, the two results are added, and a sigmoid function yields the fused attention map A_f;
the spatial attention map of the spatio-temporal features is then computed: A_f is broadcast to A'_f with the same size as F_st, A'_f is multiplied element-wise with the original feature map F_st to obtain a new feature map F'_st, and F'_st is used to generate the spatial attention map A_s;
average pooling and max pooling are applied to F'_st along the channel dimension to obtain F'_st^avg and F'_st^max, which are concatenated and passed through a convolutional layer and a sigmoid function to generate the spatial attention map A_s;
the spatial attention map A_s is multiplied element-wise with the spatio-temporal features F'_st to obtain the final spatio-temporal feature map F_fusion;
F_fusion is converted into a C-dimensional vector F_v by a global pooling method, and F_v is finally regressed to the sub-video quality score by a fully connected layer;
using the parameters of the corresponding parts of the trained spatial feature extraction sub-network as the parameters of the spatial feature extraction module, and using the parameters of the corresponding parts of the trained temporal feature extraction sub-network as the parameters of the temporal feature extraction module;
fixing the parameters of the spatial and temporal feature extraction modules and training the video quality assessment network on the sub-video set;
and learning the optimal model parameters by minimizing the mean squared error between the predicted and ground-truth quality scores of all sub-videos, thereby completing the training of the video quality assessment network.
2. The no-reference video quality assessment method fusing spatio-temporal features according to claim 1, wherein the video quality assessment network comprises a spatial feature extraction module, a temporal feature extraction module, an attention module, and subsequent pooling and fully connected layers; the spatial feature extraction module is the backbone network plus the 1×1 convolutional layer of the trained spatial feature extraction sub-network, and the temporal feature extraction module is the 3D convolution module of the trained temporal feature extraction sub-network.
3. The no-reference video quality assessment method fusing spatio-temporal features according to claim 1, wherein A_f is computed as

A_f = σ( MLP(F_st^avg) + MLP(F_st^max) )

wherein σ denotes the sigmoid function and MLP is the shared multi-layer perceptron, each layer of which is followed by a ReLU activation function;

and A_s is computed as

F'_st = A'_f ⊗ F_st,  A_s = σ( Conv( [AvgPool(F'_st); MaxPool(F'_st)] ) )

wherein ⊗ denotes element-wise multiplication, [;] denotes concatenation along the channel dimension, σ denotes the sigmoid function, and Conv denotes the convolutional layer.
4. The no-reference video quality assessment method fusing spatio-temporal features according to claim 1, wherein converting the spatio-temporal feature map F_fusion into the C-dimensional vector F_v by the global pooling method specifically comprises: applying average pooling and standard-deviation pooling to the first C/2 feature maps of F_fusion to obtain two vectors, which are reduced to C/2 dimensions by a fully connected layer to maintain feature balance and denoted F_sv; max-pooling the last C/2 feature maps of F_fusion to obtain a C/2-dimensional vector denoted F_tv; and concatenating F_sv and F_tv to obtain the C-dimensional vector F_v.
CN202110176125.XA 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics Active CN112954312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110176125.XA CN112954312B (en) 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110176125.XA CN112954312B (en) 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics

Publications (2)

Publication Number Publication Date
CN112954312A CN112954312A (en) 2021-06-11
CN112954312B true CN112954312B (en) 2024-01-05

Family

ID=76244601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110176125.XA Active CN112954312B (en) 2021-02-07 2021-02-07 Non-reference video quality assessment method integrating space-time characteristics

Country Status (1)

Country Link
CN (1) CN112954312B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554599B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 Video quality evaluation method based on human visual effect
CN113473117B (en) * 2021-07-19 2022-09-02 上海交通大学 Non-reference audio and video quality evaluation method based on gated recurrent neural network
CN113822856B (en) * 2021-08-16 2024-06-21 南京中科逆熵科技有限公司 End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation
CN113810683B (en) * 2021-08-27 2023-07-18 南京信息工程大学 No-reference evaluation method for objectively evaluating underwater video quality
CN113784113A (en) * 2021-08-27 2021-12-10 中国传媒大学 No-reference video quality evaluation method based on short-term and long-term time-space fusion network and long-term sequence fusion network
CN113642513B (en) * 2021-08-30 2022-11-18 东南大学 Action quality evaluation method based on self-attention and label distribution learning
CN113837047B (en) * 2021-09-16 2022-10-28 广州大学 Video quality evaluation method, system, computer equipment and storage medium
CN114697648B (en) * 2022-04-25 2023-12-08 上海为旌科技有限公司 Variable frame rate video non-reference evaluation method, system, electronic equipment and storage medium
CN115278303B (en) * 2022-07-29 2024-04-19 腾讯科技(深圳)有限公司 Video processing method, device, equipment and medium
CN117676121A (en) * 2022-08-24 2024-03-08 腾讯科技(深圳)有限公司 Video quality assessment method, device, equipment and computer storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023227A (en) * 2014-05-28 2014-09-03 宁波大学 Objective video quality evaluation method based on space domain and time domain structural similarities
CN109325435A (en) * 2018-09-15 2019-02-12 天津大学 Video actions identification and location algorithm based on cascade neural network
CN110135369A (en) * 2019-05-20 2019-08-16 威创集团股份有限公司 A kind of Activity recognition method, system, equipment and computer readable storage medium
CN110837842A (en) * 2019-09-12 2020-02-25 腾讯科技(深圳)有限公司 Video quality evaluation method, model training method and model training device
CN111784694A (en) * 2020-08-20 2020-10-16 中国传媒大学 No-reference video quality evaluation method based on visual attention mechanism
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106028026B (en) * 2016-05-27 2017-09-05 宁波大学 A kind of efficient video assessment method for encoding quality based on space-time domain structure
US10417498B2 (en) * 2016-12-30 2019-09-17 Mitsubishi Electric Research Laboratories, Inc. Method and system for multi-modal fusion model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023227A (en) * 2014-05-28 2014-09-03 宁波大学 Objective video quality evaluation method based on space domain and time domain structural similarities
CN109325435A (en) * 2018-09-15 2019-02-12 天津大学 Video actions identification and location algorithm based on cascade neural network
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN110135369A (en) * 2019-05-20 2019-08-16 威创集团股份有限公司 A kind of Activity recognition method, system, equipment and computer readable storage medium
CN110837842A (en) * 2019-09-12 2020-02-25 腾讯科技(深圳)有限公司 Video quality evaluation method, model training method and model training device
CN111784694A (en) * 2020-08-20 2020-10-16 中国传媒大学 No-reference video quality evaluation method based on visual attention mechanism
CN112085102A (en) * 2020-09-10 2020-12-15 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
No-reference video quality assessment based on spatio-temporal features and attention mechanism; 朱泽, 桑庆兵, 张浩; Laser & Optoelectronics Progress; Vol. 57, No. 18; 181509-1 to 181509-9 *

Also Published As

Publication number Publication date
CN112954312A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN112954312B (en) Non-reference video quality assessment method integrating space-time characteristics
CN108898579B (en) Image definition recognition method and device and storage medium
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
Moorthy et al. Visual quality assessment algorithms: what does the future hold?
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
Bovik Automatic prediction of perceptual image and video quality
Tian et al. A multi-order derivative feature-based quality assessment model for light field image
US11928793B2 (en) Video quality assessment method and apparatus
CN112995652B (en) Video quality evaluation method and device
Xu et al. Perceptual quality assessment of internet videos
Siahaan et al. Semantic-aware blind image quality assessment
CN113191495A (en) Training method and device for hyper-resolution model and face recognition method and device, medium and electronic equipment
CN111047543A (en) Image enhancement method, device and storage medium
Shen et al. An end-to-end no-reference video quality assessment method with hierarchical spatiotemporal feature representation
Li et al. Recent advances and challenges in video quality assessment
Min et al. Perceptual video quality assessment: A survey
Da et al. Perceptual quality assessment of nighttime video
Jenadeleh Blind Image and Video Quality Assessment
Ying et al. Telepresence video quality assessment
Shi et al. Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality Token
Kim et al. No‐reference quality assessment of dynamic sports videos based on a spatiotemporal motion model
Wang et al. Blind Multimodal Quality Assessment of Low-light Images
Huong et al. An Effective Foveated 360° Image Assessment Based on Graph Convolution Network
WO2024041268A1 (en) Video quality assessment method and apparatus, and computer device, computer storage medium and computer program product
Niu et al. Blind consumer video quality assessment with spatial-temporal perception and fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant