CN113822856B - End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation


Info

Publication number
CN113822856B
CN113822856B (application CN202110945647.1A)
Authority
CN
China
Prior art keywords
stage
video
quality
feature
time
Prior art date
Legal status
Active
Application number
CN202110945647.1A
Other languages
Chinese (zh)
Other versions
CN113822856A (en)
Inventor
杨峰
周明亮
沈文昊
咸伟志
江蔚
纪程
隋修宝
Current Assignee
Nanjing Zhongke Inverse Entropy Technology Co ltd
Original Assignee
Nanjing Zhongke Inverse Entropy Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Zhongke Inverse Entropy Technology Co., Ltd.
Priority to CN202110945647.1A
Publication of CN113822856A
Application granted
Publication of CN113822856B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30168 Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention discloses an end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation. First, the video is preprocessed: the original video is divided into non-overlapping temporal segments, and the segments are cropped so that the regions at the same position in each segment form a video block, which serves as the input of the neural network. Second, the neural network is trained: features are extracted from the input video segments, and a series of spatio-temporal feature maps extracted stage by stage are output. Then, the feature maps of each stage are fed into a convolutional neural network and a recurrent neural network to obtain stage quality feature vectors of the same dimension. Finally, the quality score of each stage is calculated, and the global quality score of the video sequence is calculated in combination with an attention model. The invention uses three-dimensional convolution layers to form the feature extractor, so the network can effectively extract spatio-temporal features and thereby detect the degradation patterns of the video.

Description

End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation
Technical Field
The invention relates to the technical field of video quality evaluation in video coding, in particular to an end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation.
Background
The demand for video services has grown exponentially over the past few years. Cisco predicts that in the coming years video traffic will account for 80-90% of total network data traffic. With the development of communication technology, two thirds of mobile data is delivered to various multimedia mobile devices to meet consumer demand. This flexible digital lifestyle means that consumers expect to enjoy high-quality multimedia content anytime, wherever they are.
Digital video can suffer various distortions during acquisition, processing, compression, storage, and transmission, which degrade its visual quality; the purpose of video quality assessment (VQA) is to predict the perceived quality of the video. A good quality evaluation method can not only evaluate video quality automatically and accurately, but also monitor transmission in real time and guide parameter updating and algorithm optimization, thereby better serving video transmission.
Broadly, video quality evaluation methods are divided into three types: full-reference (FR), reduced-reference (RR), and no-reference (NR). A no-reference method evaluates the objective quality of a distorted video when the original lossless video is not available, which makes it the most difficult to research: 1) averaging the distortions of single frames gives poor accuracy; 2) spatial distortions induced by motion are hard to perceive; 3) interactions between spatio-temporal artifacts are difficult to estimate.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the invention aims to effectively fuse spatio-temporal video features and to establish a high-performance non-reference video quality evaluation method based on deep learning and hierarchical spatio-temporal feature representation. The evaluation method is accurate and efficient, makes reasonable use of the semantic information of the intermediate layers of the neural network, and can be jointly optimized as an end-to-end overall architecture.
In order to achieve the above purpose, the technical scheme of the invention is as follows: an end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation comprises the following steps:
Preprocessing video: preprocessing video: dividing an original video into non-overlapping time slices, and cutting the time slices into blocks, wherein the regions with the same positions in each slice form a video block;
Training a neural network to obtain a first network model with a video space-time feature extraction function, and carrying out staged feature extraction on an input video time segment by using the first network model;
training a convolutional neural network and a recurrent neural network to obtain a second network model with a video spatio-temporal feature fusion function, and inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimension for each stage;
Calculating the quality score of each stage with a third network model having a video quality score calculation function, and calculating the global quality score of the video sequence in combination with the attention model in the third network model.
Preferably, the first network model comprises J three-dimensional convolution layers, J generalized divisive normalization (GDN) layers and J max-pooling layers, and each stage comprises a three-dimensional convolution layer, a GDN layer and a max-pooling layer connected in sequence; the first network model employs a linear rectification unit (ReLU) as the activation function.
Preferably, the second network model includes J branches, each branch comprising a plurality of spatial-domain convolution layers, a plurality of gated recurrent units and a uniform pooling layer, with a linear rectification unit as the activation function; the features of each stage are input into the corresponding branch to obtain the quality feature vector of that branch, where J is the total number of stages.
Preferably, the third network model comprises J fully connected layers for stage quality regression, an attention model consisting of two fully connected layers, a linear rectification unit and a sigmoid (S-shaped growth curve) unit, and a fully connected layer for global quality regression.
Compared with the prior art, the invention has the remarkable advantages that:
the method can effectively fuse the spatio-temporal features of the video, makes reasonable use of the semantic information of the intermediate layers of the neural network, and can be jointly optimized as an end-to-end overall architecture; its accuracy and reliability are superior to those of other existing objective video quality evaluation methods. The invention uses three-dimensional convolution layers to form the feature extractor, so the network can effectively extract spatio-temporal features and thereby detect the degradation patterns of the video.
Drawings
FIG. 1 is a basic flow chart of the present invention;
Detailed Description
The following describes the implementation of the present invention in detail with reference to the accompanying drawings.
The invention relates to an end-to-end non-reference video quality evaluation method based on hierarchical spatio-temporal feature representation, which, as shown in FIG. 1, comprises the following steps:
Step 1: preprocessing video: dividing an original video into non-overlapping time slices, and cutting the time slices into blocks, wherein the regions with the same positions in each slice form a video block which is used as the input of a neural network;
Specifically, in order to fully extract the temporal and spatial information of the videos, each video is divided into segments of temporal length 8; at the same time, each video frame is cut into uniform, non-overlapping image blocks. Depending on the resolution, several video blocks of temporal length 8 and spatial size 256×256×3 (the height, width and number of channels of a video frame) can be extracted from each video.
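A minimal sketch of this preprocessing, assuming the video is given as a NumPy array of shape (T, H, W, 3); the clip length of 8 and the 256×256 crop size follow the description above, while the function name and variable names are illustrative only:

```python
import numpy as np


def preprocess_video(video: np.ndarray, clip_len: int = 8, crop: int = 256) -> np.ndarray:
    """Split a (T, H, W, 3) video into non-overlapping clips of `clip_len` frames,
    then cut every frame into non-overlapping `crop` x `crop` patches; the patches
    at the same spatial position within one clip form a video block of shape
    (clip_len, crop, crop, 3), which is one network input."""
    t, h, w, _ = video.shape
    blocks = []
    for ci in range(t // clip_len):                       # non-overlapping time segments
        clip = video[ci * clip_len:(ci + 1) * clip_len]
        for y in range(0, h - crop + 1, crop):            # non-overlapping rows
            for x in range(0, w - crop + 1, crop):        # non-overlapping columns
                blocks.append(clip[:, y:y + crop, x:x + crop, :])
    return np.stack(blocks)                               # (N, clip_len, crop, crop, 3)


# e.g. 16 frames at 512x512 yield 2 time segments x 4 spatial positions = 8 blocks
dummy = np.random.rand(16, 512, 512, 3).astype(np.float32)
print(preprocess_video(dummy).shape)                      # (8, 8, 256, 256, 3)
```

In this sketch, border regions that do not fill a complete 256×256 patch are simply discarded.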
Step 2: training a neural network to obtain a first network model with a video space-time feature extraction function, and carrying out staged feature extraction on an input video time segment by using the first network model;
Specifically, the first network model comprises J three-dimensional convolution layers, J generalized divisive normalization (GDN) layers and J max-pooling layers, and each stage consists of a three-dimensional convolution layer, a GDN layer and a max-pooling layer connected in sequence; the first network model employs a linear rectification unit (ReLU) as the activation function. The output of each stage is as follows:
X_j = CNN_j(X_{j-1}), j ∈ [1, 2, …, J],
where CNN_j denotes the three-dimensional convolution layer of the j-th stage, X_j denotes the output of the j-th stage, X_0 is the input video block, and J is the total number of stages.
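As a concrete sketch of this extractor (assuming a PyTorch implementation with J = 4 stages; the channel widths, kernel sizes, the simplified per-channel GDN and the exact placement of the ReLU are assumptions not fixed by the description):

```python
import torch
import torch.nn as nn


class SimpleGDN(nn.Module):
    """Simplified generalized divisive normalization: y = x / sqrt(beta + gamma * x^2).
    (Per-channel parameters only; a full GDN uses a cross-channel weight matrix.)"""
    def __init__(self, channels: int):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(1, channels, 1, 1, 1))
        self.gamma = nn.Parameter(0.1 * torch.ones(1, channels, 1, 1, 1))

    def forward(self, x):
        return x / torch.sqrt(self.beta + self.gamma * x.pow(2) + 1e-6)


class Stage(nn.Module):
    """One stage of the first network model: Conv3d -> GDN -> ReLU -> MaxPool3d."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
            SimpleGDN(c_out),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # halve H and W, keep the temporal length
        )

    def forward(self, x):                                 # X_j = CNN_j(X_{j-1})
        return self.block(x)


class HierarchicalExtractor(nn.Module):
    """First network model: J cascaded stages whose intermediate outputs X_1..X_J are all kept."""
    def __init__(self, widths=(3, 16, 32, 64, 128)):      # J = 4 stages with assumed channel widths
        super().__init__()
        self.stages = nn.ModuleList(Stage(widths[j], widths[j + 1])
                                    for j in range(len(widths) - 1))

    def forward(self, x):                                 # x: (N, 3, T, H, W) video block
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                                      # [X_1, ..., X_J]


block = torch.randn(1, 3, 8, 256, 256)                    # one 8-frame 256x256 video block
for j, x_j in enumerate(HierarchicalExtractor()(block), start=1):
    print(f"stage {j}:", tuple(x_j.shape))
```

Keeping every intermediate output X_j is what makes the representation hierarchical: each stage feeds its own branch in the second network model.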
Step 3: training a convolutional neural network and a recurrent neural network to obtain a second network model with a video spatio-temporal feature fusion function, and inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimension for each stage;
Specifically, each branch of the second network model comprises a plurality of spatial-domain convolution layers, a plurality of gated recurrent units and a uniform pooling layer, with a linear rectification unit as the activation function; each branch finally yields the quality feature vector of that branch.
The specific process of inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimension for each stage is as follows:
Step 31: spatial feature fusion is carried out by using a plurality of airspace convolution layers which are connected in sequence so as to obtain features with consistent dimensions:
Where Φ j (·) represents a series of spatial convolution layers, with kernels of 3 x 3 size, zero padding and step sizes of 2 x2 size, In the j-th stage, the characteristic diagram with the time slice number k is expressed as/>And in the j-th stage, the feature vector with the sequence number k of the time slice after the spatial information fusion is represented.
Step 32: given frame level featuresA global maximization layer (denoted GP max) is used to obtain efficient features and reduce spatial redundancy. Simultaneously using a gating loop unit (Gate Recurrent Unit, GRU), the frame-level features are refined by integrating temporal information:
Wherein the method comprises the steps of And in the j-th stage, the feature vector with the time slice sequence number k after the time domain information fusion is represented.
Step 33: obtaining the feature vector after the temporal and spatial information fusion at the stage by using uniform pooling
Step 34: repeating steps 31-33 for each stage to obtain the quality feature vectors with the same dimension of each stage.
Step 4: and respectively calculating the quality scores of all stages by using the third network model with the function of calculating the video quality scores, and calculating the global quality score of the video sequence by combining the attention model in the third network model.
Specifically, the third network model comprises J fully connected layers for stage quality regression (one per stage), an attention model consisting of two fully connected layers, a linear rectification unit and a sigmoid unit, and a fully connected layer for global quality regression.
Further, the specific process of calculating the global quality score of the video sequence is as follows:
Step 41: the quality feature vector of each stage is fed into a fully connected layer to obtain the stage-wise quality scores of the video:
q_j = FC_j(H_j),
where H_j is the quality feature vector of stage j, FC_j(·) denotes the fully connected layer that takes this vector as input, and q_j is the quality score of that stage.
Step 42: because the learned model tends to over-fit a particular scene in the training set, the attention model is used to obtain a corresponding weight vector, resulting in features that have a greater impact on perceived quality. The attention model consists of two full-connection layers, a linear rectifying unit and an S-shaped growth curve unit, and the calculation mode is as follows:
HW=Sigmoid(FCw2(ReLu(FCw1(H)))),
Wherein, Representing the join operation, FC w1 (. Cndot.) and FC w2 (. Cndot.) represent fully joined layers, sigmoid (. Cndot.) and ReLu (. Cndot.) represent linear rectification functions and S-type growth curve functions, respectively, and H, H W represents eigenvectors and weight vectors, respectively.
Step 43: inputting the global quality feature vector into the full connection layer to obtain the global quality score:
Q=FC(H⊙HW).
Wherein, the parity element is multiplied correspondingly.
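A sketch of the third network model covering steps 41-43, under the assumptions of one scalar score per stage, J = 4 stages, 128-dimensional stage feature vectors and a small hidden width in the attention block (none of these values are fixed by the description):

```python
import torch
import torch.nn as nn


class QualityRegression(nn.Module):
    """Third network model: one fully connected head per stage gives q_j (step 41),
    an attention block FC_w1 -> ReLU -> FC_w2 -> Sigmoid weights the concatenated
    stage scores H (step 42), and a final fully connected layer regresses the
    global score Q = FC(H * H_W) (step 43)."""
    def __init__(self, n_stages: int = 4, feat_dim: int = 128, attn_hidden: int = 16):
        super().__init__()
        self.stage_fc = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(n_stages))
        self.attn = nn.Sequential(
            nn.Linear(n_stages, attn_hidden),
            nn.ReLU(),
            nn.Linear(attn_hidden, n_stages),
            nn.Sigmoid(),
        )
        self.global_fc = nn.Linear(n_stages, 1)

    def forward(self, stage_vectors):                     # list of J stage feature vectors H_j
        q = torch.cat([fc(h_j.unsqueeze(0))               # q_j = FC_j(H_j)
                       for fc, h_j in zip(self.stage_fc, stage_vectors)], dim=1)
        h_w = self.attn(q)                                # H_W: one weight per stage
        return self.global_fc(q * h_w).squeeze()          # Q = FC(H ⊙ H_W)


stage_vectors = [torch.randn(128) for _ in range(4)]      # outputs of the four branches
print(QualityRegression()(stage_vectors).item())
```

The sigmoid keeps every stage weight in (0, 1), so the final layer regresses a weighted combination of the stage scores rather than their plain sum.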
The foregoing detailed description presents only one embodiment of the invention in detail and is not intended to limit the scope of the invention. It should be noted that persons skilled in the art may make several variations and modifications without departing from the spirit of the invention, and such alternatives and modifications also fall within the scope of the invention.

Claims (3)

1. The end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation is characterized by comprising the following steps of:
Preprocessing video: dividing an original video into non-overlapping time slices, and cutting the time slices into blocks, wherein the regions with the same positions in each slice form a video block;
Training a neural network to obtain a first network model with a video space-time feature extraction function, and carrying out staged feature extraction on an input video time segment by using the first network model;
The first network model comprises J three-dimensional convolution layers, J generalized divisive normalization layers and J max-pooling layers, and each stage comprises a three-dimensional convolution layer, a generalized divisive normalization layer and a max-pooling layer connected in sequence; the first network model adopts a linear rectification unit as the activation function, and J is the total number of stages;
training a convolutional neural network and a recurrent neural network to obtain a second network model with a video spatio-temporal feature fusion function, and inputting the feature maps of each stage into the second network model to obtain feature vectors of the same dimension for each stage;
The second network model comprises J branches, each branch comprising a plurality of spatial-domain convolution layers, a plurality of gated recurrent units and a uniform pooling layer, with a linear rectification unit as the activation function; the features of each stage are respectively input into the corresponding branch to obtain the quality feature vector of that branch, wherein J is the total number of stages;
calculating the quality score of each stage by using a third network model, obtained through training, with a video quality score calculation function, and calculating the global quality score of the video sequence in combination with the attention model in the third network model;
the third network model comprises J fully connected layers for stage quality regression, an attention model consisting of two fully connected layers, a linear rectification unit and a sigmoid (S-shaped growth curve) unit, and a fully connected layer for global quality regression;
The specific process for calculating the global quality score of the video sequence is as follows:
Step 41: the quality feature vector of each stage is input into a fully connected layer to obtain the stage-wise quality scores of the video:
q_j = FC_j(H_j),
where H_j denotes the quality feature vector of stage j, FC_j(·) denotes the fully connected layer that takes this vector as input, and q_j is the quality score of that stage;
step 42: the quality score of each stage is input into the attention model to obtain a corresponding weight vector, so that features with a larger influence on perceived quality are obtained; the calculation is as follows:
H = q_1 ⊕ q_2 ⊕ … ⊕ q_J,
H_W = Sigmoid(FC_w2(ReLU(FC_w1(H)))),
where ⊕ denotes the concatenation operation, FC_w1(·) and FC_w2(·) denote fully connected layers, ReLU(·) and Sigmoid(·) denote the linear rectification function and the sigmoid (S-shaped growth curve) function respectively, and H and H_W denote the feature vector and the weight vector respectively;
step 43: inputting the global quality feature vector into the full connection layer to obtain the global quality score:
Q = FC(H ⊙ H_W),
where ⊙ denotes element-wise multiplication.
2. The end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation according to claim 1, wherein the feature output at each stage of the first network model is:
X_j = CNN_j(X_{j-1}), j ∈ [1, 2, …, J],
where CNN_j denotes the three-dimensional convolution layer of the j-th stage, X_j denotes the output of the j-th stage, and J is the total number of stages.
3. The end-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation according to claim 1, wherein the specific process of inputting feature graphs of each stage to the second network model to obtain feature vectors of the same dimension of each stage is as follows:
Step 31: spatial feature fusion is carried out by using a plurality of airspace convolution layers which are connected in sequence so as to obtain features with consistent dimensions:
wherein phi j (DEG) represents a plurality of spatial convolution layers, In the j-th stage, the feature map with the time slice number k is shown,In the j-th stage, the feature vector with the sequence number k of the time slice after the spatial information fusion is represented;
Step 32: given frame level features The global maximization layer GP max is used to obtain efficient features and reduce spatial redundancy, while the gating loop unit GRU is used to refine frame-level features by integrating temporal information:
Wherein the method comprises the steps of In the j-th stage, the feature vector with the time slice sequence number k after the time domain information fusion is represented;
Step 33: obtaining the feature vector after the temporal and spatial information fusion at the stage by using uniform pooling
Wherein K is the total number of time slices;
Step 34: repeating the steps 31-33 for each stage feature to obtain quality feature vectors with the same dimension of each stage.
CN202110945647.1A 2021-08-16 2021-08-16 End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation Active CN113822856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110945647.1A CN113822856B (en) 2021-08-16 2021-08-16 End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation

Publications (2)

Publication Number Publication Date
CN113822856A CN113822856A (en) 2021-12-21
CN113822856B (en) 2024-06-21

Family

ID=78922891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110945647.1A Active CN113822856B (en) 2021-08-16 2021-08-16 End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation

Country Status (1)

Country Link
CN (1) CN113822856B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108235003A (en) * 2018-03-19 2018-06-29 天津大学 Three-dimensional video quality evaluation method based on 3D convolutional neural networks
CN112954312A (en) * 2021-02-07 2021-06-11 福州大学 No-reference video quality evaluation method fusing spatio-temporal characteristics

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100316131A1 (en) * 2009-06-12 2010-12-16 Motorola, Inc. Macroblock level no-reference objective quality estimation of video
CN106303507B (en) * 2015-06-05 2019-01-22 江苏惠纬讯信息科技有限公司 Video quality evaluation without reference method based on space-time united information
CN107959848B (en) * 2017-12-08 2019-12-03 天津大学 Universal no-reference video quality evaluation algorithms based on Three dimensional convolution neural network
CN110517237B (en) * 2019-08-20 2022-12-06 西安电子科技大学 No-reference video quality evaluation method based on expansion three-dimensional convolution neural network
CN110677639B (en) * 2019-09-30 2021-06-11 中国传媒大学 Non-reference video quality evaluation method based on feature fusion and recurrent neural network
CN112085102B (en) * 2020-09-10 2023-03-10 西安电子科技大学 No-reference video quality evaluation method based on three-dimensional space-time characteristic decomposition
CN113255786B (en) * 2021-05-31 2024-02-09 西安电子科技大学 Video quality evaluation method based on electroencephalogram signals and target salient characteristics

Also Published As

Publication number Publication date
CN113822856A (en) 2021-12-21

Similar Documents

Publication Publication Date Title
CN112004085B (en) Video coding method under guidance of scene semantic segmentation result
CN110060236B (en) Stereoscopic image quality evaluation method based on depth convolution neural network
CN110837842A (en) Video quality evaluation method, model training method and model training device
CN112291570B (en) Real-time video enhancement method based on lightweight deformable convolutional neural network
US20220222796A1 (en) Image processing method and apparatus, server, and storage medium
CN112653899A (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
CN110222592B (en) Construction method of time sequence behavior detection network model based on complementary time sequence behavior proposal generation
CN110674925B (en) No-reference VR video quality evaluation method based on 3D convolutional neural network
CN112507920B (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN106937116A (en) Low-complexity video coding method based on random training set adaptive learning
CN115063326B (en) Infrared night vision image efficient communication method based on image compression
CN113743269A (en) Method for identifying video human body posture in light weight mode
CN113658122A (en) Image quality evaluation method, device, storage medium and electronic equipment
CN110807369B (en) Short video content intelligent classification method based on deep learning and attention mechanism
CN114598864A (en) Full-reference ultrahigh-definition video quality objective evaluation method based on deep learning
CN116580184A (en) YOLOv 7-based lightweight model
CN113822954B (en) Deep learning image coding method for man-machine cooperative scene under resource constraint
CN113822856B (en) End-to-end non-reference video quality evaluation method based on hierarchical time-space domain feature representation
CN110555120A (en) picture compression control method and device, computer equipment and storage medium
CN113938254A (en) Attention mechanism-based layered source-channel joint coding transmission system and transmission method thereof
US20240062347A1 (en) Multi-scale fusion defogging method based on stacked hourglass network
CN116468625A (en) Single image defogging method and system based on pyramid efficient channel attention mechanism
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN114663307B (en) Integrated image denoising system based on uncertainty network
CN113255695A (en) Feature extraction method and system for target re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant