WO2023036045A1 - Model training method, video quality assessment method, apparatus, device, and medium - Google Patents

Model training method, video quality assessment method, apparatus, device, and medium

Info

Publication number
WO2023036045A1
WO2023036045A1, PCT/CN2022/116480, CN2022116480W
Authority
WO
WIPO (PCT)
Prior art keywords
video data
training
quality assessment
model
module
Prior art date
Application number
PCT/CN2022/116480
Other languages
English (en)
French (fr)
Inventor
陈俊江
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Priority to KR1020247009562A priority Critical patent/KR20240052000A/ko
Publication of WO2023036045A1 publication Critical patent/WO2023036045A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis

Definitions

  • This application relates to, but is not limited to, the technical field of image processing.
  • the present disclosure provides a model training method for video quality assessment, a video quality assessment method, a model training device for video quality assessment, a video quality assessment device, an electronic device, and a computer storage medium.
  • the present disclosure provides a model training method for video quality assessment, including: acquiring training video data, wherein the training video data includes reference video data and distorted video data; determining the mean opinion score (MOS) value of each piece of the training video data; and training a preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached, to obtain a final video quality assessment model.
  • the present disclosure provides a video quality assessment method, including: processing video data to be assessed with a final quality assessment model obtained by training according to any method described herein, to obtain a quality assessment score of the video data to be assessed.
  • the present disclosure provides a model training device for video quality assessment, including: an acquisition module configured to acquire training video data, wherein the training video data includes reference video data and distorted video data; a processing module configured to determine the mean opinion score (MOS) value of each piece of the training video data; and a training module configured to train a preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached, to obtain a final video quality assessment model.
  • the present disclosure provides a video quality assessment device, including: an assessment module configured to process video data to be assessed with the final quality assessment model obtained by training according to the aforementioned model training method for video quality assessment, to obtain a quality assessment score of the video data to be assessed.
  • the present disclosure provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement any model training method for video quality assessment described herein.
  • the present disclosure provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement any video quality assessment method described herein.
  • the present disclosure provides a computer storage medium on which a computer program is stored, wherein when the program is executed by a processor, any model training method for video quality assessment described herein is implemented.
  • the present disclosure provides a computer storage medium on which a computer program is stored, wherein when the program is executed by a processor, any video quality assessment method described herein is implemented.
  • Fig. 1 is a schematic flowchart of a model training method for video quality assessment provided by the present disclosure.
  • Fig. 2 is a schematic flowchart of training an initial video quality assessment model provided by the present disclosure.
  • Fig. 3 is a schematic diagram of a three-dimensional convolutional neural network provided by the present disclosure.
  • Fig. 4 is a schematic flowchart of a dense convolutional network provided by the present disclosure.
  • Fig. 5 is a schematic flowchart of an attention mechanism network provided by the present disclosure.
  • Fig. 6 is a schematic flowchart of a layered convolutional network provided by the present disclosure.
  • Fig. 7 is a schematic flowchart of an initial video quality assessment model provided by the present disclosure.
  • Fig. 8a is a schematic diagram of the 3D-PVQA method provided by the present disclosure.
  • Fig. 8b shows a screenshot of reference video data and a screenshot of distorted video data provided by the present disclosure.
  • Fig. 9 is a schematic flowchart of determining the mean opinion score (MOS) value of each piece of training video data provided by the present disclosure.
  • Fig. 10 is a schematic flowchart of a video quality assessment method provided by the present disclosure.
  • Fig. 11 is a block diagram of a model training device for video quality assessment provided by the present disclosure.
  • Fig. 12 is a block diagram of a video quality assessment device provided by the present disclosure.
  • Fig. 13 is a schematic diagram of an electronic device provided by the present disclosure.
  • Fig. 14 is a schematic diagram of a computer storage medium provided by the present disclosure.
  • Embodiments described herein may be described with reference to plan views and/or cross-sectional views by way of idealized schematic representations of the disclosure. Accordingly, the example illustrations may be modified according to manufacturing techniques and/or tolerances. Therefore, the embodiments are not limited to those shown in the drawings but include modifications of configurations formed based on manufacturing processes. Accordingly, the regions illustrated in the figures have schematic properties, and the shapes of the regions shown in the figures illustrate the specific shapes of the regions of the elements, but are not intended to be limiting.
  • This disclosure proposes presetting an initial video quality assessment model designed to fully extract features and accurately detect boundaries in an image, acquiring reference video data and distorted video data, and using the reference video data and distorted video data together with their MOS (Mean Opinion Score) values to train the initial video quality assessment model to obtain a final video quality assessment model, thereby improving the accuracy of video quality assessment.
  • the present disclosure provides a model training method for video quality assessment, which may include the following steps S11 to S13.
  • In step S11, training video data is acquired, wherein the training video data includes reference video data and distorted video data.
  • In step S12, the MOS value of each piece of training video data is determined.
  • In step S13, the preset initial video quality assessment model is trained according to the training video data and its MOS values until a convergence condition is reached, and a final video quality assessment model is obtained.
  • Reference video data can be regarded as standard video data.
  • Reference video data and distorted video data can be obtained from the open-source datasets LIVE, CSIQ, and IVP and the self-made dataset CENTER.
  • the MOS value is a numerical value used to characterize the quality of video data.
  • The video data in the open-source datasets LIVE, CSIQ, and IVP usually carries a corresponding MOS value, but the video data in the self-made dataset CENTER does not, so the MOS value of each piece of training video data still needs to be determined.
  • With the model training method for video quality assessment provided by the present disclosure, an initial video quality assessment model designed to fully extract image features and accurately detect boundaries in an image is preset, and training video data including reference video data and distorted video data is acquired. Training the initial video quality assessment model with both reference video data and distorted video data makes it possible to clearly distinguish distorted video data from undistorted video data (i.e., the reference video data), thereby ensuring the independence and diversity of the video data used to train the model.
  • The final video quality assessment model obtained by training the initial video quality assessment model can fully extract image features and accurately detect boundaries in an image, and it can be used directly to assess the quality of the video data to be assessed, which improves the accuracy of video quality assessment.
  • Generally speaking, once the network structure of a model is fixed, two things affect its final performance. One is the model's parameters, such as weights and biases; the other is the model's hyperparameters, such as the learning rate and the number of network layers. If the same training data is used to optimize both the parameters and the hyperparameters, the model is likely to overfit severely. Therefore, two independent datasets can be used to optimize the parameters and the hyperparameters of the initial video quality assessment model, respectively.
  • Accordingly, in some implementations, training the preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached (i.e., step S13) may include the following steps S131 and S132.
  • In step S131, a training set and a verification set are determined according to a preset ratio and the training video data, wherein the intersection of the training set and the verification set is an empty set.
  • In step S132, the parameters of the initial video quality assessment model are adjusted according to the training set and the MOS values of the video data in the training set, and the hyperparameters of the initial video quality assessment model are adjusted according to the verification set and the MOS values of the video data in the verification set, until the convergence condition is reached.
  • The present disclosure does not specifically limit the preset ratio; for example, the training video data may be divided into a training set and a verification set at a ratio of 6:4.
  • Of course, the preset ratio may also be another ratio such as 8:2 or 5:5.
  • To simply evaluate the generalization ability of the final video quality assessment model, the training video data may also be divided into a training set, a verification set, and a test set.
  • For example, the training video data may be divided into a training set, a verification set, and a test set at a ratio of 6:2:2, with the pairwise intersections of the training set, the verification set, and the test set being empty sets.
  • After the division, the training set and the verification set are used to train the initial video quality assessment model to obtain the final video quality assessment model, and the test set is used to evaluate the generalization ability of the final video quality assessment model.
  • Note that the more data the test set contains, the longer it takes to evaluate the generalization ability of the final model on it, while the more video data is used to train the initial video quality assessment model, the higher the accuracy of the final video quality assessment model.
  • To further improve the efficiency and accuracy of video quality assessment, the amount of training video data and the proportion of the training and verification sets within it can be increased appropriately; for example, the training video data can be divided into a training set, a verification set, and a test set at another ratio such as 10:1:1.
  • As can be seen from steps S131 and S132, a training set and a verification set with an empty intersection are determined according to the preset ratio and the training video data; the training set and the MOS values of its video data are used to adjust the parameters of the initial video quality assessment model, and the verification set and the MOS values of its video data are used to adjust the hyperparameters.
  • When the convergence condition is reached, a final video quality assessment model with high accuracy, able to fully extract image features and accurately detect boundaries in an image, is obtained, which improves the accuracy of video quality assessment.
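  • As a minimal sketch of the disjoint split described above (Python; the clip identifiers, the 6:2:2 ratio, and the fixed seed are illustrative assumptions, not requirements of the disclosure):

```python
import random

def split_dataset(clip_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the clip identifiers once and slice the shuffled list, so the
    resulting training, verification, and test sets are pairwise disjoint."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    ids = list(clip_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# 6:2:2 split of 100 clips; the intersections are empty by construction.
train, val, test = split_dataset(range(100))
assert not (set(train) & set(val)) and not (set(train) & set(test))
```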
  • Training the preset initial video quality assessment model on the training video data and its MOS values is a deep-learning-based training process: the MOS values of the training video data serve as the benchmark, and the model's outputs are driven ever closer to them.
  • When the gap between the assessment result output by the model and the MOS value is small, the model can be considered to meet the requirements of video quality assessment.
  • The convergence condition includes that the evaluation error rate of each piece of video data in the training set and the verification set does not exceed a preset threshold, where the evaluation error rate is calculated by the following formula: E = |S - MOS| / MOS, in which
  • E is the evaluation error rate of the current video data;
  • S is the assessment score of the current video data output by the initial quality assessment model after its parameters and hyperparameters are adjusted;
  • MOS is the MOS value of the current video data.
  • For any video data, the MOS value of the current video data has been determined in advance; after the current video data is input into the initial quality assessment model whose parameters and hyperparameters have been adjusted, the model outputs the assessment score S of the current video data, so the evaluation error rate E can be calculated. When the evaluation error rate of every piece of video data in the training set and the verification set does not exceed the preset threshold, the gap between the model's outputs and the MOS values is small, the model meets the requirements of video quality assessment, and training can stop.
  • The present disclosure does not specifically limit the preset threshold; for example, it may be 0.28, 0.26, 0.24, and so on.
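  • A minimal sketch of this convergence check (Python; the function names and the example threshold of 0.26 are illustrative assumptions):

```python
def evaluation_error_rate(score, mos):
    """Evaluation error rate of one clip: E = |S - MOS| / MOS."""
    return abs(score - mos) / mos

def has_converged(scores, mos_values, threshold=0.26):
    """True when every clip in the training and verification sets satisfies
    E <= threshold, i.e. the convergence condition described above holds."""
    return all(evaluation_error_rate(s, m) <= threshold
               for s, m in zip(scores, mos_values))
```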
  • the initial video quality assessment model preset in the present disclosure may include a three-dimensional convolutional neural network for extracting motion information, so as to improve the accuracy of video quality assessment.
  • the initial video quality assessment model includes a three-dimensional convolutional neural network for extracting motion information of image frames.
  • As shown in Fig. 3, a schematic diagram of the three-dimensional convolutional neural network provided by the present disclosure, the 3D convolutional neural network can stack multiple consecutive image frames into a cube and then apply a 3D convolution kernel within the cube. In this structure, each feature map in a convolutional layer (the right half of Fig. 3) is connected to multiple adjacent consecutive frames in the previous layer (the left half of Fig. 3), so motion information between consecutive image frames can be captured.
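  • A minimal PyTorch sketch of this idea (assuming PyTorch as the framework; the frame count, channel counts, and kernel size are illustrative): stacking consecutive frames yields a 5-D tensor over which a 3-D kernel convolves jointly in time and space.

```python
import torch
import torch.nn as nn

# Eight consecutive RGB frames stacked into a cube: (batch, channels, time, height, width).
frames = torch.randn(1, 3, 8, 112, 112)

# A 3x3x3 kernel slides over time as well as space, so every output voxel
# mixes information from adjacent frames, capturing motion information.
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
motion_features = conv3d(frames)  # shape: (1, 16, 8, 112, 112)
```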
  • Extracting only the motion information of image frames with a three-dimensional convolutional neural network is not enough for a complete assessment of video data. Accordingly, in some implementations, the initial video quality assessment model may also include an attention model, a data fusion processing module, a global pooling module, and a fully connected layer, wherein the attention model, the data fusion processing module, the three-dimensional convolutional neural network, the global pooling module, and the fully connected layer are cascaded in sequence.
  • the attention model includes a cascaded multi-input network, a two-dimensional convolutional module, a dense convolutional network, a downsampling processing module, a layered convolutional network, an upsampling processing module, and an attention mechanism network
  • the dense convolutional network includes at least two cascaded dense convolutional modules, and each dense convolutional module includes four cascaded densely connected convolutional layers.
  • As shown in Fig. 4, the dense convolutional network includes at least two cascaded dense convolutional modules; each dense convolutional module includes four cascaded densely connected convolutional layers, and the input of each densely connected convolutional layer is the fusion of the feature maps of all preceding densely connected layers in the current dense convolutional module.
  • The feature map after each pooling layer of the encoder passes through a dense convolutional module, and each pass through a dense convolutional module performs a BN (Batch Normalization) operation, a ReLU (Rectified Linear Unit) activation operation, and a convolution (Conv) operation.
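  • A hedged PyTorch sketch of one dense convolutional module as described (four densely connected BN-ReLU-Conv layers; the channel counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DenseModule(nn.Module):
    """Four cascaded densely connected convolutional layers: each layer takes
    the fused (concatenated) feature maps of all preceding layers in the
    module and applies BN -> ReLU -> Conv, as described above."""
    def __init__(self, in_channels, growth=32):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_channels
        for _ in range(4):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
            ))
            ch += growth  # the next layer sees all previous feature maps

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```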
  • the attention mechanism network includes a cascaded attention convolution module, a linear correction unit activation module, a nonlinear activation module, and an attention upsampling processing module.
  • As shown in Fig. 5, a schematic flowchart of the attention mechanism network provided by the present disclosure, the inputs of the attention mechanism network are a low-dimensional feature g_i and a high-dimensional feature x_l, where x_l is obtained by 2x upsampling of the output x_i of the layered convolutional network; one part of the output of the dense convolutional network is processed by the upsampling processing module and then fed into the layered convolutional network, which outputs x_i, while g_i is the other part of the output of the dense convolutional network. A 1x1 convolution is applied to g_i (W_g: Conv 1x1) and a 1x1 convolution is applied to x_l (W_x: Conv 1x1); the two convolution results are added element-wise and then passed through the linear rectification unit activation module (ReLU), a 1x1 convolution (psi: Conv 1x1), a nonlinear activation (Sigmoid), and an upsampling operation (Upsample), yielding the linear attention coefficient $\hat{q}$. Finally, $\hat{q}$ is multiplied element-wise with the low-dimensional feature g_i, retaining the relevant activations, to obtain the attention coefficient $\hat{\alpha}$.
  • The linear attention coefficient can be computed as
    $$\hat{q} = \mathrm{Upsample}\left(\mathrm{Sigmoid}\left(\psi^{T}\,\mathrm{ReLU}\left(W_{g}^{T} g_{i} + W_{x}^{T} x_{l}\right)\right)\right) \quad (1)$$
  • and the attention coefficient as
    $$\hat{\alpha} = \hat{q} \odot g_{i} \quad (2)$$
  • In formulas (1) and (2), $W_{g}$ is the 1x1 convolution applied to $g_{i}$, $W_{x}$ is the 1x1 convolution applied to $x_{l}$, $T$ is the matrix transposition symbol, and $\psi$ is the 1x1 convolution applied to the output of the linear rectification unit activation module.
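  • A hedged PyTorch sketch of the attention mechanism network following formulas (1) and (2) (the channel arguments are illustrative, and the two inputs are assumed to arrive at the same spatial resolution, since x_l has already been upsampled):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """1x1 convolutions W_g and W_x, element-wise addition, ReLU, a 1x1
    convolution psi, Sigmoid, and upsampling produce the linear attention
    coefficient, which then gates the low-dimensional feature g_i."""
    def __init__(self, g_channels, x_channels, inter_channels):
        super().__init__()
        self.w_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.w_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, g, x):
        q = self.psi(F.relu(self.w_g(g) + self.w_x(x)))  # formula (1), pre-activation
        q = torch.sigmoid(q)                              # nonlinear activation
        q = F.interpolate(q, size=g.shape[2:], mode='bilinear', align_corners=False)
        return g * q                                      # formula (2): element-wise gating
```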
  • The layered convolutional network includes a first layered network, a second layered network, a third layered network, and a fourth upsampling processing module. The first layered network includes a cascaded first downsampling processing module and first layered convolution module; the second layered network includes a cascaded second downsampling processing module, second layered convolution module, and second upsampling processing module; the third layered network includes a cascaded global pooling module, third layered convolution module, and third upsampling processing module. The first layered convolution module is also cascaded with the second downsampling processing module; the first layered convolution module and the second upsampling processing module are cascaded with the fourth upsampling processing module; and the fourth upsampling processing module and the third upsampling processing module are also cascaded with the third layered convolution module.
  • The first and second downsampling processing modules are both configured to downsample data, and the second, third, and fourth upsampling processing modules are all configured to upsample data.
  • As shown in Fig. 6, a schematic flowchart of the layered convolutional network provided by the present disclosure, data entering the layered convolutional network is fed in parallel into the first layered network, the second layered network, the third layered network, and the fourth upsampling processing module. The output of the first layered network and the output of the second layered network undergo data fusion and are then fed into the fourth upsampling processing module. Data entering the layered convolutional network is also fed into the global pooling module and then into the third layered convolution module; the output X_1 of the third layered convolution module is matrix-multiplied with the output P(X) of the fourth upsampling processing module, the product is matrix-added to the output X_2 of the third upsampling processing module, and the sum is fed into the third layered convolution module once more, yielding the output of the layered convolutional network, namely the high-dimensional feature x_i.
  • In some implementations, the first layered convolution module can perform a Conv 5x5 operation on the data (i.e., 5x5 convolution), the second layered convolution module a Conv 3x3 operation (i.e., 3x3 convolution), and the third layered convolution module a Conv 1x1 operation (i.e., 1x1 convolution). It should be understood that the same convolution module could also be used to perform the Conv 5x5, Conv 3x3, and Conv 1x1 operations respectively.
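  • A hedged PyTorch sketch of one plausible reading of the layered convolutional network in Fig. 6 (the channel counts, the use of max pooling for downsampling, and the additive fusion of the first two branches are simplifying assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayeredConvNet(nn.Module):
    """Pyramid of branches with 5x5, 3x3, and 1x1 convolutions plus global
    pooling, fused by upsampling, matrix multiplication, and addition."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.MaxPool2d(2)                                # downsampling modules
        self.conv5 = nn.Conv2d(ch, ch, kernel_size=5, padding=2)   # first layered conv
        self.conv3 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)   # second layered conv
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=1)              # third layered conv
        self.gpool = nn.AdaptiveAvgPool2d(1)                       # global pooling

    def forward(self, x):
        h, w = x.shape[2:]
        b1 = self.conv5(self.down(x))                              # first layered network
        b2 = F.interpolate(self.conv3(self.down(b1)), size=b1.shape[2:])
        p = F.interpolate(b1 + b2, size=(h, w))                    # fourth upsampling: P(X)
        x1 = self.conv1(self.gpool(x))                             # third layered conv: X1
        x2 = x1.expand(-1, -1, h, w)                               # third upsampling: X2
        return self.conv1(p * x1 + x2)                             # high-dimensional feature x_i
```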
  • As shown in Fig. 7, the initial video quality assessment model may include a multi-input network, a two-dimensional convolution module, a dense convolutional network, a downsampling processing module, a layered convolutional network, an upsampling processing module, an attention mechanism network, a data fusion processing module, a three-dimensional convolutional neural network, a global pooling module, and a fully connected layer.
  • The video quality assessment model provided by the present disclosure may be called the 3D-PVQA (3-Dimensions Pyramid Video Quality Assessment) model, and the corresponding method the 3D-PVQA method.
  • In the aforementioned step S132, each piece of video data in the training set and the verification set is split into distorted video data and residual video data and fed into the 3D-PVQA model separately, i.e., as the residual multi-input (Residual-Multi-Input) and the distorted multi-input (Distorted-Multi-Input).
  • The residual video data can be obtained with the residual-frames (Residual Frames) framework from the distorted video data and the reference video data.
  • The multi-input network turns the input into two sets of data: the first set is the original input data, and the second set is the original input data with its frame size halved.
  • Taking the distorted multi-input (Distorted-Multi-Input) in the lower half as an example, the multi-input network outputs two sets of data. The first set is processed by the two-dimensional convolution module, then by the dense convolutional network, and then by the downsampling processing module. The second set is processed by the two-dimensional convolution module, fused (concat) with the output of the downsampling processing module, and fed into the dense convolutional network again; at this point one part of the dense convolutional network's output is fed into the downsampling processing module once more, and the downsampled output is fed into the layered convolutional network.
  • The output of the layered convolutional network is fed into the attention mechanism network together with the other part of the dense convolutional network's output.
  • The data fusion processing module fuses the attention mechanism network's output for the residual video data with its output for the distorted video data. The fused output is fed into two three-dimensional convolutional neural networks, which output a frame-loss perceptibility threshold. This threshold is matrix-multiplied with the residual data frames produced by the residual-frames framework, and the product is finally processed by the global pooling module and the fully connected layer, which output the quality assessment score of the video data.
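  • A hedged end-to-end sketch of the 3D-PVQA forward flow just described (Python; the sub-module attribute names on `model` are illustrative assumptions, and only the order of operations follows the text):

```python
import torch

def assess_quality(model, distorted, reference):
    """Run one distorted/reference pair through the described pipeline and
    return the quality assessment score."""
    residual = reference - distorted                   # residual frames
    f_res = model.attention_model(residual)            # Residual-Multi-Input branch
    f_dis = model.attention_model(distorted)           # Distorted-Multi-Input branch
    fused = torch.cat([f_res, f_dis], dim=1)           # data fusion processing
    threshold = model.conv3d_b(model.conv3d_a(fused))  # two 3-D CNNs: frame-loss
                                                       # perceptibility threshold
    weighted = threshold * residual                    # weight the residual frames
    pooled = model.global_pool(weighted).flatten(1)    # global pooling
    return model.fc(pooled)                            # quality assessment score
```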
  • It should be understood that the same module can be reused: Fig. 6 shows two first layered convolution modules, two second layered convolution modules, and three third layered convolution modules, but this does not mean the layered convolutional network actually contains two first layered convolution modules, two second layered convolution modules, and three third layered convolution modules.
  • The downsampling processing module and the downsampling processing module inside the layered convolutional network may be the same downsampling processing module or different ones; likewise, the upsampling processing module, the upsampling processing module inside the layered convolutional network, and the attention upsampling processing module inside the attention mechanism network may be the same upsampling processing module or different ones.
  • As shown in Fig. 8a, the training video data can be divided into a training set, a verification set, and a test set at the preset ratio; feeding the training set into the 3D-PVQA model for training, the verification set for verification, and the test set for testing all yield the corresponding quality assessment scores.
  • As noted above, the test set can be used to evaluate the generalization ability of the final video quality assessment model. Fig. 8b shows a screenshot of the reference video data on the left and a screenshot of the distorted video data on the right, and Table 1 below lists the MOS values of the video data and the corresponding quality assessment scores output by the 3D-PVQA model.
  • As shown in Fig. 9, in some implementations, determining the mean opinion score (MOS) value of each piece of training video data (i.e., step S12) may include the following steps S121 to S124.
  • In step S121, the training video data is divided into groups, each group including one piece of reference video data and multiple pieces of distorted video data, where all video data in a group share the same resolution and the same frame rate.
  • In step S122, the distorted video data in each group is classified.
  • In step S123, the distorted video data of each class in each group is graded.
  • In step S124, the MOS value of each piece of training video data is determined according to its grouping, classification, and grading.
  • When classifying the distorted video data in each group, the data can be divided into categories of distortion such as packet-loss distortion and coding distortion; when grading the distorted video data of each class in each group, the data can be divided into three distortion levels: mild, moderate, and severe.
  • After grouping, classification, and grading, each group includes one piece of reference video data and multiple pieces of distorted video data, the distorted video data belong to different classes, and the distorted video data within each class belong to different distortion levels. Based on the reference video data in each group, the MOS value of each piece of training video data can then be determined using the SAMVIQ (Subjective Assessment Methodology for Video Quality) method together with the grouping, classification, and grading.
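  • By definition, the MOS of a clip is the mean of the opinion scores collected from viewers; a minimal sketch follows (the 0-100 rating scale and the example scores are illustrative, and the SAMVIQ viewing procedure itself is not reproduced here):

```python
from statistics import mean

def mos(opinion_scores):
    """MOS of one clip: the arithmetic mean of its subjective opinion scores."""
    return mean(opinion_scores)

# Illustrative ratings for one distorted clip, scored against its group's reference:
print(mos([72, 68, 75, 70, 69]))  # -> 70.8
```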
  • As shown in Fig. 10, the present disclosure also provides a video quality assessment method, which may include the following step S21.
  • In step S21, the video data to be assessed is processed by the final quality assessment model obtained by training according to the above-described model training method for video quality assessment, to obtain the quality assessment score of the video data to be assessed.
  • An initial video quality assessment model designed to fully extract image features and accurately detect boundaries in an image is preset, training video data including reference video data and distorted video data is acquired, and the initial video quality assessment model is trained with both reference video data and distorted video data to obtain the final video quality assessment model, which ensures the independence and diversity of the video data used to train the model.
  • The final video quality assessment model can then be used directly to assess the quality of the video data to be assessed, which improves the accuracy of video quality assessment.
  • the present disclosure also provides a model training device for video quality assessment, which may include: an acquisition module 101 , a processing module 102 , and a training module 103 .
  • the acquiring module 101 is configured to acquire training video data; wherein, the training video data includes reference video data and distorted video data.
  • the processing module 102 is configured to determine the MOS value of each training video data.
  • the training module 103 is configured to train a preset initial video quality assessment model according to the training video data and its MOS value until a convergence condition is reached, so as to obtain a final video quality assessment model.
  • the training module 103 is configured to: determine a training set and a verification set according to a preset ratio and the training video data, wherein the intersection of the training set and the verification set is an empty set; and adjust the parameters of the initial video quality assessment model according to the training set and the MOS values of its video data, and adjust the hyperparameters of the initial video quality assessment model according to the verification set and the MOS values of its video data, until the convergence condition is reached.
  • the convergence condition includes that the evaluation error rate of each piece of video data in the training set and the verification set does not exceed a preset threshold, where the evaluation error rate is calculated by the following formula: E = |S - MOS| / MOS, in which
  • E is the evaluation error rate of the current video data;
  • S is the assessment score of the current video data output by the initial quality assessment model after its parameters and hyperparameters are adjusted;
  • MOS is the MOS value of the current video data.
  • the initial video quality assessment model includes a three-dimensional convolutional neural network for extracting motion information of image frames.
  • the initial video quality assessment model also includes an attention model, a data fusion processing module, a global pooling module, and a fully connected layer, and the attention model, the data fusion processing module, the three-dimensional convolutional neural network, the global pooling module, and the fully connected layer are cascaded in sequence.
  • the attention model includes a cascaded multi-input network, a two-dimensional convolutional module, a dense convolutional network, a downsampling processing module, a layered convolutional network, an upsampling processing module, and an attention mechanism network
  • the dense convolutional network includes at least two cascaded dense convolutional modules, and each dense convolutional module includes four cascaded densely connected convolutional layers.
  • the attention mechanism network includes a cascaded attention convolution module, a linear correction unit activation module, a nonlinear activation module, and an attention upsampling processing module.
  • the layered convolutional network includes a first layered network, a second layered network, a third layered network, and a fourth upsampling processing module; the first layered network includes a cascaded first downsampling processing module and first layered convolution module; the second layered network includes a cascaded second downsampling processing module, second layered convolution module, and second upsampling processing module; the third layered network includes a cascaded global pooling module, third layered convolution module, and third upsampling processing module; the first layered convolution module is also cascaded with the second downsampling processing module; the first layered convolution module and the second upsampling processing module are cascaded with the fourth upsampling processing module; and the fourth upsampling processing module and the third upsampling processing module are also cascaded with the third layered convolution module.
  • the processing module 102 is configured to: group the training video data, each group including one piece of reference video data and multiple pieces of distorted video data, where all video data in a group share the same resolution and the same frame rate; classify the video data in each group; grade the video data of each class in each group; and determine the MOS value of each piece of training video data according to its grouping, classification, and grading.
  • the present disclosure also provides a video quality assessment device, including: an assessment module 201 configured to process the video data to be assessed with the final quality assessment model obtained by training according to the aforementioned model training method for video quality assessment, to obtain the quality assessment score of the video data to be assessed.
  • the embodiments of the present disclosure also provide an electronic device, including: one or more processors 301; and a storage device 302 on which one or more programs are stored; when the one or more programs are executed by the one or more processors 301, the one or more processors 301 are caused to implement at least one of the following methods: the model training method for video quality assessment provided by the foregoing embodiments; the video quality assessment method provided by the foregoing embodiments.
  • the embodiments of the present disclosure also provide a computer storage medium on which a computer program is stored, wherein, when the program is executed by a processor, at least one of the following methods is implemented: the model training method for video quality assessment provided by the foregoing embodiments; the video quality assessment method provided by the foregoing embodiments.
  • the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation.
  • Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit.
  • Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
  • The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
  • Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be construed in a generic and descriptive sense only and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless explicitly stated otherwise, features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics, and/or elements described in connection with other embodiments. Accordingly, those of ordinary skill in the art will understand that various changes in form and detail may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)

Abstract

The present disclosure provides a model training method for video quality assessment, including: acquiring training video data, wherein the training video data includes reference video data and distorted video data; determining the mean opinion score (MOS) value of each piece of the training video data; and training a preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached, to obtain a final video quality assessment model. The present disclosure also provides a video quality assessment method, apparatus, device, and medium.

Description

Model training method, video quality assessment method, apparatus, device, and medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202111055446.0, filed with the China Patent Office on September 9, 2021, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to, but is not limited to, the technical field of image processing.
Background
With the arrival of the 5G era, video applications such as live streaming, short video, and video calls are increasingly widespread. In an Internet era in which everything turns to video, ever-growing data traffic poses a severe challenge to the stability of video service systems. Correctly assessing video quality has become a key bottleneck constraining the development of related technologies; video quality assessment may well be the most fundamental and most important problem in the audio and video field, and it urgently needs to be solved.
Summary
The present disclosure provides a model training method for video quality assessment, a video quality assessment method, a model training device for video quality assessment, a video quality assessment device, an electronic device, and a computer storage medium.
In a first aspect, the present disclosure provides a model training method for video quality assessment, including: acquiring training video data, wherein the training video data includes reference video data and distorted video data; determining the mean opinion score (MOS) value of each piece of the training video data; and training a preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached, to obtain a final video quality assessment model.
In another aspect, the present disclosure provides a video quality assessment method, including: processing video data to be assessed with a final quality assessment model obtained by training according to any method described herein, to obtain a quality assessment score of the video data to be assessed.
In another aspect, the present disclosure provides a model training device for video quality assessment, including: an acquisition module configured to acquire training video data, wherein the training video data includes reference video data and distorted video data; a processing module configured to determine the mean opinion score (MOS) value of each piece of the training video data; and a training module configured to train a preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached, to obtain a final video quality assessment model.
In another aspect, the present disclosure provides a video quality assessment device, including: an assessment module configured to process video data to be assessed with the final quality assessment model obtained by training according to the aforementioned model training method for video quality assessment, to obtain a quality assessment score of the video data to be assessed.
In another aspect, the present disclosure provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement any model training method for video quality assessment described herein.
In another aspect, the present disclosure provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement any video quality assessment method described herein.
In another aspect, the present disclosure provides a computer storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements any model training method for video quality assessment described herein.
In another aspect, the present disclosure provides a computer storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements any video quality assessment method described herein.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a model training method for video quality assessment provided by the present disclosure;
Fig. 2 is a schematic flowchart of training an initial video quality assessment model provided by the present disclosure;
Fig. 3 is a schematic diagram of a three-dimensional convolutional neural network provided by the present disclosure;
Fig. 4 is a schematic flowchart of a dense convolutional network provided by the present disclosure;
Fig. 5 is a schematic flowchart of an attention mechanism network provided by the present disclosure;
Fig. 6 is a schematic flowchart of a layered convolutional network provided by the present disclosure;
Fig. 7 is a schematic flowchart of an initial video quality assessment model provided by the present disclosure;
Fig. 8a is a schematic diagram of the 3D-PVQA method provided by the present disclosure;
Fig. 8b shows a screenshot of reference video data and a screenshot of distorted video data provided by the present disclosure;
Fig. 9 is a schematic flowchart of determining the mean opinion score (MOS) value of each piece of training video data provided by the present disclosure;
Fig. 10 is a schematic flowchart of a video quality assessment method provided by the present disclosure;
Fig. 11 is a block diagram of a model training device for video quality assessment provided by the present disclosure;
Fig. 12 is a block diagram of a video quality assessment device provided by the present disclosure;
Fig. 13 is a schematic diagram of an electronic device provided by the present disclosure;
Fig. 14 is a schematic diagram of a computer storage medium provided by the present disclosure.
Detailed description
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but they may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used herein, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that when the terms "comprising" and/or "made of" are used in this specification, they specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The embodiments described herein may be described with reference to plan views and/or cross-sectional views by way of idealized schematic illustrations of the present disclosure. Accordingly, the example illustrations may be modified in accordance with manufacturing techniques and/or tolerances, and the embodiments are not limited to those shown in the drawings but include modifications of configurations formed based on manufacturing processes. The regions illustrated in the drawings thus have schematic properties, and the shapes of the regions shown in the drawings illustrate the specific shapes of the regions of elements but are not intended to be limiting.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
At present, video quality assessment methods in the industry fall into two broad categories: subjective video quality assessment and objective video quality assessment. Subjective methods have observers judge video quality directly; they are relatively accurate but complex, their results are easily affected by many factors, and they cannot be applied directly in industrial settings. In practice, therefore, objective methods based on artificial intelligence, which are easy to implement, are typically used. However, the schemes currently built on these techniques, such as PSNR (Peak Signal to Noise Ratio), SSIM (Structural Similarity Index Measurement), and VMAF (Video Multi-Method Assessment Fusion), do not perform well enough, so assessing video quality more accurately remains a pressing problem.
Commonly used video quality assessment schemes such as PSNR, SSIM, and VMAF suffer from incomplete feature extraction and indistinct boundary discrimination, so their final performance is not good enough. The present disclosure proposes presetting an initial video quality assessment model designed to fully extract features and accurately detect boundaries in an image, acquiring reference video data and distorted video data, and using the reference video data and distorted video data together with their MOS (Mean Opinion Score) values to train the initial video quality assessment model to obtain a final video quality assessment model, thereby improving the accuracy of video quality assessment.
As shown in Fig. 1, the present disclosure provides a model training method for video quality assessment, which may include the following steps S11 to S13.
In step S11, training video data is acquired, wherein the training video data includes reference video data and distorted video data.
In step S12, the MOS value of each piece of training video data is determined.
In step S13, a preset initial video quality assessment model is trained according to the training video data and its MOS values until a convergence condition is reached, and a final video quality assessment model is obtained.
Reference video data can be regarded as standard video data; reference video data and distorted video data can be obtained from the open-source datasets LIVE, CSIQ, and IVP and the self-made dataset CENTER. The MOS value is a numerical value characterizing the quality of video data. The video data in the open-source datasets LIVE, CSIQ, and IVP usually carries a corresponding MOS value, but the video data in the self-made dataset CENTER does not, so the MOS value of each piece of training video data still needs to be determined: the MOS values of the training video data from the open-source datasets can be obtained directly, and corresponding MOS values can be generated for the training video data from the self-made dataset CENTER; of course, corresponding MOS values can also be generated directly for all the training video data. During training of the initial video quality assessment model, when the convergence condition is reached the model can be considered to meet the requirements of video quality assessment; training then stops and the final video quality assessment model is obtained.
As can be seen from steps S11 to S13, with the model training method for video quality assessment provided by the present disclosure, an initial video quality assessment model designed to fully extract image features and accurately detect boundaries in an image is preset, training video data including reference video data and distorted video data is acquired, and the initial video quality assessment model is trained with both reference video data and distorted video data to obtain the final video quality assessment model. Distorted video data and undistorted video data (i.e., reference video data) can thus be clearly distinguished, ensuring the independence and diversity of the video data used to train the model. The final video quality assessment model obtained by training can fully extract image features and accurately detect boundaries in an image, and it can be used directly to assess the quality of video data to be assessed, which improves the accuracy of video quality assessment.
Generally speaking, once the network structure of a model is fixed, two things affect its final performance. One is the model's parameters, such as weights and biases; the other is its hyperparameters, such as the learning rate and the number of network layers. If the same training data is used to optimize both the parameters and the hyperparameters, the model is likely to overfit severely. Therefore, two independent datasets can be used to optimize the parameters and the hyperparameters of the initial video quality assessment model, respectively.
Accordingly, as shown in Fig. 2, in some implementations, training the preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached (i.e., step S13) may include the following steps S131 and S132.
In step S131, a training set and a verification set are determined according to a preset ratio and the training video data, wherein the intersection of the training set and the verification set is an empty set.
In step S132, the parameters of the initial video quality assessment model are adjusted according to the training set and the MOS values of the video data in the training set, and the hyperparameters of the initial video quality assessment model are adjusted according to the verification set and the MOS values of the video data in the verification set, until the convergence condition is reached.
The present disclosure does not specifically limit the preset ratio; for example, the training data may be divided into a training set and a verification set at a ratio of 6:4, or at another ratio such as 8:2 or 5:5. To simply evaluate the generalization ability of the final video quality assessment model, the training video data may also be divided into a training set, a verification set, and a test set, for example at a ratio of 6:2:2, with the pairwise intersections of the three sets being empty. After the division, the training set and the verification set are used to train the initial video quality assessment model to obtain the final video quality assessment model, and the test set is used to evaluate the generalization ability of the final model. Note that the more data the test set contains, the longer the generalization evaluation takes, while the more video data is used for training, the higher the accuracy of the final model. To further improve the efficiency and accuracy of video quality assessment, the amount of training video data and the proportion of the training and verification sets within it can be increased appropriately; for example, the training video data can be divided into a training set, a verification set, and a test set at another ratio such as 10:1:1.
As can be seen from steps S131 and S132, with the model training method for video quality assessment provided by the present disclosure, a training set and a verification set with an empty intersection are determined according to the preset ratio and the training video data; the training set and the MOS values of its video data are used to adjust the parameters of the initial video quality assessment model, and the verification set and the MOS values of its video data are used to adjust its hyperparameters. When the convergence condition is reached, a final video quality assessment model with high accuracy, able to fully extract image features and accurately detect boundaries in an image, is obtained, which improves the accuracy of video quality assessment.
Training the preset initial video quality assessment model on the training video data and its MOS values is a deep-learning-based training process: the MOS values of the training video data serve as the benchmark, and the model's outputs are driven ever closer to them. When the gap between the assessment result output by the model and the MOS value is small, the model can be considered to meet the requirements of video quality assessment.
Accordingly, the convergence condition includes that the evaluation error rate of each piece of video data in the training set and the verification set does not exceed a preset threshold, where the evaluation error rate is calculated by the following formula:
E = |S - MOS| / MOS, where
E is the evaluation error rate of the current video data;
S is the assessment score of the current video data output by the initial quality assessment model after its parameters and hyperparameters are adjusted;
MOS is the MOS value of the current video data.
For any video data, its MOS value has been determined in advance; after the current video data is input into the initial quality assessment model whose parameters and hyperparameters have been adjusted, the model outputs the assessment score S, so the evaluation error rate E can be calculated. When the evaluation error rate of every piece of video data in the training set and the verification set does not exceed the preset threshold, the gap between the model's outputs and the MOS values is small and the model meets the requirements of video quality assessment; training can then stop. Note that the present disclosure does not specifically limit the preset threshold; for example, it may be 0.28, 0.26, 0.24, and so on.
Commonly used video quality assessment schemes such as PSNR, SSIM, and VMAF also suffer from loss of motion information, so their final performance is not good enough. The initial video quality assessment model preset by the present disclosure may therefore include a three-dimensional convolutional neural network for extracting motion information, improving the accuracy of video quality assessment. Accordingly, in some implementations, the initial video quality assessment model includes a three-dimensional convolutional neural network for extracting the motion information of image frames.
Fig. 3 is a schematic diagram of the three-dimensional convolutional neural network provided by the present disclosure. The 3D convolutional neural network can stack multiple consecutive image frames into a cube and then apply a 3D convolution kernel within the cube. In this structure, each feature map in a convolutional layer (the right half of Fig. 3) is connected to multiple adjacent consecutive frames in the previous layer (the left half of Fig. 3), so motion information between consecutive image frames can be captured.
Extracting only the motion information of image frames with a three-dimensional convolutional neural network is not enough for a complete assessment of video data. Accordingly, in some implementations, the initial video quality assessment model may also include an attention model, a data fusion processing module, a global pooling module, and a fully connected layer, wherein the attention model, the data fusion processing module, the three-dimensional convolutional neural network, the global pooling module, and the fully connected layer are cascaded in sequence.
In some implementations, the attention model includes a cascaded multi-input network, two-dimensional convolution module, dense convolutional network, downsampling processing module, layered convolutional network, upsampling processing module, and attention mechanism network; the dense convolutional network includes at least two cascaded dense convolutional modules, and each dense convolutional module includes four cascaded densely connected convolutional layers.
Fig. 4 is a schematic diagram of the dense convolutional network provided by the present disclosure. The dense convolutional network includes at least two cascaded dense convolutional modules; each dense convolutional module includes four cascaded densely connected convolutional layers, and the input of each densely connected convolutional layer is the fusion of the feature maps of all preceding densely connected layers in the current dense convolutional module. The feature map after each pooling layer of the encoder passes through a dense convolutional module, and each pass through a dense convolutional module performs a BN (Batch Normalization) operation, a ReLU (Rectified Linear Unit) activation operation, and a convolution (Conv) operation.
In some implementations, the attention mechanism network includes a cascaded attention convolution module, linear rectification unit activation module, nonlinear activation module, and attention upsampling processing module.
Fig. 5 is a schematic flowchart of the attention mechanism network provided by the present disclosure. The inputs of the attention mechanism network are a low-dimensional feature g_i and a high-dimensional feature x_l, where x_l is obtained by 2x upsampling of the output x_i of the layered convolutional network; one part of the output of the dense convolutional network is processed by the upsampling processing module and then fed into the layered convolutional network, which outputs x_i, while g_i is the other part of the output of the dense convolutional network. A 1x1 convolution is applied to g_i (W_g: Conv 1x1) and a 1x1 convolution is applied to x_l (W_x: Conv 1x1); the two convolution results are added element-wise and then passed through the linear rectification unit activation module (ReLU), a 1x1 convolution (psi: Conv 1x1), a nonlinear activation (Sigmoid), and an upsampling operation (Upsample), yielding the linear attention coefficient $\hat{q}$. Finally, $\hat{q}$ is multiplied element-wise with the low-dimensional feature g_i, retaining the relevant activations, to obtain the attention coefficient $\hat{\alpha}$.
The linear attention coefficient $\hat{q}$ can be computed by the following formula:
$$\hat{q} = \mathrm{Upsample}\left(\mathrm{Sigmoid}\left(\psi^{T}\,\mathrm{ReLU}\left(W_{g}^{T} g_{i} + W_{x}^{T} x_{l}\right)\right)\right) \quad (1)$$
The attention coefficient $\hat{\alpha}$ can be computed by the following formula:
$$\hat{\alpha} = \hat{q} \odot g_{i} \quad (2)$$
In formulas (1) and (2), $W_{g}$ is the 1x1 convolution applied to $g_{i}$, $W_{x}$ is the 1x1 convolution applied to $x_{l}$, $T$ is the matrix transposition symbol, and $\psi$ is the 1x1 convolution applied to the output of the linear rectification unit activation module; these quantities are obtained when the linear rectification unit is activated.
In some implementations, the layered convolutional network includes a first layered network, a second layered network, a third layered network, and a fourth upsampling processing module. The first layered network includes a cascaded first downsampling processing module and first layered convolution module; the second layered network includes a cascaded second downsampling processing module, second layered convolution module, and second upsampling processing module; the third layered network includes a cascaded global pooling module, third layered convolution module, and third upsampling processing module. The first layered convolution module is also cascaded with the second downsampling processing module; the first layered convolution module and the second upsampling processing module are cascaded with the fourth upsampling processing module; and the fourth upsampling processing module and the third upsampling processing module are also cascaded with the third layered convolution module.
The first and second downsampling processing modules are both configured to downsample data, and the second, third, and fourth upsampling processing modules are all configured to upsample data.
Fig. 6 is a schematic flowchart of the layered convolutional network provided by the present disclosure. Data entering the layered convolutional network is fed in parallel into the first layered network, the second layered network, the third layered network, and the fourth upsampling processing module. The output of the first layered network and the output of the second layered network undergo data fusion and are then fed into the fourth upsampling processing module. Data entering the layered convolutional network is also fed into the global pooling module and then into the third layered convolution module; the output X_1 of the third layered convolution module is matrix-multiplied with the output P(X) of the fourth upsampling processing module, the product is matrix-added to the output X_2 of the third upsampling processing module, and the sum is fed into the third layered convolution module once more, yielding the output of the layered convolutional network, namely the high-dimensional feature x_i.
In some implementations, the first layered convolution module can perform a Conv 5x5 operation on the data (i.e., 5x5 convolution), the second layered convolution module a Conv 3x3 operation (i.e., 3x3 convolution), and the third layered convolution module a Conv 1x1 operation (i.e., 1x1 convolution). It should be understood that the same convolution module could also be used to perform the Conv 5x5, Conv 3x3, and Conv 1x1 operations respectively.
Fig. 7 is a schematic flowchart of the initial video quality assessment model provided by the present disclosure. The initial video quality assessment model may include a multi-input network, a two-dimensional convolution module, a dense convolutional network, a downsampling processing module, a layered convolutional network, an upsampling processing module, an attention mechanism network, a data fusion processing module, a three-dimensional convolutional neural network, a global pooling module, and a fully connected layer.
The video quality assessment model provided by the present disclosure may be called the 3D-PVQA (3-Dimensions Pyramid Video Quality Assessment) model, and the corresponding method the 3D-PVQA method. In the aforementioned step S132, each piece of video data in the training set and the verification set is split into distorted video data and residual video data and fed into the 3D-PVQA model separately, i.e., as the residual multi-input (Residual-Multi-Input) and the distorted multi-input (Distorted-Multi-Input). The residual video data can be obtained with the residual-frames (Residual Frames) framework from the distorted video data and the reference video data. The multi-input network turns the input into two sets of data: the first set is the original input data, and the second set is the original input data with its frame size halved.
Taking the distorted multi-input (Distorted-Multi-Input) in the lower half as an example, the multi-input network outputs two sets of data. The first set is processed by the two-dimensional convolution module, then by the dense convolutional network, and then by the downsampling processing module. The second set is processed by the two-dimensional convolution module, fused (concat) with the output of the downsampling processing module, and fed into the dense convolutional network again; at this point one part of the dense convolutional network's output is fed into the downsampling processing module once more, and the downsampled output is fed into the layered convolutional network. The output of the layered convolutional network is fed into the attention mechanism network together with the other part of the dense convolutional network's output. The data fusion processing module fuses the attention mechanism network's output for the residual video data with its output for the distorted video data; the fused output is fed into two three-dimensional convolutional neural networks, which output a frame-loss perceptibility threshold. This threshold is matrix-multiplied with the residual data frames produced by the residual-frames framework, and the product is finally processed by the global pooling module and the fully connected layer, which output the quality assessment score of the video data.
It should be understood that the same module can be reused: Fig. 6 shows two first layered convolution modules, two second layered convolution modules, and three third layered convolution modules, but this does not mean the layered convolutional network actually contains that many distinct modules. The downsampling processing module and the downsampling processing module inside the layered convolutional network may be the same module or different modules; likewise, the upsampling processing module, the upsampling processing module inside the layered convolutional network, and the attention upsampling processing module inside the attention mechanism network may be the same upsampling processing module or different ones.
As shown in Fig. 8a, the training video data can be divided into a training set, a verification set, and a test set at the preset ratio; feeding the training set into the 3D-PVQA model for training, the verification set for verification, and the test set for testing all yield the corresponding quality assessment scores. As noted above, the test set can be used to evaluate the generalization ability of the final video quality assessment model. Fig. 8b shows a screenshot of the reference video data on the left and a screenshot of the distorted video data on the right, and Table 1 below lists the MOS values of the video data and the corresponding quality assessment scores output by the 3D-PVQA model.
Table 1: MOS values of the video data and the quality assessment scores output by the 3D-PVQA model (rendered as an image in the original document).
As shown in Fig. 9, in some implementations, determining the mean opinion score (MOS) value of each piece of training video data (i.e., step S12) may include the following steps S121 to S124.
In step S121, the training video data is divided into groups, each group including one piece of reference video data and multiple pieces of distorted video data, where all video data in a group share the same resolution and the same frame rate.
In step S122, the distorted video data in each group is classified.
In step S123, the distorted video data of each class in each group is graded.
In step S124, the MOS value of each piece of training video data is determined according to its grouping, classification, and grading.
When classifying the distorted video data in each group, the data can be divided into categories of distortion such as packet-loss distortion and coding distortion; when grading the distorted video data of each class in each group, the data can be divided into three distortion levels: mild, moderate, and severe.
After grouping, classification, and grading, each group includes one piece of reference video data and multiple pieces of distorted video data, the distorted video data belong to different classes, and the distorted video data within each class belong to different distortion levels. Based on the reference video data in each group, the MOS value of each piece of training video data can then be determined using the SAMVIQ (Subjective Assessment Methodology for Video Quality) method together with the grouping, classification, and grading.
As shown in Fig. 10, the present disclosure also provides a video quality assessment method, which may include the following step S21.
In step S21, the video data to be assessed is processed by the final quality assessment model obtained by training according to the above-described model training method for video quality assessment, to obtain the quality assessment score of the video data to be assessed.
An initial video quality assessment model designed to fully extract image features and accurately detect boundaries in an image is preset, training video data including reference video data and distorted video data is acquired, and the initial model is trained with both reference video data and distorted video data to obtain the final video quality assessment model, which ensures the independence and diversity of the video data used to train the model. The final model can then be used directly to assess the quality of the video data to be assessed, which improves the accuracy of video quality assessment.
Based on the same technical concept, as shown in Fig. 11, the present disclosure also provides a model training device for video quality assessment, which may include an acquisition module 101, a processing module 102, and a training module 103.
The acquisition module 101 is configured to acquire training video data, wherein the training video data includes reference video data and distorted video data.
The processing module 102 is configured to determine the MOS value of each piece of training video data.
The training module 103 is configured to train a preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached, to obtain a final video quality assessment model.
In some implementations, the training module 103 is configured to: determine a training set and a verification set according to a preset ratio and the training video data, wherein the intersection of the training set and the verification set is an empty set; and adjust the parameters of the initial video quality assessment model according to the training set and the MOS values of its video data, and adjust the hyperparameters of the initial video quality assessment model according to the verification set and the MOS values of its video data, until the convergence condition is reached.
In some implementations, the convergence condition includes that the evaluation error rate of each piece of video data in the training set and the verification set does not exceed a preset threshold, where the evaluation error rate is calculated by the following formula:
E = |S - MOS| / MOS, where
E is the evaluation error rate of the current video data;
S is the assessment score of the current video data output by the initial quality assessment model after its parameters and hyperparameters are adjusted;
MOS is the MOS value of the current video data.
In some implementations, the initial video quality assessment model includes a three-dimensional convolutional neural network for extracting the motion information of image frames.
In some implementations, the initial video quality assessment model also includes an attention model, a data fusion processing module, a global pooling module, and a fully connected layer, with the attention model, the data fusion processing module, the three-dimensional convolutional neural network, the global pooling module, and the fully connected layer cascaded in sequence.
In some implementations, the attention model includes a cascaded multi-input network, two-dimensional convolution module, dense convolutional network, downsampling processing module, layered convolutional network, upsampling processing module, and attention mechanism network; the dense convolutional network includes at least two cascaded dense convolutional modules, and each dense convolutional module includes four cascaded densely connected convolutional layers.
In some implementations, the attention mechanism network includes a cascaded attention convolution module, linear rectification unit activation module, nonlinear activation module, and attention upsampling processing module.
In some implementations, the layered convolutional network includes a first layered network, a second layered network, a third layered network, and a fourth upsampling processing module; the first layered network includes a cascaded first downsampling processing module and first layered convolution module; the second layered network includes a cascaded second downsampling processing module, second layered convolution module, and second upsampling processing module; the third layered network includes a cascaded global pooling module, third layered convolution module, and third upsampling processing module; the first layered convolution module is also cascaded with the second downsampling processing module; the first layered convolution module and the second upsampling processing module are cascaded with the fourth upsampling processing module; and the fourth upsampling processing module and the third upsampling processing module are also cascaded with the third layered convolution module.
In some implementations, the processing module 102 is configured to: group the training video data, each group including one piece of reference video data and multiple pieces of distorted video data, where all video data in a group share the same resolution and the same frame rate; classify the video data in each group; grade the video data of each class in each group; and determine the MOS value of each piece of training video data according to its grouping, classification, and grading.
Based on the same technical concept, as shown in Fig. 12, the present disclosure also provides a video quality assessment device, including: an assessment module 201 configured to process the video data to be assessed with the final quality assessment model obtained by training according to the above-described model training method for video quality assessment, to obtain the quality assessment score of the video data to be assessed.
In addition, as shown in Fig. 13, embodiments of the present disclosure also provide an electronic device, including: one or more processors 301; and a storage device 302 on which one or more programs are stored; when the one or more programs are executed by the one or more processors 301, the one or more processors 301 are caused to implement at least one of the following methods: the model training method for video quality assessment provided by the foregoing embodiments; the video quality assessment method provided by the foregoing embodiments.
In addition, as shown in Fig. 14, embodiments of the present disclosure also provide a computer storage medium on which a computer program is stored, wherein, when the program is executed by a processor, at least one of the following methods is implemented: the model training method for video quality assessment provided by the foregoing embodiments; the video quality assessment method provided by the foregoing embodiments.
Those of ordinary skill in the art will understand that all or some of the steps in the methods disclosed above and the functional modules/units in the devices may be implemented as software, firmware, hardware, or appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be construed in a generic and descriptive sense only and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless explicitly stated otherwise, features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics, and/or elements described in connection with other embodiments. Accordingly, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Claims (16)

  1. A model training method for video quality assessment, comprising:
    acquiring training video data, wherein the training video data comprises reference video data and distorted video data;
    determining a mean opinion score (MOS) value of each piece of the training video data; and
    training a preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached, to obtain a final video quality assessment model.
  2. The method according to claim 1, wherein training the preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached comprises:
    determining a training set and a verification set according to a preset ratio and the training video data, wherein the intersection of the training set and the verification set is an empty set; and
    adjusting parameters of the initial video quality assessment model according to the training set and the MOS values of the video data in the training set, and adjusting hyperparameters of the initial video quality assessment model according to the verification set and the MOS values of the video data in the verification set, until the convergence condition is reached.
  3. The method according to claim 2, wherein the convergence condition comprises that an evaluation error rate of each piece of video data in the training set and the verification set does not exceed a preset threshold, the evaluation error rate being calculated by the following formula:
    E = |S - MOS| / MOS, where
    E is the evaluation error rate of the current video data;
    S is the assessment score of the current video data output by the initial quality assessment model after its parameters and hyperparameters are adjusted; and
    MOS is the MOS value of the current video data.
  4. The method according to any one of claims 1 to 3, wherein the initial video quality assessment model comprises a three-dimensional convolutional neural network for extracting motion information of image frames.
  5. The method according to claim 4, wherein the initial video quality assessment model further comprises an attention model, a data fusion processing module, a global pooling module, and a fully connected layer, the attention model, the data fusion processing module, the three-dimensional convolutional neural network, the global pooling module, and the fully connected layer being cascaded in sequence.
  6. The method according to claim 5, wherein the attention model comprises a cascaded multi-input network, two-dimensional convolution module, dense convolutional network, downsampling processing module, layered convolutional network, upsampling processing module, and attention mechanism network, the dense convolutional network comprising at least two cascaded dense convolutional modules, and each dense convolutional module comprising four cascaded densely connected convolutional layers.
  7. The method according to claim 6, wherein the attention mechanism network comprises a cascaded attention convolution module, linear rectification unit activation module, nonlinear activation module, and attention upsampling processing module.
  8. The method according to claim 5, wherein the layered convolutional network comprises a first layered network, a second layered network, a third layered network, and a fourth upsampling processing module; the first layered network comprises a cascaded first downsampling processing module and first layered convolution module; the second layered network comprises a cascaded second downsampling processing module, second layered convolution module, and second upsampling processing module; the third layered network comprises a cascaded global pooling module, third layered convolution module, and third upsampling processing module; the first layered convolution module is further cascaded with the second downsampling processing module; the first layered convolution module and the second upsampling processing module are cascaded with the fourth upsampling processing module; and the fourth upsampling processing module and the third upsampling processing module are further cascaded with the third layered convolution module.
  9. The method according to any one of claims 1 to 3, wherein determining the mean opinion score (MOS) value of each piece of the training video data comprises:
    grouping the training video data, each group comprising one piece of reference video data and multiple pieces of distorted video data, all video data in a group having the same resolution and the same frame rate;
    classifying the video data in each group;
    grading the video data of each class in each group; and
    determining the MOS value of each piece of the training video data according to the grouping, classification, and grading of the training video data.
  10. A video quality assessment method, comprising:
    processing video data to be assessed with a final quality assessment model obtained by training according to the method of any one of claims 1 to 9, to obtain a quality assessment score of the video data to be assessed.
  11. A model training device for video quality assessment, comprising:
    an acquisition module configured to acquire training video data, wherein the training video data comprises reference video data and distorted video data;
    a processing module configured to determine a mean opinion score (MOS) value of each piece of the training video data; and
    a training module configured to train a preset initial video quality assessment model according to the training video data and its MOS values until a convergence condition is reached, to obtain a final video quality assessment model.
  12. A video quality assessment device, comprising:
    an assessment module configured to process video data to be assessed with a final quality assessment model obtained by training according to the model training method for video quality assessment of any one of claims 1 to 9, to obtain a quality assessment score of the video data to be assessed.
  13. An electronic device, comprising:
    one or more processors; and
    a storage device on which one or more programs are stored;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the model training method for video quality assessment of any one of claims 1 to 9.
  14. An electronic device, comprising:
    one or more processors; and
    a storage device on which one or more programs are stored;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the video quality assessment method of claim 10.
  15. A computer storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the model training method for video quality assessment of any one of claims 1 to 9.
  16. A computer storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the video quality assessment method of claim 10.
PCT/CN2022/116480 2021-09-09 2022-09-01 Model training method, video quality assessment method, apparatus, device and medium WO2023036045A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020247009562A KR20240052000A (ko) 2021-09-09 2022-09-01 모델 훈련 방법, 비디오 품질 평가 방법, 장치, 설비 및 매체

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111055446.0 2021-09-09
CN202111055446.0A CN115775218A (zh) Model training method, video quality assessment method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
WO2023036045A1 true WO2023036045A1 (zh) 2023-03-16

Family

ID=85387481

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/116480 WO2023036045A1 (zh) Model training method, video quality assessment method, apparatus, device and medium

Country Status (3)

Country Link
KR (1) KR20240052000A (zh)
CN (1) CN115775218A (zh)
WO (1) WO2023036045A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079081A (zh) * 2023-10-16 2023-11-17 山东海博科技信息系统股份有限公司 Multi-modal video-text processing model training method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190258902A1 (en) * 2018-02-16 2019-08-22 Spirent Communications, Inc. Training A Non-Reference Video Scoring System With Full Reference Video Scores
CN110674925A (zh) * 2019-08-29 2020-01-10 厦门大学 No-reference VR video quality assessment method based on a 3D convolutional neural network
CN110751649A (zh) * 2019-10-29 2020-02-04 腾讯科技(深圳)有限公司 Video quality assessment method and apparatus, electronic device, and storage medium
CN110958467A (zh) * 2019-11-21 2020-04-03 清华大学 Video quality prediction method and apparatus, and electronic device
CN111524110A (zh) * 2020-04-16 2020-08-11 北京微吼时代科技有限公司 Video quality evaluation model construction method, evaluation method, and apparatus
CN113196761A (zh) * 2018-10-19 2021-07-30 三星电子株式会社 Method and apparatus for assessing subjective quality of video
CN114598864A (zh) * 2022-03-12 2022-06-07 中国传媒大学 Deep-learning-based full-reference objective quality assessment method for ultra-high-definition video

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079081A (zh) * 2023-10-16 2023-11-17 山东海博科技信息系统股份有限公司 Multi-modal video-text processing model training method and system
CN117079081B (zh) * 2023-10-16 2024-01-26 山东海博科技信息系统股份有限公司 Multi-modal video-text processing model training method and system

Also Published As

Publication number Publication date
KR20240052000A (ko) 2024-04-22
CN115775218A (zh) 2023-03-10

Similar Documents

Publication Publication Date Title
Kang et al. Robust median filtering forensics using an autoregressive model
CN111079539B Video anomalous behavior detection method based on anomaly tracking
CN111178120B Pest image detection method based on crop-identification cascade technology
TWI729861B Apparatus and method for handling anomaly detection
CN110457524B Model generation method, video classification method, and apparatus
WO2023036045A1 Model training method, video quality assessment method, apparatus, device and medium
Goh et al. A hybrid evolutionary algorithm for feature and ensemble selection in image tampering detection
US20230066499A1 (en) Method for establishing defect detection model and electronic apparatus
CN108830829B No-reference quality assessment algorithm combining multiple edge detection operators
CN111931857A Low-illumination object detection method based on MSCFF
CN112686869A Fabric defect detection method and apparatus
Li et al. Robust median filtering detection based on the difference of frequency residuals
CN114862857A Industrial product appearance anomaly detection method and system based on two-stage learning
CN113888477A Network model training method, metal surface defect detection method, and electronic device
Agarwal et al. Median filtering forensics based on optimum thresholding for low-resolution compressed images
CN112418229A Real-time segmentation method for maritime scene images of unmanned surface vehicles based on deep learning
CN116152577B Image classification method and apparatus
Rodríguez-Lois et al. A Critical Look into Quantization Table Generalization Capabilities of CNN-based Double JPEG Compression Detection
CN112949344B Feature autoregression method for anomaly detection
CN114897842A Infrared small-target segmentation and detection method based on a texture-enhancement network
Hebbar et al. A Deep Learning Framework with Transfer Learning Approach for Image Forgery Localization
CN114155198A Quality assessment method and apparatus for dehazed images
Zhu et al. Recaptured image detection through multi-resolution residual-based correlation coefficients
Sornalatha et al. Detecting contrast enhancement based image forgeries by parallel approach
Okarma Image and video quality assessment with the use of various verification databases

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22866509

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20247009562

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2022866509

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022866509

Country of ref document: EP

Effective date: 20240322

NENP Non-entry into the national phase

Ref country code: DE