CN111479107A - No-reference audio and video joint quality evaluation method based on natural audio and video statistics - Google Patents


Info

Publication number
CN111479107A
Authority
CN
China
Prior art keywords
video
audio
natural
statistical model
quality
Prior art date
Legal status
Granted
Application number
CN202010171587.8A
Other languages
Chinese (zh)
Other versions
CN111479107B (en)
Inventor
闵雄阔
翟广涛
杨小康
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010171587.8A
Publication of CN111479107A
Application granted
Publication of CN111479107B
Legal status: Active
Anticipated expiration


Classifications

    • H04N17/00 Diagnosis, testing or measuring for television systems or their details
    • H04N21/233 Processing of audio elementary streams
    • H04N21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/4756 End-user interface for inputting end-user data for rating content, e.g. scoring a recommended movie


Abstract

The invention provides a no-reference audio-video joint quality evaluation method based on natural audio and video statistics. A natural video statistical model is extended to natural audio statistics, and a joint natural audio-video statistical model is then constructed, realizing no-reference joint quality evaluation of audio-video signals based on natural audio and video statistics. The method comprises the following steps: constructing a natural video statistical model; extending the natural video statistical model to natural audio statistics; constructing a joint natural audio-video statistical model from the natural video and natural audio statistical models; extracting audio and video quality features based on the natural audio statistics, the natural video statistics, and the joint natural audio-video statistics; and performing feature regression to obtain the final joint audio-video quality estimate. The no-reference audio-video joint quality evaluation method can effectively estimate the joint quality of the audio-video signal under test when the original audio-video signal is unknown.

Description

No-reference audio and video joint quality evaluation method based on natural audio and video statistics
Technical Field
The invention relates to the technical field of multimedia quality evaluation, and in particular to a no-reference audio-video joint quality evaluation method based on natural audio and video statistical models.
Background
In recent years, multimedia quality evaluation has attracted the attention of many researchers in the fields of audio processing, video processing, and the like. According to the type of signal to be evaluated, multimedia quality evaluation can be divided into visual quality assessment (image/video quality assessment) and audio quality assessment. Over the past decades, researchers have proposed a number of objective visual quality assessment algorithms. A search of the prior art finds the following:
Lin and Kuo surveyed perceptual visual quality metrics in W. Lin and C.-C. J. Kuo, "Perceptual visual quality metrics: A survey," Journal of Visual Communication and Image Representation, vol. 22, no. 4, pp. 297-312, 2011. Wang and Bovik discussed signal fidelity measures in Z. Wang and A. C. Bovik, "Mean squared error: Love it or leave it? A new look at signal fidelity measures," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98-117, 2009, and reviewed reduced-reference and no-reference methods in Z. Wang and A. C. Bovik, "Reduced- and no-reference image quality assessment," IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 29-40, 2011. Comparable objective quality metrics have also been proposed for audio signals.
Although the quality evaluation techniques described above achieve encouraging results, they mostly evaluate the quality of multimedia signals of a single modality, such as a single image, video, or audio signal, ignoring the interaction and fusion between audiovisual multimodal signals. Compared with the extensive research on single-modality quality evaluation, audio-visual cross-modality quality evaluation has received less attention, even though audiovisual multimodal signals are closer to practical application scenarios. A review of audio-visual quality assessment is given by You et al. in J. You, U. Reiter, M. M. Hannuksela, M. Gabbouj, and A. Perkis, "Perceptual-based quality assessment for audio-visual services: A survey," Signal Processing: Image Communication, vol. 25, no. 7, pp. 482-501, 2010. Such evaluation techniques generally require fundamental studies of multimodal perception, i.e. of the interaction between audiovisual signals and of the other factors affecting audiovisual quality, and these studies are usually carried out through subjective audiovisual experiments. In general, these techniques are not based on content analysis but estimate the audio-video quality directly from parameters such as bit rate and encoder type, so their application scenarios are very limited.
Bovik et al. proposed a series of blind quality assessment methods based on natural scene statistics: A. K. Moorthy and A. C. Bovik, "Blind image quality assessment: From natural scene statistics to perceptual quality," IEEE Trans. Image Process., vol. 20, no. 12, pp. 3350-3364, Dec. 2011; M. A. Saad, A. C. Bovik, and C. Charrier, "Blind image quality assessment: A natural scene statistics approach in the DCT domain," IEEE Trans. Image Process., vol. 21, no. 8, pp. 3339-3352, Aug. 2012; and A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695-4708, Dec. 2012. These methods use natural video statistical models to evaluate image and video quality without a reference. However, the natural video statistical models in the above methods are applicable only to images and videos, and so are the methods designed around them.
At present, no research or method extends such natural video statistical models to audio and further constructs a joint natural audio-video statistical model; no-reference audio-video joint quality evaluation based on natural audio and video statistics has therefore not yet been realized.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention aims to provide a no-reference audio-video joint quality evaluation method based on natural audio and video statistical models (natural audio-video statistics for short): a natural video statistical model is extended to natural audio statistics, and a joint natural audio-video statistical model is further constructed, thereby realizing no-reference audio-video joint quality evaluation based on natural audio and video statistics.
The invention is realized by the following technical scheme.
A no-reference audio and video joint quality evaluation method based on natural audio and video statistics comprises the following steps:
S1: constructing a natural video statistical model for an input video signal, wherein the natural video statistical model is used for statistical modeling of the video;
S2: extending the natural video statistical model obtained in S1 to natural audio statistics, and constructing a natural audio statistical model for the input audio signal, wherein the natural audio statistical model is used for statistical modeling of the audio;
S3: constructing a natural audio-video joint statistical model from the natural video statistical model obtained in S1 and the natural audio statistical model obtained in S2, wherein the natural audio-video joint statistical model is used for joint statistical modeling of the video and the audio;
S4: extracting audio and video quality features based on the natural audio statistical model, the natural video statistical model, and the natural audio-video joint statistical model, respectively;
S5: performing feature regression on the audio and video quality features obtained in S4 to obtain the final joint audio-video quality estimate.
Preferably, in S1, regularization processing is performed on the input video signal, and a natural video statistical model of the spatial domain is constructed from the regularized video signal; the method comprises the following steps:

Regularize the input video signal:

$$\hat{I}(i,j) = \frac{I(i,j) - \mu(i,j)}{\sigma(i,j) + c}$$

where $I(i,j)$ is the original video signal, $\hat{I}(i,j)$ is the regularized video signal, $i,j$ are pixel indices, and $c$ is a constant, set according to the dynamic range of the video signal, that keeps the division stable; $\mu(i,j)$ and $\sigma(i,j)$ are the local mean and standard deviation of the video signal, respectively:

$$\mu(i,j) = \sum_{k=-K}^{K}\sum_{l=-L}^{L} w_{k,l}\, I(i+k,\, j+l)$$

$$\sigma(i,j) = \sqrt{\sum_{k=-K}^{K}\sum_{l=-L}^{L} w_{k,l}\,\big[I(i+k,\, j+l) - \mu(i,j)\big]^2}$$

where $w_{k,l}$, $k = -K,\dots,K$, $l = -L,\dots,L$, is a two-dimensional local Gaussian window.

Model the regularized video signal $\hat{I}(i,j)$ with natural video statistics. The regularized video signal is described by a generalized Gaussian distribution:

$$f(x;\alpha,\sigma^2) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\!\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

where $f(x;\alpha,\sigma^2)$ is the probability density function obeyed by the pixel values of the regularized video signal; $x$ is a pixel value of $\hat{I}(i,j)$; $\alpha$ is a parameter controlling the shape of the distribution and $\sigma^2$ is a parameter controlling the variance of the distribution; $\Gamma(\cdot)$ is the gamma function:

$$\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\, dt, \quad a > 0$$

and $\beta$ is given by:

$$\beta = \sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}}$$

The products of pairs of adjacent samples of the regularized video signal, i.e.

$$\hat{I}(i,j)\hat{I}(i,j+1),\quad \hat{I}(i,j)\hat{I}(i+1,j),\quad \hat{I}(i,j)\hat{I}(i+1,j+1),\quad \hat{I}(i,j)\hat{I}(i+1,j-1),$$

are described by an asymmetric generalized Gaussian distribution:

$$f(x;\nu,\sigma_l^2,\sigma_r^2) = \begin{cases} \dfrac{\nu}{(\beta_l+\beta_r)\,\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{-x}{\beta_l}\right)^{\nu}\right), & x < 0 \\[2ex] \dfrac{\nu}{(\beta_l+\beta_r)\,\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{x}{\beta_r}\right)^{\nu}\right), & x \geq 0 \end{cases}$$

where $f(x;\nu,\sigma_l^2,\sigma_r^2)$ is the probability density function obeyed by the products of adjacent pixel values of the regularized video signal; $\nu$ is a parameter controlling the shape of the distribution; $\sigma_l^2$ is a parameter controlling the variance of the left side of the distribution, and $\sigma_r^2$ is a parameter controlling the variance of the right side; $\beta_l$ and $\beta_r$ are given by:

$$\beta_l = \sigma_l\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}}, \qquad \beta_r = \sigma_r\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}}$$
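As an illustrative sketch (not part of the claimed method), the spatial regularization of S1 can be implemented as below; the window half-width K and Gaussian scale, and the constant c, are assumed values chosen for the example:

```python
import numpy as np

def gaussian_window(K=3, sigma_w=7/6):
    # 2-D local Gaussian window w_{k,l}, k,l = -K..K, normalized to sum to 1
    ax = np.arange(-K, K + 1)
    xx, yy = np.meshgrid(ax, ax)
    w = np.exp(-(xx**2 + yy**2) / (2 * sigma_w**2))
    return w / w.sum()

def local_filter(img, w):
    # weighted local sum: out(i,j) = sum_{k,l} w[k,l] * img(i+k, j+l)
    K = w.shape[0] // 2
    padded = np.pad(img, K, mode="reflect")
    H, W = img.shape
    out = np.zeros((H, W), dtype=np.float64)
    for k in range(w.shape[0]):
        for l in range(w.shape[1]):
            out += w[k, l] * padded[k:k + H, l:l + W]
    return out

def regularize_frame(frame, c=1.0):
    # I_hat(i,j) = (I(i,j) - mu(i,j)) / (sigma(i,j) + c)
    frame = np.asarray(frame, dtype=np.float64)
    w = gaussian_window()
    mu = local_filter(frame, w)
    var = local_filter(frame**2, w) - mu**2
    sigma = np.sqrt(np.maximum(var, 0.0))
    return (frame - mu) / (sigma + c)
```

On natural content the output is approximately zero-mean with unit-order spread, which is what the generalized Gaussian model then describes.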
preferably, in S2, regularization processing is performed on the input audio signal, and a natural audio statistical model is constructed by using the regularized audio signal; the method comprises the following steps:
regularizing an input audio signal:
Figure BDA00024093699800000412
wherein a (t) is an original audio signal,
Figure BDA00024093699800000413
for the audio signal after regularization, t is a time sequence index, and k is a constant for keeping the division equation stable, which is set according to the dynamic range of the audio signal; μ (t) and σ (t) represent the local mean and standard deviation, respectively, of the audio signal:
Figure BDA00024093699800000414
Figure BDA00024093699800000415
in the formula, wττ ═ T, …, T representing a groupA one-dimensional local Gaussian window;
regularized audio signals using natural audio statistical properties
Figure BDA00024093699800000416
Modeling a natural audio statistical model:
regularized audio signals using generalized Gaussian distribution
Figure BDA00024093699800000417
The description is that:
Figure BDA00024093699800000418
wherein f (x; α, sigma)2) Representing a probability density function to which sample values of the regularized audio signal are subjected; x represents a regularized audio signal
Figure BDA0002409369980000051
α represents a parameter for controlling the shape of the distribution, sigma represents a parameter, sigma2Variance for the control distribution; (. cndot.) represents the gamma function:
Figure BDA0002409369980000052
β denotes the following parameters:
Figure BDA0002409369980000053
describing two samples adjacent to the audio signal after regularization by adopting asymmetric generalized Gaussian distribution
Figure BDA0002409369980000054
And
Figure BDA0002409369980000055
the product between, i.e.
Figure BDA0002409369980000056
Figure BDA0002409369980000057
In the formula (I), the compound is shown in the specification,
Figure BDA0002409369980000058
representing a probability density function to which products of adjacent sample values of the regularized audio signal obey, v representing a parameter for controlling the shape of the distribution; sigmalIt is indicated that one of the parameters,
Figure BDA0002409369980000059
variance for controlling left-hand distribution; sigmarIt is indicated that one of the parameters,
Figure BDA00024093699800000510
for controlling the variance of the right distribution βlAnd βrThe following parameters are indicated:
Figure BDA00024093699800000511
Figure BDA00024093699800000512
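The one-dimensional audio regularization of S2 can be sketched analogously; the window half-width T, its Gaussian scale, and the constant k are assumed values:

```python
import numpy as np

def regularize_audio(a, T=8, sigma_w=3.0, k=1e-3):
    # A_hat(t) = (A(t) - mu(t)) / (sigma(t) + k), using a 1-D local
    # Gaussian window w_tau, tau = -T..T, normalized to sum to 1
    a = np.asarray(a, dtype=np.float64)
    tau = np.arange(-T, T + 1)
    w = np.exp(-tau**2 / (2 * sigma_w**2))
    w /= w.sum()
    padded = np.pad(a, T, mode="reflect")
    n = a.size
    mu = np.zeros(n)   # local mean mu(t)
    m2 = np.zeros(n)   # local second moment, for sigma(t)
    for i in range(w.size):
        seg = padded[i:i + n]
        mu += w[i] * seg
        m2 += w[i] * seg**2
    sigma = np.sqrt(np.maximum(m2 - mu**2, 0.0))
    return (a - mu) / (sigma + k)
```

The resulting $\hat{A}(t)$ is the quantity whose histogram the generalized Gaussian distribution above models.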
preferably, in S3, constructing a natural audio and video joint statistical model includes:
for each pixel of each frame of the video signal after regularization, randomly selecting a sample from the most adjacent section of audio clip of the frame video to pair with each pixel in pairs to form a sample pair; and carrying out regularization treatment on the sample pairs, and constructing a natural audio and video joint statistical model by using the sample pairs subjected to regularization treatment.
Preferably, in S3, the natural audio and video joint statistical model is described using a two-dimensional generalized gaussian distribution:
Figure BDA00024093699800000513
in the formula (f)x(x; s, Σ) represents a probability density function to which the regularized sample pair is subjected, x represents the regularized sample pair, s is a shape parameter, Σ is a scale parameter, and d represents the dimension of x; (. cndot.) represents the gamma function:
Figure BDA0002409369980000061
where parameter s is a scalar and parameter Σ is a matrix of 2 × 2;
describing a sample pair formed by a video pixel product and an audio sample product adjacent to the regularized sample pair by adopting two-dimensional generalized Gaussian distribution:
Figure BDA0002409369980000062
in the formula (f)x(x, s, Σ) represents a probability density function of a sample pair formed by the video pixel product and the audio sample product after the regularization processing, x represents a sample pair formed by the video pixel product and the audio sample product after the regularization processing, s is a shape parameter, Σ is a scale parameter, and d represents the dimension of x; (. cndot.) represents the gamma function:
Figure BDA0002409369980000063
where parameter s is a scalar and parameter Σ is a matrix of 2 × 2;
wherein: the sample pairs formed by the video pixel products and the audio sample products are distributed in four quadrants of the distribution formed by the regularized sample pairs; the four quadrants are respectively: the neighboring video pixel product is greater than zero and the neighboring audio sample product is greater than zero, the video pixel product is greater than zero and the neighboring audio sample product is less than zero, the video pixel product is less than zero and the neighboring audio sample product is greater than zero, the video pixel product is less than zero and the neighboring audio sample product is less than zero.
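The pixel-to-audio-sample pairing of S3 can be sketched as below; the data layout (one regularized frame plus its nearest regularized audio segment) and the NumPy random generator are assumptions of the example:

```python
import numpy as np

def make_av_sample_pairs(frame_hat, audio_hat, rng):
    # Pair each pixel of the regularized frame with a randomly chosen
    # sample of the regularized audio segment nearest in time to the frame.
    pixels = np.asarray(frame_hat, dtype=np.float64).ravel()
    idx = rng.integers(0, len(audio_hat), size=pixels.size)
    return np.stack([pixels, np.asarray(audio_hat, dtype=np.float64)[idx]],
                    axis=1)  # shape (num_pixels, 2)
```

The resulting two-column array is the set of sample pairs whose empirical distribution the two-dimensional generalized Gaussian then models.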
Preferably, in S4, extracting the audio quality features based on the natural audio statistical model comprises:

extracting from the natural audio statistical model the distribution parameters that describe audio quality: the shape parameter $\alpha$ and variance parameter $\sigma^2$ of the generalized Gaussian distribution, and the shape parameter $\nu$, left variance parameter $\sigma_l^2$, right variance parameter $\sigma_r^2$, and mean parameter $\eta$ of the asymmetric generalized Gaussian distribution, where:

$$\eta = (\beta_r - \beta_l)\,\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}$$

Preferably, in S4, extracting the video quality features based on the natural video statistical model comprises:

extracting from the natural video statistical model the distribution parameters that describe video quality: the shape parameter $\alpha$ and variance parameter $\sigma^2$ of the generalized Gaussian distribution, and the shape parameter $\nu$, left variance parameter $\sigma_l^2$, right variance parameter $\sigma_r^2$, and mean parameter $\eta$ of the asymmetric generalized Gaussian distribution, with $\eta$ defined as above.
preferably, in S4, extracting an audio-video joint feature of the natural audio-video joint statistical model includes:
extracting joint distribution parameters for describing audio and video quality from a natural audio and video joint statistical model; the two-dimensional generalized Gaussian-distributed shape parameter s and the scale parameter Σ are used to describe the quality of audio and video.
Preferably, S4 further comprises: down-sampling the input audio signal and then extracting audio quality features at multiple scales; and/or
taking the difference between adjacent video frames and/or between adjacent audio samples, and then extracting the video quality features and/or audio quality features from the differences, respectively.
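The generalized-Gaussian family parameters named in S4 are commonly estimated by moment matching; the sketch below uses that estimator for the asymmetric case, which is an assumption since the method does not fix a particular estimator:

```python
import numpy as np
from math import gamma as G

def aggd_features(x):
    # Moment-matching estimate of the asymmetric generalized Gaussian
    # parameters (nu, sigma_l^2, sigma_r^2) and the mean parameter
    # eta = (beta_r - beta_l) * Gamma(2/nu) / Gamma(1/nu).
    x = np.asarray(x, dtype=np.float64)
    left, right = x[x < 0], x[x >= 0]
    sl = np.sqrt(np.mean(left**2)) if left.size else 1e-6
    sr = np.sqrt(np.mean(right**2)) if right.size else 1e-6
    gamma_hat = sl / sr
    r_hat = np.mean(np.abs(x))**2 / np.mean(x**2)
    R_hat = r_hat * (gamma_hat**3 + 1) * (gamma_hat + 1) / (gamma_hat**2 + 1)**2
    # invert rho(nu) = Gamma(2/nu)^2 / (Gamma(1/nu) * Gamma(3/nu)) on a grid
    nus = np.arange(0.2, 10.0, 0.001)
    rho = np.array([G(2 / v)**2 / (G(1 / v) * G(3 / v)) for v in nus])
    nu = float(nus[np.argmin((rho - R_hat)**2)])
    bl = sl * np.sqrt(G(1 / nu) / G(3 / nu))
    br = sr * np.sqrt(G(1 / nu) / G(3 / nu))
    eta = (br - bl) * G(2 / nu) / G(1 / nu)
    return nu, sl**2, sr**2, eta
```

Applied to the adjacent-sample products of the regularized signals, the returned tuple supplies the $(\nu, \sigma_l^2, \sigma_r^2, \eta)$ quality features.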
Preferably, in S5, feature regression is performed on all the audio and video quality features extracted in S4 to obtain a single quality score describing the joint audio-video quality; the feature regression adopts a machine-learning feature-fusion method or a neural-network-based deep-learning feature-fusion method.
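S5 leaves the regressor open (machine-learning or deep-learning feature fusion); as a minimal stand-in for the sketch, a ridge regressor over the concatenated feature vector can be used — the regressor choice here is an assumption, not the claimed method:

```python
import numpy as np

def train_quality_regressor(F, y, lam=1e-2):
    # Ridge regression mapping feature vectors F (n x d) to subjective
    # quality scores y; closed-form solve of the regularized normal equations.
    Fb = np.hstack([F, np.ones((F.shape[0], 1))])  # append bias column
    w = np.linalg.solve(Fb.T @ Fb + lam * np.eye(Fb.shape[1]), Fb.T @ y)
    return w

def predict_quality(F, w):
    # Predicted joint audio-video quality score for each feature vector.
    Fb = np.hstack([F, np.ones((F.shape[0], 1))])
    return Fb @ w
```

In practice the regressor would be trained on feature vectors extracted from a subjectively rated audio-video database; support vector regression or a small neural network are typical alternatives.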
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a no-reference audio and video joint quality evaluation method based on natural audio and video statistics, which is inspired by a visual quality evaluation method based on natural video statistics, and realizes no-reference audio and video joint quality evaluation based on natural audio and video statistics by popularizing a related natural video statistical model to natural audio statistics and further constructing a natural audio and video joint statistical model; the method for evaluating the joint quality of the audio and video signals without the reference based on the natural audio and video statistics can effectively estimate the joint quality of the audio and video signals to be measured under the condition that the original audio and video signals are unknown.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a general flowchart of a non-reference audio/video joint quality evaluation method based on natural audio/video statistics according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a natural audio/video joint statistical model construction provided in an embodiment of the present invention;
FIG. 3 is a distribution diagram of normalized audio and video signal sample pairs of different compression levels according to an embodiment of the present invention;
FIG. 4 is a sample pair distribution diagram of neighboring video pixel products and neighboring audio sample products according to an embodiment of the present invention.
Detailed Description
The following embodiments illustrate the invention in detail. The embodiments are implemented on the premise of the technical scheme of the invention, and a detailed implementation mode and specific operation process are given. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention.
The invention provides a no-reference audio and video joint quality evaluation method based on a natural audio and video statistical model, which comprises the following steps:
the first step is as follows: constructing a natural video statistical model for an input video signal, wherein the natural video statistical model is used for performing statistical modeling on the video signal;
the second step is that: the natural video statistical model obtained in the first step is popularized to natural audio statistics, and a natural audio statistical model is constructed for the input audio signals and used for carrying out statistical modeling on the audio signals;
the third step: constructing a natural audio and video joint statistical model by using the natural video statistical model obtained in the first step and the natural audio statistical model obtained in the second step, wherein the natural audio and video joint statistical model is used for performing joint statistical modeling on a video signal and an audio signal;
the fourth step: respectively extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model;
the fifth step: and performing characteristic regression operation on the audio and video quality characteristics obtained in the fourth step to obtain the final audio and video joint quality estimation.
The detailed steps of the no-reference audio/video joint quality evaluation method based on natural audio/video statistics provided by the embodiment of the invention are further explained below with reference to the accompanying drawings.
As shown in fig. 1, a method provided in an embodiment of the present invention includes:
firstly, constructing a natural video statistical model
The natural video statistical model constructed in the embodiment of the invention is a spatial-domain model. The specific process comprises: regularizing the input video signal, and performing natural video statistical modeling on the regularized video signal.

The regularization of the input video signal is:

$$\hat{I}(i,j) = \frac{I(i,j) - \mu(i,j)}{\sigma(i,j) + c}$$

where $I(i,j)$ is the original video signal, $\hat{I}(i,j)$ is the regularized video signal, $i,j$ are pixel indices, and $c$ is a constant, set according to the dynamic range of the video signal, that keeps the division stable; $\mu(i,j)$ and $\sigma(i,j)$ are the local mean and standard deviation of the video signal, respectively:

$$\mu(i,j) = \sum_{k=-K}^{K}\sum_{l=-L}^{L} w_{k,l}\, I(i+k,\, j+l)$$

$$\sigma(i,j) = \sqrt{\sum_{k=-K}^{K}\sum_{l=-L}^{L} w_{k,l}\,\big[I(i+k,\, j+l) - \mu(i,j)\big]^2}$$

where $w_{k,l}$, $k = -K,\dots,K$, $l = -L,\dots,L$, is a two-dimensional local Gaussian window.

The statistical modeling of the natural video statistical model from the regularized video signal $\hat{I}(i,j)$ proceeds as follows. After the raw natural video is regularized as described above, $\hat{I}(i,j)$ generally obeys a Gaussian distribution, and video distortion forces $\hat{I}(i,j)$ to deviate from this Gaussian distribution; both the Gaussian distribution of natural video and the distribution of processed, distorted video (i.e. the video to be tested) can be described by a generalized Gaussian distribution:

$$f(x;\alpha,\sigma^2) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\!\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

where

$$\beta = \sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}}$$

and $\Gamma(\cdot)$ is the gamma function

$$\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\, dt, \quad a > 0,$$

in which $\alpha$ controls the shape of the distribution and $\sigma^2$ controls the variance of the distribution.

Besides $\hat{I}(i,j)$ itself, the products of adjacent samples of the regularized video signal, i.e.

$$\hat{I}(i,j)\hat{I}(i,j+1),\quad \hat{I}(i,j)\hat{I}(i+1,j),\quad \hat{I}(i,j)\hat{I}(i+1,j+1),\quad \hat{I}(i,j)\hat{I}(i+1,j-1),$$

also obey an asymmetric generalized Gaussian distribution:

$$f(x;\nu,\sigma_l^2,\sigma_r^2) = \begin{cases} \dfrac{\nu}{(\beta_l+\beta_r)\,\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{-x}{\beta_l}\right)^{\nu}\right), & x < 0 \\[2ex] \dfrac{\nu}{(\beta_l+\beta_r)\,\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{x}{\beta_r}\right)^{\nu}\right), & x \geq 0 \end{cases}$$

where

$$\beta_l = \sigma_l\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}}, \qquad \beta_r = \sigma_r\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}},$$

in which the shape parameter $\nu$ controls the shape of the distribution, and $\sigma_l^2$ and $\sigma_r^2$ control the variances of the left and right sides of the distribution, respectively.
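Assuming the four adjacent-sample products are the horizontal, vertical, and two diagonal neighbor products standard in spatial-domain natural scene statistics models, they can be computed from a regularized frame as:

```python
import numpy as np

def pairwise_products(i_hat):
    # Neighboring-pixel products of the regularized frame i_hat along
    # four orientations; each array's empirical distribution is modeled
    # by the asymmetric generalized Gaussian.
    H  = i_hat[:, :-1] * i_hat[:, 1:]      # I_hat(i,j) * I_hat(i,j+1)
    V  = i_hat[:-1, :] * i_hat[1:, :]      # I_hat(i,j) * I_hat(i+1,j)
    D1 = i_hat[:-1, :-1] * i_hat[1:, 1:]   # I_hat(i,j) * I_hat(i+1,j+1)
    D2 = i_hat[:-1, 1:] * i_hat[1:, :-1]   # I_hat(i,j) * I_hat(i+1,j-1)
    return H, V, D1, D2
```

Fitting the asymmetric generalized Gaussian to each of the four product maps yields one parameter set per orientation.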
Secondly, the natural video statistical model is popularized to natural audio statistics
The specific process of popularizing a natural video statistical model to natural audio statistics includes regularizing an input audio signal and performing natural audio statistical modeling using the regularized audio signal.
The regularization process for the input audio signal is as follows:
Figure BDA0002409369980000101
wherein a (t) is an original audio signal,
Figure BDA0002409369980000102
for the audio signal after regularization, t is a time sequence index, c is a constant set according to the dynamic range of the audio signal for keeping the division equation stable, and μ (t) and σ (t) respectively represent the local mean and standard deviation of the audio signal
Figure BDA0002409369980000103
Figure BDA0002409369980000104
Wherein, wτT denotes a one-dimensional local gaussian window, ….
The statistical modeling process using the regularized audio signal \hat{a}(t) is as follows:

After the raw natural audio is regularized as described above, \hat{a}(t) generally obeys a Gaussian distribution, and audio distortion forces \hat{a}(t) to deviate from the Gaussian distribution. Both the Gaussian distribution of natural audio and the distribution of distorted audio can be described by a generalized Gaussian distribution:

$$f(x;\alpha,\sigma^2)=\frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

wherein

$$\beta=\sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}}$$

and Γ(·) represents the gamma function:

$$\Gamma(a)=\int_{0}^{\infty}t^{a-1}e^{-t}\,dt,\quad a>0$$

Here α controls the shape of the distribution and σ² controls the variance of the distribution.
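A common way to recover the generalized Gaussian parameters from data is moment matching over a grid of candidate shapes. The sketch below uses this standard estimator (as used in BRISQUE-style natural scene statistics work); function names and the grid are our choices.

```python
import math
import numpy as np

def ggd_pdf(x, alpha, sigma2):
    """Generalized Gaussian pdf f(x; alpha, sigma^2) with
    beta = sigma * sqrt(Gamma(1/alpha) / Gamma(3/alpha))."""
    beta = math.sqrt(sigma2) * math.sqrt(math.gamma(1 / alpha) / math.gamma(3 / alpha))
    coef = alpha / (2.0 * beta * math.gamma(1 / alpha))
    return coef * np.exp(-(np.abs(x) / beta) ** alpha)

def fit_ggd(x):
    """Moment-matching estimate of (alpha, sigma^2): choose the alpha whose
    theoretical ratio Gamma(1/a)Gamma(3/a)/Gamma(2/a)^2 matches the sample
    ratio E[x^2] / E[|x|]^2."""
    grid = np.arange(0.2, 10.0, 0.001)
    r_hat = np.mean(x ** 2) / np.mean(np.abs(x)) ** 2
    r_th = np.array([math.gamma(1 / a) * math.gamma(3 / a) / math.gamma(2 / a) ** 2
                     for a in grid])
    alpha = float(grid[np.argmin((r_th - r_hat) ** 2)])
    return alpha, float(np.mean(x ** 2))
```

For Gaussian input the estimated shape is close to 2, the Gaussian special case of the family.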
Besides \hat{a}(t) being describable by a generalized Gaussian distribution, the product of two adjacent regularized samples \hat{a}(t) and \hat{a}(t+1), i.e.

$$p(t)=\hat{a}(t)\,\hat{a}(t+1)$$

obeys the following asymmetric generalized Gaussian distribution:

$$f(x;\nu,\sigma_l^2,\sigma_r^2)=\begin{cases}\dfrac{\nu}{(\beta_l+\beta_r)\Gamma(1/\nu)}\exp\left(-\left(\dfrac{-x}{\beta_l}\right)^{\nu}\right), & x<0\\[2ex]\dfrac{\nu}{(\beta_l+\beta_r)\Gamma(1/\nu)}\exp\left(-\left(\dfrac{x}{\beta_r}\right)^{\nu}\right), & x\geq 0\end{cases}$$

wherein

$$\beta_l=\sigma_l\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}},\qquad \beta_r=\sigma_r\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}}$$

The shape parameter ν controls the shape of the distribution, while σ_l² and σ_r² control the variance of the left and right sides of the distribution, respectively.
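The asymmetric generalized Gaussian parameters can likewise be estimated by moment matching. The sketch below follows the standard estimator used in the BRISQUE literature; it is our illustration, not the patent's implementation, and it can be applied to the product series p(t).

```python
import math
import numpy as np

def fit_aggd(x):
    """Moment-matching estimate of the asymmetric GGD parameters
    (nu, sigma_l^2, sigma_r^2) plus the mean parameter
    eta = (beta_r - beta_l) * Gamma(2/nu) / Gamma(1/nu)."""
    left, right = x[x < 0], x[x >= 0]
    sig_l = math.sqrt(np.mean(left ** 2)) if left.size else 1e-6
    sig_r = math.sqrt(np.mean(right ** 2)) if right.size else 1e-6
    gam = sig_l / sig_r
    r_hat = np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)
    R_hat = r_hat * (gam ** 3 + 1.0) * (gam + 1.0) / (gam ** 2 + 1.0) ** 2
    grid = np.arange(0.2, 10.0, 0.001)
    rho = np.array([math.gamma(2 / v) ** 2 / (math.gamma(1 / v) * math.gamma(3 / v))
                    for v in grid])
    nu = float(grid[np.argmin((rho - R_hat) ** 2)])
    conv = math.sqrt(math.gamma(1 / nu) / math.gamma(3 / nu))
    eta = (sig_r - sig_l) * conv * math.gamma(2 / nu) / math.gamma(1 / nu)
    return nu, sig_l ** 2, sig_r ** 2, eta
```

For symmetric Gaussian input the estimate returns a shape near 2 and a mean parameter near zero.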
Thirdly, constructing a natural audio and video joint statistical model by utilizing the natural video statistical model and the natural audio statistical model
The specific process of constructing the natural audio and video joint statistical model by using the natural video statistical model and the natural audio statistical model is as follows:
As shown in Fig. 2, for each pixel in each frame of the video, one sample is randomly selected from the audio segment temporally nearest to that video frame and paired with the pixel to form a sample pair. The sample pairs are regularized, and the natural audio and video joint statistical model is constructed from the regularized sample pairs.
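The pixel-to-audio pairing can be sketched as follows. NumPy and the function name `pair_pixels_with_audio` are our assumptions; the patent does not prescribe an implementation.

```python
import numpy as np

def pair_pixels_with_audio(frame, audio_segment, rng=None):
    """For every pixel of one video frame, randomly draw a sample from the
    temporally nearest audio segment to form (pixel, audio) sample pairs."""
    rng = np.random.default_rng(0) if rng is None else rng
    pixels = frame.ravel().astype(float)
    idx = rng.integers(0, audio_segment.size, size=pixels.size)
    return np.stack([pixels, audio_segment[idx]], axis=1)  # shape (N, 2)
```

The returned N x 2 array is what the two-dimensional statistical model below is fitted to.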
Specifically, the regularized audio and video signal sample pairs generally follow a two-dimensional Gaussian distribution, and audio and video distortion forces the distribution of the regularized sample pairs to deviate from the two-dimensional Gaussian distribution. Both the two-dimensional Gaussian distribution of natural audio and video and the distribution of distorted audio and video can be described by a two-dimensional generalized Gaussian distribution:

$$f_{x}(x;s,\Sigma)=\frac{\Gamma(d/2)\,s}{\pi^{d/2}\,\Gamma\!\left(\frac{d}{2s}\right)2^{\frac{d}{2s}}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}\left(x^{T}\Sigma^{-1}x\right)^{s}\right)$$

where s is the shape parameter, Σ is the scale parameter, d represents the dimension of x, and Γ(·) represents the gamma function:

$$\Gamma(a)=\int_{0}^{\infty}t^{a-1}e^{-t}\,dt,\quad a>0$$
In the embodiment of the invention, the parameter s of the two-dimensional generalized Gaussian distribution is a scalar and the parameter Σ is a 2 × 2 matrix. The distribution of the regularized audio and video signal sample pairs is shown in Fig. 3; it can be seen that the two-dimensional generalized Gaussian distribution describes this distribution very well.
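The two-dimensional generalized Gaussian density can be evaluated directly from its standard form. The sketch below assumes that standard multivariate GGD parameterization (shape s, scale matrix Σ, d = 2); for s = 1 it reduces to the bivariate normal density, which gives a convenient sanity check.

```python
import math
import numpy as np

def mggd_pdf(X, s, Sigma):
    """Two-dimensional generalized Gaussian density f(x; s, Sigma):
    s is a scalar shape parameter, Sigma a 2x2 scale matrix, d = 2."""
    d = 2
    X = np.atleast_2d(X)                       # (n, 2)
    q = np.einsum("ni,ij,nj->n", X, np.linalg.inv(Sigma), X)  # x^T Sigma^-1 x
    coef = (math.gamma(d / 2) * s) / (
        math.pi ** (d / 2) * math.gamma(d / (2 * s)) * 2 ** (d / (2 * s))
        * math.sqrt(np.linalg.det(Sigma)))
    return coef * np.exp(-0.5 * q ** s)
```

With s = 1 and Σ = I the density at the origin equals 1/(2π), the standard bivariate normal value.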
In addition to the regularized sample pairs obeying the two-dimensional generalized Gaussian distribution, as shown in Fig. 4, the sample pairs formed from the products of adjacent video pixels and the products of adjacent audio samples also obey a regular distribution. This distribution can be described with a two-dimensional generalized Gaussian distribution in each of the four quadrants it forms, namely: the adjacent video pixel product is greater than zero and the adjacent audio sample product is greater than zero; the video pixel product is greater than zero and the audio sample product is less than zero; the video pixel product is less than zero and the audio sample product is greater than zero; and the video pixel product is less than zero and the audio sample product is less than zero.
Fourthly, extracting audio and video quality features based on the natural audio statistical model, the natural video statistical model and the natural audio and video joint statistical model
First, distribution parameters capable of describing audio quality are extracted from the generalized Gaussian distribution obeyed by \hat{a}(t) in the second step and from the asymmetric generalized Gaussian distribution obeyed by p(t). The shape parameter α and the variance parameter σ² of the generalized Gaussian distribution can describe audio quality; likewise, the shape parameter ν, the left variance parameter σ_l², the right variance parameter σ_r², and the following mean parameter of the asymmetric generalized Gaussian distribution can describe audio quality:

$$\eta=(\beta_r-\beta_l)\,\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}$$
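The mean parameter of the asymmetric generalized Gaussian follows directly from the fitted shape and variance parameters. A minimal sketch (the function name is ours):

```python
import math

def aggd_mean(nu, sigma_l2, sigma_r2):
    """Mean parameter eta = (beta_r - beta_l) * Gamma(2/nu) / Gamma(1/nu),
    with beta_{l,r} = sigma_{l,r} * sqrt(Gamma(1/nu) / Gamma(3/nu))."""
    conv = math.sqrt(math.gamma(1 / nu) / math.gamma(3 / nu))
    beta_l = math.sqrt(sigma_l2) * conv
    beta_r = math.sqrt(sigma_r2) * conv
    return (beta_r - beta_l) * math.gamma(2 / nu) / math.gamma(1 / nu)
```

For a symmetric distribution (equal left and right variances) the mean parameter is zero; a heavier right side makes it positive.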
Secondly, distribution parameters capable of describing video quality are extracted from the generalized Gaussian distribution obeyed by \hat{I}(i,j) in the first step and from the asymmetric generalized Gaussian distributions obeyed by H(i,j), V(i,j), D1(i,j) and D2(i,j). The shape parameter α and the variance parameter σ² of the generalized Gaussian distribution can describe video quality; likewise, the shape parameter ν, the left variance parameter σ_l², the right variance parameter σ_r², and the following mean parameter of the asymmetric generalized Gaussian distribution can describe video quality:

$$\eta=(\beta_r-\beta_l)\,\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}$$
Finally, distribution parameters capable of describing audio and video quality are extracted from the two-dimensional generalized Gaussian distribution obeyed by the regularized audio and video sample pairs, and from the two-dimensional generalized Gaussian distributions obeyed in the four quadrants by the sample pairs formed from adjacent video pixel products and adjacent audio sample products. The shape parameter s and the scale parameter Σ of each two-dimensional generalized Gaussian distribution can describe the quality of audio and video.
Fifthly, performing characteristic regression to obtain final audio and video joint quality estimation
Finally, all of the audio and video quality features based on the natural audio statistical model, the natural video statistical model and the natural audio and video joint statistical model from the fourth step are regressed to obtain a single quality score describing the joint audio and video quality. The feature regression can use a simple machine learning fusion method such as a support vector machine or a random forest, or a more complex deep learning fusion method such as a neural network.
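The regression step can be sketched with any regressor mapping feature vectors to subjective scores. The code below deliberately uses ridge-regularized least squares as a minimal stand-in for the SVR, random-forest or neural-network fusion the text mentions; it is our illustration, not the patent's method.

```python
import numpy as np

def fit_quality_regressor(F, y, lam=1e-6):
    """Ridge-regularized least squares mapping feature matrix F (n x d)
    to subjective quality scores y (n,). Stand-in for SVR / random forest."""
    d = F.shape[1]
    w = np.linalg.solve(F.T @ F + lam * np.eye(d), F.T @ y)
    return w

def predict_quality(F, w):
    # one joint audio-video quality score per feature vector
    return F @ w
```

In practice the 80% training portion of the database fits the regressor and the held-out 20% is scored with `predict_quality`.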
The implementation effect is as follows:
To verify the no-reference audio and video joint quality evaluation method based on natural audio and video statistics provided by the above embodiment of the present invention, the algorithm may be tested on the LIVE-SJTU Audio and Video Quality Assessment (A/V-QA) Database. The LIVE-SJTU A/V-QA Database is an audio and video quality evaluation database containing 336 distorted audio-video sequences generated from 14 high-quality reference sequences using 24 audio-video distortion types/degrees. The 24 distortion conditions comprise all combinations of two video distortion types (compression, and compression plus scaling, each with four levels of distortion) and one audio distortion type (compression, with three levels of distortion).
The test uses 80% of the data in the LIVE-SJTU A/V-QA database for training and the remaining 20% for testing. The training-testing procedure may be randomly repeated 1000 times, and the median SRCC over the 1000 tests may be taken as the performance result of the algorithm.
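The evaluation protocol above can be sketched as repeated random splits with the median Spearman rank correlation (SRCC) as the summary statistic. The sketch implements SRCC directly (Pearson correlation of ranks, assuming no ties); `predict_fn` is a hypothetical callable of ours that trains on the training split and returns predictions for the test split.

```python
import numpy as np

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors
    (valid when there are no ties, as with continuous predictions)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def median_srcc(predict_fn, F, y, n_splits=1000, train_frac=0.8, seed=0):
    """Median SRCC over repeated random train/test splits, mirroring the
    80/20, 1000-repetition protocol described above."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        scores.append(srcc(predict_fn(F[tr], y[tr], F[te]), y[te]))
    return float(np.median(scores))
```

An oracle predictor that returns a perfectly rank-correlated feature yields a median SRCC of 1.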
The no-reference audio and video joint quality evaluation method based on the natural audio and video statistical model provided by the embodiment of the present invention comprises the five steps described above: constructing a natural video statistical model; generalizing the natural video statistical model to natural audio statistics; constructing a natural audio and video joint statistical model from the natural video statistical model and the natural audio statistical model; extracting audio and video quality features based on the natural audio statistical model, the natural video statistical model and the natural audio and video joint statistical model; and obtaining the final audio and video joint quality estimate by feature regression. The method can therefore effectively evaluate the joint quality of audio and video. By generalizing the related natural video statistical model to natural audio statistics and further constructing a natural audio and video joint statistical model, no-reference audio and video joint quality evaluation based on natural audio and video statistics is realized.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (10)

1. A no-reference audio and video joint quality evaluation method based on natural audio and video statistics is characterized by comprising the following steps:
s1: constructing a natural video statistical model for an input video signal, wherein the natural video statistical model is used for performing statistical modeling on a video;
s2: generalizing the natural video statistical model obtained in S1 to natural audio statistics, and constructing a natural audio statistical model for the input audio signal, wherein the natural audio statistical model is used for carrying out statistical modeling on audio;
s3: constructing a natural audio and video joint statistical model by using the natural video statistical model obtained in the S1 and the natural audio statistical model obtained in the S2, wherein the natural audio and video joint statistical model is used for performing joint statistical modeling on videos and audios;
s4: respectively extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model;
s5: and performing characteristic regression operation on the audio and video quality characteristics obtained in the S4 to obtain the final audio and video joint quality estimation.
2. The method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in claim 1, wherein in S1, input video signals are regularized, and a natural video statistical model of a spatial domain is constructed by using the regularized video signals; the method comprises the following steps:
carrying out regularization processing on an input video signal:

$$\hat{I}(i,j)=\frac{I(i,j)-\mu(i,j)}{\sigma(i,j)+c}$$

wherein I(i,j) is the original video signal, \hat{I}(i,j) is the regularized video signal, i, j are the pixel indices, and c is a constant set according to the dynamic range of the video signal for keeping the division stable; μ(i,j) and σ(i,j) represent the local mean and standard deviation, respectively, of the video signal:

$$\mu(i,j)=\sum_{k=-K}^{K}\sum_{l=-L}^{L}w_{k,l}\,I(i+k,j+l)$$

$$\sigma(i,j)=\sqrt{\sum_{k=-K}^{K}\sum_{l=-L}^{L}w_{k,l}\left(I(i+k,j+l)-\mu(i,j)\right)^{2}}$$

in the formula, w_{k,l}, k = −K, …, K, l = −L, …, L, represents a two-dimensional local Gaussian window;
modeling the natural video statistical model using the regularized video signal \hat{I}(i,j):

describing the regularized video signal \hat{I}(i,j) using a generalized Gaussian distribution:

$$f(x;\alpha,\sigma^2)=\frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

wherein f(x; α, σ²) represents the probability density function of the pixel values of the regularized video signal; x represents the regularized video signal \hat{I}(i,j); α is a parameter controlling the shape of the distribution; σ² controls the variance of the distribution; Γ(·) represents the gamma function:

$$\Gamma(a)=\int_{0}^{\infty}t^{a-1}e^{-t}\,dt,\quad a>0$$

and β denotes the following parameter:

$$\beta=\sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}};$$
describing with an asymmetric generalized Gaussian distribution the products of two adjacent samples of the regularized video signal, i.e.

$$H(i,j)=\hat{I}(i,j)\,\hat{I}(i,j+1)$$

$$V(i,j)=\hat{I}(i,j)\,\hat{I}(i+1,j)$$

$$D1(i,j)=\hat{I}(i,j)\,\hat{I}(i+1,j+1)$$

$$D2(i,j)=\hat{I}(i,j)\,\hat{I}(i+1,j-1)$$

which obey:

$$f(x;\nu,\sigma_l^2,\sigma_r^2)=\begin{cases}\dfrac{\nu}{(\beta_l+\beta_r)\Gamma(1/\nu)}\exp\left(-\left(\dfrac{-x}{\beta_l}\right)^{\nu}\right), & x<0\\[2ex]\dfrac{\nu}{(\beta_l+\beta_r)\Gamma(1/\nu)}\exp\left(-\left(\dfrac{x}{\beta_r}\right)^{\nu}\right), & x\geq 0\end{cases}$$

in the formula, f(x; ν, σ_l², σ_r²) represents the probability density function obeyed by the products of adjacent pixel values of the regularized video signal; ν is a parameter controlling the shape of the distribution; σ_l² controls the variance of the left side of the distribution; σ_r² controls the variance of the right side of the distribution; β_l and β_r denote the following parameters:

$$\beta_l=\sigma_l\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}},\qquad \beta_r=\sigma_r\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}}.$$
3. the no-reference audio-video joint quality evaluation method based on natural audio-video statistics as claimed in claim 1, wherein in S2, input audio signals are regularized, and a natural audio statistical model is constructed using the regularized audio signals; the method comprises the following steps:
regularizing an input audio signal:

$$\hat{a}(t)=\frac{a(t)-\mu(t)}{\sigma(t)+c}$$

wherein a(t) is the original audio signal, \hat{a}(t) is the regularized audio signal, t is the temporal sample index, and c is a constant set according to the dynamic range of the audio signal for keeping the division stable; μ(t) and σ(t) represent the local mean and standard deviation, respectively, of the audio signal:

$$\mu(t)=\sum_{\tau=-T}^{T}w_{\tau}\,a(t+\tau)$$

$$\sigma(t)=\sqrt{\sum_{\tau=-T}^{T}w_{\tau}\left(a(t+\tau)-\mu(t)\right)^{2}}$$

in the formula, w_τ, τ = −T, …, T, denotes a one-dimensional local Gaussian window;
modeling the natural audio statistical model using the regularized audio signal \hat{a}(t):

describing the regularized audio signal \hat{a}(t) using a generalized Gaussian distribution:

$$f(x;\alpha,\sigma^2)=\frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

wherein f(x; α, σ²) represents the probability density function of the sample values of the regularized audio signal; x represents the regularized audio signal \hat{a}(t); α is a parameter controlling the shape of the distribution; σ² controls the variance of the distribution; Γ(·) represents the gamma function:

$$\Gamma(a)=\int_{0}^{\infty}t^{a-1}e^{-t}\,dt,\quad a>0$$

and β denotes the following parameter:

$$\beta=\sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}};$$
describing with an asymmetric generalized Gaussian distribution the product of two adjacent samples \hat{a}(t) and \hat{a}(t+1) of the regularized audio signal, i.e.

$$p(t)=\hat{a}(t)\,\hat{a}(t+1)$$

which obeys:

$$f(x;\nu,\sigma_l^2,\sigma_r^2)=\begin{cases}\dfrac{\nu}{(\beta_l+\beta_r)\Gamma(1/\nu)}\exp\left(-\left(\dfrac{-x}{\beta_l}\right)^{\nu}\right), & x<0\\[2ex]\dfrac{\nu}{(\beta_l+\beta_r)\Gamma(1/\nu)}\exp\left(-\left(\dfrac{x}{\beta_r}\right)^{\nu}\right), & x\geq 0\end{cases}$$

in the formula, f(x; ν, σ_l², σ_r²) represents the probability density function obeyed by the products of adjacent sample values of the regularized audio signal; ν is a parameter controlling the shape of the distribution; σ_l² controls the variance of the left side of the distribution; σ_r² controls the variance of the right side of the distribution; β_l and β_r denote the following parameters:

$$\beta_l=\sigma_l\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}},\qquad \beta_r=\sigma_r\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}}.$$
4. the method for evaluating the joint quality of the non-reference audio and video based on the natural audio and video statistics as claimed in claim 1, wherein in the step S3, a natural audio and video joint statistical model is constructed, which comprises:
for each pixel of each frame of the regularized video signal, randomly selecting one sample from the audio segment temporally nearest to that frame and pairing it with the pixel to form a sample pair; regularizing the sample pairs, and constructing the natural audio and video joint statistical model from the regularized sample pairs.
5. The method for evaluating the joint quality of the audio and video without reference based on the natural audio and video statistics as claimed in claim 4, wherein in S3, a two-dimensional generalized Gaussian distribution is adopted to describe the joint statistical model of the natural audio and video:
$$f_{x}(x;s,\Sigma)=\frac{\Gamma(d/2)\,s}{\pi^{d/2}\,\Gamma\!\left(\frac{d}{2s}\right)2^{\frac{d}{2s}}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}\left(x^{T}\Sigma^{-1}x\right)^{s}\right)$$

in the formula, f_x(x; s, Σ) represents the probability density function obeyed by the regularized sample pairs, x represents a regularized sample pair, s is the shape parameter, Σ is the scale parameter, and d represents the dimension of x; Γ(·) represents the gamma function:

$$\Gamma(a)=\int_{0}^{\infty}t^{a-1}e^{-t}\,dt,\quad a>0$$

wherein the parameter s is a scalar and the parameter Σ is a 2 × 2 matrix;
describing with a two-dimensional generalized Gaussian distribution the sample pairs formed from the products of adjacent video pixels and the products of adjacent audio samples after regularization:

$$f_{x}(x;s,\Sigma)=\frac{\Gamma(d/2)\,s}{\pi^{d/2}\,\Gamma\!\left(\frac{d}{2s}\right)2^{\frac{d}{2s}}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}\left(x^{T}\Sigma^{-1}x\right)^{s}\right)$$

in the formula, f_x(x; s, Σ) represents the probability density function obeyed by the sample pairs formed from the video pixel products and the audio sample products after regularization, x represents such a sample pair, s is the shape parameter, Σ is the scale parameter, and d represents the dimension of x; Γ(·) represents the gamma function:

$$\Gamma(a)=\int_{0}^{\infty}t^{a-1}e^{-t}\,dt,\quad a>0$$

wherein the parameter s is a scalar and the parameter Σ is a 2 × 2 matrix;
wherein: the sample pairs formed by the video pixel products and the audio sample products are distributed in four quadrants of the distribution formed by the regularized sample pairs; the four quadrants are respectively: the neighboring video pixel product is greater than zero and the neighboring audio sample product is greater than zero, the video pixel product is greater than zero and the neighboring audio sample product is less than zero, the video pixel product is less than zero and the neighboring audio sample product is greater than zero, the video pixel product is less than zero and the neighboring audio sample product is less than zero.
6. The method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in claim 3, wherein in the step S4, the extracting of the audio quality features based on the natural audio statistical model includes:
extracting distribution parameters for describing audio quality from the natural audio statistical model, wherein the shape parameter α and the variance parameter σ² of the generalized Gaussian distribution are used to describe audio quality, and the shape parameter ν, the left variance parameter σ_l², the right variance parameter σ_r², and the mean parameter η of the asymmetric generalized Gaussian distribution are used to describe audio quality;

wherein:

$$\eta=(\beta_r-\beta_l)\,\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}.$$
7. the method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in claim 2, wherein in the step S4, extracting the video quality characteristics based on the natural video statistics model includes:
extracting distribution parameters for describing video quality from the natural video statistical model, wherein the shape parameter α and the variance parameter σ² of the generalized Gaussian distribution are used to describe video quality, and the shape parameter ν, the left variance parameter σ_l², the right variance parameter σ_r², and the mean parameter η of the asymmetric generalized Gaussian distribution are used to describe video quality;

wherein:

$$\eta=(\beta_r-\beta_l)\,\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}.$$
8. the method for evaluating the joint quality of the audio and video without reference based on the natural audio and video statistics as claimed in claim 5, wherein in the step S4, the extracting of the audio and video joint characteristics of the natural audio and video joint statistical model comprises the following steps:
extracting joint distribution parameters for describing audio and video quality from the natural audio and video joint statistical model, wherein the shape parameter s and the scale parameter Σ of the two-dimensional generalized Gaussian distribution are used to describe the quality of audio and video.
9. The method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in any one of claims 1 to 8, wherein the S4 further comprises: down-sampling the input audio signal, and then extracting audio quality features at a plurality of scales; and/or,
taking the difference between two adjacent video frames and/or two adjacent audio samples, and then extracting the video quality features and/or the audio quality features, respectively.
10. The no-reference audio and video joint quality evaluation method based on natural audio and video statistics as claimed in any one of claims 1 to 8, wherein in S5, feature regression is performed on all audio and video quality features extracted in S4 to obtain a single quality score describing the joint audio and video quality, wherein the feature regression adopts a machine learning feature fusion method or a deep learning feature fusion method such as a neural network.
CN202010171587.8A 2020-03-12 2020-03-12 No-reference audio and video joint quality evaluation method based on natural audio and video statistics Active CN111479107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010171587.8A CN111479107B (en) 2020-03-12 2020-03-12 No-reference audio and video joint quality evaluation method based on natural audio and video statistics

Publications (2)

Publication Number Publication Date
CN111479107A true CN111479107A (en) 2020-07-31
CN111479107B CN111479107B (en) 2021-06-08

Family

ID=71747429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010171587.8A Active CN111479107B (en) 2020-03-12 2020-03-12 No-reference audio and video joint quality evaluation method based on natural audio and video statistics

Country Status (1)

Country Link
CN (1) CN111479107B (en)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109302603A (en) * 2017-07-25 2019-02-01 中国移动通信集团北京有限公司 A kind of video speech quality appraisal procedure and device
CN108683909B (en) * 2018-07-12 2020-07-07 北京理工大学 VR audio and video integral user experience quality evaluation method
CN108933938A (en) * 2018-08-23 2018-12-04 北京奇艺世纪科技有限公司 A kind of video quality method of inspection, device and electronic equipment

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN111968677A (en) * 2020-08-21 2020-11-20 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid
CN111968677B (en) * 2020-08-21 2021-09-07 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid
CN113382232A (en) * 2021-08-12 2021-09-10 北京微吼时代科技有限公司 Method, device and system for monitoring audio and video quality and electronic equipment

Also Published As

Publication number Publication date
CN111479107B (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN109615582B (en) Face image super-resolution reconstruction method for generating countermeasure network based on attribute description
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
CN108537743B (en) Face image enhancement method based on generation countermeasure network
CN110555434B (en) Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
Li et al. No-reference image quality assessment with deep convolutional neural networks
CN111062314B (en) Image selection method and device, computer readable storage medium and electronic equipment
CN113112416B (en) Semantic-guided face image restoration method
Yang et al. Blind assessment for stereo images considering binocular characteristics and deep perception map based on deep belief network
Wu et al. VP-NIQE: An opinion-unaware visual perception natural image quality evaluator
CN111479107A (en) No-reference audio and video joint quality evaluation method based on natural audio and video statistics
CN108259893B (en) Virtual reality video quality evaluation method based on double-current convolutional neural network
CN111709914A (en) Non-reference image quality evaluation method based on HVS characteristics
Ji et al. Blind image quality assessment with semantic information
CN111882516B (en) Image quality evaluation method based on visual saliency and deep neural network
Krishnan et al. SwiftSRGAN-Rethinking super-resolution for efficient and real-time inference
CN111368734A (en) Micro expression recognition method based on normal expression assistance
CN111508528B (en) No-reference audio quality evaluation method and device based on natural audio statistical characteristics
Chang et al. LG-IQA: Integration of local and global features for no-reference image quality assessment
CN108492275B (en) No-reference stereo image quality evaluation method based on deep neural network
CN117058735A (en) Micro-expression recognition method based on parameter migration and optical flow feature extraction
CN112818950B (en) Lip language identification method based on generation of countermeasure network and time convolution network
Li et al. Unsupervised neural rendering for image hazing
CN114897884A (en) No-reference screen content image quality evaluation method based on multi-scale edge feature fusion
Kim et al. Cnn-based blind quality prediction on stereoscopic images via patch to image feature pooling
CN110930398B (en) Total reference video quality evaluation method based on Log-Gabor similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant