CN111479107A - No-reference audio and video joint quality evaluation method based on natural audio and video statistics - Google Patents
- Publication number: CN111479107A (application CN202010171587.8A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- H04N17/00 — Diagnosis, testing or measuring for television systems or their details
- H04N21/233 — Processing of audio elementary streams (server side)
- H04N21/23418 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics (server side)
- H04N21/4394 — Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams (client side)
- H04N21/44008 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream (client side)
- H04N21/4756 — End-user interface for inputting end-user data for rating content, e.g. scoring a recommended movie
Abstract
The invention provides a no-reference audio-video joint quality evaluation method based on natural audio and video statistics. A natural video statistical model is extended to natural audio statistics, and a natural audio-video joint statistical model is then constructed, thereby realizing no-reference audio-video joint quality evaluation based on natural audio-video statistics. The method comprises the following steps: constructing a natural video statistical model; extending the natural video statistical model to natural audio statistics; constructing a natural audio-video joint statistical model from the natural video statistical model and the natural audio statistical model; extracting audio and video quality features based on the natural audio statistics, the natural video statistics, and the natural audio-video joint statistics; and performing feature regression to obtain the final joint audio-video quality estimate. The method can effectively estimate the joint quality of an audio-video signal under test when the original (reference) audio-video signal is unknown.
Description
Technical Field
The invention relates to the technical field of multimedia quality evaluation, and in particular to a no-reference audio-video joint quality evaluation method based on a natural audio and video statistical model.
Background
In recent years, multimedia quality evaluation has attracted attention from many researchers in the fields of audio processing, video processing, and the like. According to the type of signal to be evaluated, multimedia quality evaluation can be divided into image/video quality assessment and audio quality assessment. Over the past decades, researchers have proposed many objective visual quality assessment algorithms. A search of the prior art finds the following:
Lin and Kuo surveyed perceptual visual quality metrics in W. Lin and C.-C. J. Kuo, "Perceptual visual quality metrics: A survey," Journal of Visual Communication and Image Representation, vol. 22, no. 4, pp. 297–312, 2011. Wang and Bovik examined signal fidelity measures in Z. Wang and A. C. Bovik, "Mean squared error: Love it or leave it? A new look at signal fidelity measures," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009, and reviewed reduced- and no-reference approaches in Z. Wang and A. C. Bovik, "Reduced- and no-reference image quality assessment," IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 29–40, 2011.
Although the quality evaluation techniques described above achieve encouraging results, they mostly evaluate multimedia signals of a single modality, such as a single image, video, or audio signal, ignoring the interaction and fusion between audiovisual multimodal signals. Compared with the extensive research on single-modality quality evaluation, audio-video cross-modality quality evaluation has received less attention, although multimodal audio-video signals are closer to practical application scenarios. A review of audio-video quality assessment is given by You et al. in J. You, U. Reiter, M. M. Hannuksela, M. Gabbouj, and A. Perkis, "Perceptual-based quality assessment for audio-visual services: A survey," Signal Processing: Image Communication, vol. 25, no. 7, pp. 482–501, 2010. Such evaluation techniques generally require fundamental studies of multimodal perception, i.e. of the interaction between audiovisual signals and of other factors affecting audiovisual quality evaluation, and these studies are usually carried out through audiovisual experiments. In general, these techniques are not based on content analysis but estimate audio-video quality directly from parameters such as bit rate and encoder type, so their application scenarios are very limited.
Alan Bovik et al, in A.K. Moorthy and A.C. Bovik, "bland image quality assessment From natural scene status to quality," IEEETrans.image Process, vol.20, No.12, pp.3350-3364, Dec.2011, M.A.Saad, A.C. Bovik, and C.Charrrier, "bland image quality assessment, A.natural scene status assessment of processing in the DCT domain," IEEE.image Process, Bovik.21, 8, pp.3339-3352, Aumaig.2012, and A.Mittal, A.K. molecular and A.C. Bovik, "Bovik-image," De.12, IEEE.12, and "video quality assessment" 1.12, 1,2, 1, A.Mittal, A.K.C. image, and B.5, B.1, A.C. Bovik, and B.1, C.1, C.465, C.1, C. Bovik, P.4, P.1, C. Observation, 2, and C.1, C. 4, C. for evaluating the quality statistics of the video quality of the video statistics of the video quality of the video. However, the natural video statistical model in the above method is only applicable to images and videos, and the designed method is only applicable to images and videos.
At present, there is no research or method that extends a natural video statistical model to audio and further constructs a natural audio-video joint statistical model, so as to realize no-reference audio-video joint quality evaluation based on natural audio-video statistics.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a no-reference audio-video joint quality evaluation method based on a natural audio and video statistical model (natural audio-video statistics for short): a natural video statistical model is extended to natural audio statistics, and a natural audio-video joint statistical model is then constructed, thereby realizing no-reference audio-video joint quality evaluation based on natural audio-video statistics.
The invention is realized by the following technical scheme.
A no-reference audio and video joint quality evaluation method based on natural audio and video statistics comprises the following steps:
s1: constructing a natural video statistical model for an input video signal, wherein the natural video statistical model is used for performing statistical modeling on a video;
s2: popularizing the natural video statistical model obtained in the S1 into natural audio statistics, and constructing a natural audio statistical model for the input audio signal, wherein the natural audio statistical model is used for carrying out statistical modeling on audio;
s3: constructing a natural audio and video joint statistical model by using the natural video statistical model obtained in the S1 and the natural audio statistical model obtained in the S2, wherein the natural audio and video joint statistical model is used for performing joint statistical modeling on videos and audios;
s4: respectively extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model;
s5: and performing characteristic regression operation on the audio and video quality characteristics obtained in the S4 to obtain the final audio and video joint quality estimation.
Preferably, in S1, regularization processing is performed on the input video signal, and a natural video statistical model of a spatial domain is constructed by using the regularized video signal; the method comprises the following steps:
carrying out regularization processing on an input video signal:
wherein I (I, j) is an original video signal,for the video signal after regularization, i, j is the pixel index, c is a constant set according to the dynamic range of the video signal for keeping the division stable; μ (i, j) and σ (i, j) represent the local mean and standard deviation, respectively, of the video signal:
in the formula, wk,lK-K, …, K, l-L, …, L represent a two-dimensional local gaussian window;
wherein f (x; α, sigma)2) A probability density function representing values of pixels of the regularized video signal; x represents a regularized video signalα denotes a parameter for controlling the shape of the distribution, sigma denotes a parameterNumber, σ2Variance for the control distribution; (. cndot.) represents the gamma function:
β denotes the following parameters;
describing two samples adjacent to the video signal after regularization by adopting asymmetric generalized Gaussian distributionAndthe product between, i.e.
In the formula (I), the compound is shown in the specification,representing a probability density function to which products of adjacent pixel values of the regularized video signal obey, v representing a parameter for controlling the shape of the distribution; sigmalIt is indicated that one of the parameters,variance for controlling left-hand distribution; sigmarIt is indicated that one of the parameters,for controlling the variance of the right distribution βlAnd βrThe following parameters are indicated:
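A numerical sketch of this step: local mean/std regularization of a frame followed by moment-matching estimation of the generalized-Gaussian parameters. The 7×7 separable Gaussian window, the constant c = 1, and all function names are illustrative assumptions, not values specified by the patent.

```python
import numpy as np
from math import gamma

def _gauss_window(K=3, s=7.0 / 6.0):
    t = np.arange(-K, K + 1)
    w = np.exp(-t * t / (2.0 * s * s))
    return w / w.sum()

def _filter_sep(img, w):
    # Separable Gaussian filtering with edge padding, one axis at a time.
    pad = len(w) // 2
    f = lambda v: np.convolve(np.pad(v, pad, mode='edge'), w, 'valid')
    return np.apply_along_axis(f, 0, np.apply_along_axis(f, 1, img))

def regularize_frame(frame, c=1.0):
    """Divisive normalization: subtract the local Gaussian-weighted mean
    and divide by the local standard deviation plus a constant c."""
    frame = np.asarray(frame, dtype=np.float64)
    w = _gauss_window()
    mu = _filter_sep(frame, w)
    var = _filter_sep(frame * frame, w) - mu * mu
    return (frame - mu) / (np.sqrt(np.maximum(var, 0.0)) + c)

def fit_ggd(x):
    """Moment-matching GGD fit: choose the shape alpha whose theoretical
    ratio Gamma(1/a)Gamma(3/a)/Gamma(2/a)^2 matches E[x^2] / E[|x|]^2."""
    x = np.ravel(x)
    sigma_sq = float(np.mean(x * x))
    rho = sigma_sq / (np.mean(np.abs(x)) ** 2)
    alphas = np.arange(0.2, 10.0, 0.01)
    r = np.array([gamma(1 / a) * gamma(3 / a) / gamma(2 / a) ** 2 for a in alphas])
    return float(alphas[np.argmin((r - rho) ** 2)]), sigma_sq
```

For Gaussian-distributed input the fitted shape parameter should be close to α = 2, since the regularized values of pristine natural video are approximately Gaussian.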
preferably, in S2, regularization processing is performed on the input audio signal, and a natural audio statistical model is constructed by using the regularized audio signal; the method comprises the following steps:
regularizing an input audio signal:
wherein a (t) is an original audio signal,for the audio signal after regularization, t is a time sequence index, and k is a constant for keeping the division equation stable, which is set according to the dynamic range of the audio signal; μ (t) and σ (t) represent the local mean and standard deviation, respectively, of the audio signal:
in the formula, wττ ═ T, …, T representing a groupA one-dimensional local Gaussian window;
regularized audio signals using natural audio statistical propertiesModeling a natural audio statistical model:
wherein f (x; α, sigma)2) Representing a probability density function to which sample values of the regularized audio signal are subjected; x represents a regularized audio signalα represents a parameter for controlling the shape of the distribution, sigma represents a parameter, sigma2Variance for the control distribution; (. cndot.) represents the gamma function:
β denotes the following parameters:
describing two samples adjacent to the audio signal after regularization by adopting asymmetric generalized Gaussian distributionAndthe product between, i.e.
In the formula (I), the compound is shown in the specification,representing a probability density function to which products of adjacent sample values of the regularized audio signal obey, v representing a parameter for controlling the shape of the distribution; sigmalIt is indicated that one of the parameters,variance for controlling left-hand distribution; sigmarIt is indicated that one of the parameters,for controlling the variance of the right distribution βlAnd βrThe following parameters are indicated:
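A corresponding one-dimensional sketch for audio: local mean/std regularization of the waveform and a moment-matching fit of the asymmetric generalized Gaussian. In the method, `fit_aggd` would be applied to products of adjacent regularized samples; the window half-length T, constant k, and function names are assumptions.

```python
import numpy as np
from math import gamma

def regularize_audio(a, T=16, k=1e-3):
    """1-D analogue of the video regularization: local Gaussian-weighted
    mean/std normalization of the waveform (window half-length T)."""
    a = np.asarray(a, dtype=np.float64)
    t = np.arange(-T, T + 1)
    w = np.exp(-t * t / (2.0 * (T / 3.0) ** 2))
    w /= w.sum()
    pad = np.pad(a, T, mode='edge')
    mu = np.convolve(pad, w, 'valid')
    var = np.convolve(pad * pad, w, 'valid') - mu * mu
    return (a - mu) / (np.sqrt(np.maximum(var, 0.0)) + k)

def fit_aggd(x):
    """Moment-matching AGGD fit, returning (nu, sigma_l^2, sigma_r^2):
    left/right standard deviations from the negative/non-negative halves,
    then a grid search on the shape nu via the standard moment ratio."""
    x = np.ravel(x)
    sl = np.sqrt(np.mean(x[x < 0] ** 2))
    sr = np.sqrt(np.mean(x[x >= 0] ** 2))
    g = sl / sr
    rhat = np.mean(np.abs(x)) ** 2 / np.mean(x * x)
    Rhat = rhat * (g ** 3 + 1) * (g + 1) / (g ** 2 + 1) ** 2
    nus = np.arange(0.2, 10.0, 0.01)
    rho = np.array([gamma(2 / v) ** 2 / (gamma(1 / v) * gamma(3 / v)) for v in nus])
    nu = float(nus[np.argmin((rho - Rhat) ** 2)])
    return nu, float(sl ** 2), float(sr ** 2)
```

For symmetric Gaussian input the fit should return a shape close to ν = 2 with nearly equal left and right variances.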
preferably, in S3, constructing a natural audio and video joint statistical model includes:
for each pixel of each frame of the video signal after regularization, randomly selecting a sample from the most adjacent section of audio clip of the frame video to pair with each pixel in pairs to form a sample pair; and carrying out regularization treatment on the sample pairs, and constructing a natural audio and video joint statistical model by using the sample pairs subjected to regularization treatment.
Preferably, in S3, the natural audio and video joint statistical model is described using a two-dimensional generalized Gaussian distribution:

f_x(x; s, Σ) = [Γ(d/2)·s / (π^{d/2} Γ(d/(2s)) 2^{d/(2s)} |Σ|^{1/2})] · exp(−½ (xᵀ Σ⁻¹ x)^s)

wherein f_x(x; s, Σ) is the probability density function obeyed by the regularized sample pairs, x is a regularized sample pair, s is the shape parameter, Σ is the scale parameter, d is the dimension of x (here d = 2), and Γ(·) is the gamma function:

Γ(z) = ∫₀^∞ t^{z−1} e^{−t} dt,  z > 0;

the parameter s is a scalar and the parameter Σ is a 2 × 2 matrix.

The sample pairs formed by adjacent video pixel products and adjacent audio sample products after regularization are likewise described by a two-dimensional generalized Gaussian distribution of the same form, with x then denoting such a product pair.
wherein: the sample pairs formed by the video pixel products and the audio sample products are distributed in four quadrants of the distribution formed by the regularized sample pairs; the four quadrants are respectively: the neighboring video pixel product is greater than zero and the neighboring audio sample product is greater than zero, the video pixel product is greater than zero and the neighboring audio sample product is less than zero, the video pixel product is less than zero and the neighboring audio sample product is greater than zero, the video pixel product is less than zero and the neighboring audio sample product is less than zero.
Preferably, in S4, extracting the audio quality features based on the natural audio statistical model includes:

extracting, from the natural audio statistical model, distribution parameters describing the audio quality: the shape parameter α and variance parameter σ² of the generalized Gaussian distribution, and the shape parameter ν, left variance parameter σ_l², right variance parameter σ_r², and mean parameter η of the asymmetric generalized Gaussian distribution, wherein:

η = (β_r − β_l) · Γ(2/ν) / Γ(1/ν).
preferably, in S4, the extracting the video quality feature based on the natural video statistical model includes:
extracting distribution parameters for describing video quality from a natural video statistical model, wherein the shape parameter α and the variance parameter sigma in the generalized Gaussian distribution2Shape parameter v and left difference parameter in asymmetric generalized Gaussian distribution for describing video qualityRight variance parameterAnd its mean parameter η is used to describe the video quality;
wherein:
preferably, in S4, extracting an audio-video joint feature of the natural audio-video joint statistical model includes:
extracting joint distribution parameters for describing audio and video quality from a natural audio and video joint statistical model; the two-dimensional generalized Gaussian-distributed shape parameter s and the scale parameter Σ are used to describe the quality of audio and video.
Preferably, S4 further includes: down-sampling the input audio signal and then extracting the audio quality features at a plurality of scales; and/or

differencing adjacent video frames and/or adjacent audio samples and then extracting the video quality features and/or audio quality features, respectively.
Preferably, in S5, feature regression is performed on all the audio and video quality features extracted in S4 to obtain a single quality score describing the joint audio-video quality; the feature regression adopts a machine-learning feature fusion method or a deep-learning (neural network) feature fusion method.
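As a stand-in for the feature-regression step, the sketch below uses a closed-form ridge regression; the patent allows any machine-learning fusion (e.g. a support vector machine or random forest) or a deep-learning method, so this is only one hypothetical choice.

```python
import numpy as np

def train_quality_regressor(feats, scores, lam=1e-3):
    """Closed-form ridge regression as a stand-in for S5's feature
    regression; returns a weight vector whose last entry is the bias."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ scores)

def predict_quality(w, feat):
    """Map one audio-video feature vector to a single quality score."""
    return float(np.append(feat, 1.0) @ w)
```

In practice `feats` would be the matrix of per-sequence natural-statistics features and `scores` the subjective quality ratings used for training.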
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a no-reference audio and video joint quality evaluation method based on natural audio and video statistics, which is inspired by a visual quality evaluation method based on natural video statistics, and realizes no-reference audio and video joint quality evaluation based on natural audio and video statistics by popularizing a related natural video statistical model to natural audio statistics and further constructing a natural audio and video joint statistical model; the method for evaluating the joint quality of the audio and video signals without the reference based on the natural audio and video statistics can effectively estimate the joint quality of the audio and video signals to be measured under the condition that the original audio and video signals are unknown.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a general flowchart of a non-reference audio/video joint quality evaluation method based on natural audio/video statistics according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a natural audio/video joint statistical model construction provided in an embodiment of the present invention;
FIG. 3 is a distribution diagram of normalized audio and video signal sample pairs of different compression levels according to an embodiment of the present invention;
FIG. 4 is a sample pair distribution diagram of neighboring video pixel products and neighboring audio sample products according to an embodiment of the present invention.
Detailed Description
The following examples illustrate the invention in detail. The embodiments are implemented on the premise of the technical solution of the invention, and detailed implementation modes and specific operation processes are given. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention.
The invention provides a no-reference audio and video joint quality evaluation method based on a natural audio and video statistical model, which comprises the following steps:
the first step is as follows: constructing a natural video statistical model for an input video signal, wherein the natural video statistical model is used for performing statistical modeling on the video signal;
the second step is that: the natural video statistical model obtained in the first step is popularized to natural audio statistics, and a natural audio statistical model is constructed for the input audio signals and used for carrying out statistical modeling on the audio signals;
the third step: constructing a natural audio and video joint statistical model by using the natural video statistical model obtained in the first step and the natural audio statistical model obtained in the second step, wherein the natural audio and video joint statistical model is used for performing joint statistical modeling on a video signal and an audio signal;
the fourth step: respectively extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model;
the fifth step: and performing characteristic regression operation on the audio and video quality characteristics obtained in the fourth step to obtain the final audio and video joint quality estimation.
The detailed steps of the no-reference audio/video joint quality evaluation method based on natural audio/video statistics provided by the embodiment of the invention are further explained below with reference to the accompanying drawings.
As shown in fig. 1, a method provided in an embodiment of the present invention includes:
firstly, constructing a natural video statistical model
The natural video statistical model constructed by the embodiment of the invention is a natural video statistical model of a spatial domain, and the specific process comprises the following steps: regularizing the input video signal and performing natural video statistical modeling by using the regularized video signal.
The regularization processing of the input video signal is:

Î(i,j) = [I(i,j) − μ(i,j)] / [σ(i,j) + c]

wherein I(i,j) is the original video signal, Î(i,j) is the regularized video signal, (i,j) is the pixel index, c is a constant set according to the dynamic range of the video signal to keep the division stable, and μ(i,j) and σ(i,j) are the local mean and standard deviation of the video signal, respectively:

μ(i,j) = Σ_{k=−K..K} Σ_{l=−L..L} w_{k,l} I(i+k, j+l)
σ(i,j) = sqrt( Σ_{k=−K..K} Σ_{l=−L..L} w_{k,l} [I(i+k, j+l) − μ(i,j)]² )

wherein w_{k,l}, k = −K,…,K, l = −L,…,L, is a two-dimensional local Gaussian window.

The statistical modeling of the natural video statistical model using the regularized video signal Î is as follows. After the above regularization, Î of pristine natural video generally obeys a Gaussian distribution, while video distortion forces Î to deviate from this Gaussian distribution; both the Gaussian distribution of natural video and the distribution of distorted video (i.e. the video to be tested) can be described by a generalized Gaussian distribution:

f(x; α, σ²) = [α / (2βΓ(1/α))] · exp(−(|x|/β)^α)

wherein

β = σ·sqrt(Γ(1/α)/Γ(3/α))

and Γ(·) is the gamma function

Γ(z) = ∫₀^∞ t^{z−1} e^{−t} dt,  z > 0,

where α controls the shape of the distribution and σ² controls its variance.

Besides Î itself, the product of two adjacent samples of the regularized video signal, Î(i,j)·Î(i,j+1), also obeys an asymmetric generalized Gaussian distribution:

f(x; ν, σ_l², σ_r²) = [ν / ((β_l + β_r)Γ(1/ν))] · exp(−(−x/β_l)^ν)  for x < 0
f(x; ν, σ_l², σ_r²) = [ν / ((β_l + β_r)Γ(1/ν))] · exp(−(x/β_r)^ν)   for x ≥ 0

wherein

β_l = σ_l·sqrt(Γ(1/ν)/Γ(3/ν)),  β_r = σ_r·sqrt(Γ(1/ν)/Γ(3/ν)),

the shape parameter ν controls the shape of the distribution, and σ_l² and σ_r² control the variances of the left and right sides, respectively.
Secondly, extending the natural video statistical model to natural audio statistics

The specific process of extending the natural video statistical model to natural audio statistics includes regularizing the input audio signal and performing natural audio statistical modeling using the regularized audio signal.

The regularization processing of the input audio signal is:

â(t) = [a(t) − μ(t)] / [σ(t) + c]

wherein a(t) is the original audio signal, â(t) is the regularized audio signal, t is the time-sequence index, c is a constant set according to the dynamic range of the audio signal to keep the division stable, and μ(t) and σ(t) are the local mean and standard deviation of the audio signal, respectively:

μ(t) = Σ_{τ=−T..T} w_τ a(t+τ)
σ(t) = sqrt( Σ_{τ=−T..T} w_τ [a(t+τ) − μ(t)]² )

wherein w_τ, τ = −T,…,T, is a one-dimensional local Gaussian window.

The statistical modeling of the natural audio statistical model using the regularized audio signal â is as follows. After the above regularization, â of pristine natural audio generally obeys a Gaussian distribution, while audio distortion forces â to deviate from this Gaussian distribution; both the Gaussian distribution of natural audio and the distribution of distorted audio can be described by a generalized Gaussian distribution:

f(x; α, σ²) = [α / (2βΓ(1/α))] · exp(−(|x|/β)^α),  β = σ·sqrt(Γ(1/α)/Γ(3/α))

wherein Γ(·) is the gamma function

Γ(z) = ∫₀^∞ t^{z−1} e^{−t} dt,  z > 0,

α controls the shape of the distribution, and σ² controls its variance.

Besides â itself, the product of two adjacent regularized samples, â(t)·â(t+1), also obeys an asymmetric generalized Gaussian distribution:

f(x; ν, σ_l², σ_r²) = [ν / ((β_l + β_r)Γ(1/ν))] · exp(−(−x/β_l)^ν)  for x < 0
f(x; ν, σ_l², σ_r²) = [ν / ((β_l + β_r)Γ(1/ν))] · exp(−(x/β_r)^ν)   for x ≥ 0

wherein

β_l = σ_l·sqrt(Γ(1/ν)/Γ(3/ν)),  β_r = σ_r·sqrt(Γ(1/ν)/Γ(3/ν)),

the shape parameter ν controls the shape, and σ_l² and σ_r² control the variances of the left and right sides, respectively.
Thirdly, constructing a natural audio and video combined statistical model by utilizing a natural video statistical model and a natural audio statistical model
The specific process of constructing the natural audio and video joint statistical model by using the natural video statistical model and the natural audio statistical model is as follows:
as shown in fig. 2, for each pixel in each frame of video, a sample is randomly selected from the most adjacent segment of audio samples of the video frame and paired with the pixel to form a sample pair, the sample pair is regularized, and a natural audio and video joint statistical model is constructed by using the regularized sample pair.
Specifically, the regularized audio and video signal sample pairs generally follow a two-dimensional Gaussian distribution, and audio and video distortion forces the distribution of the regularized sample pairs to deviate from the two-dimensional Gaussian distribution; both the two-dimensional Gaussian distribution of natural audio and video and the distribution of distorted audio and video can be described by a two-dimensional generalized Gaussian distribution:

$$f_{\mathbf{x}}(\mathbf{x};s,\Sigma)=\frac{s\,\Gamma(d/2)}{\pi^{d/2}\,\Gamma\!\left(\frac{d}{2s}\right)2^{d/(2s)}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}\left(\mathbf{x}^{T}\Sigma^{-1}\mathbf{x}\right)^{s}\right)$$

where $s$ is a shape parameter, $\Sigma$ is a scale parameter, $d$ represents the dimension of $\mathbf{x}$, and $\Gamma(\cdot)$ represents the gamma function.
In the embodiment of the invention, the parameter $s$ of the two-dimensional generalized Gaussian distribution is a scalar and the parameter $\Sigma$ is a $2\times 2$ matrix. The distribution of the regularized audio and video signal sample pairs is shown in fig. 3; it can be seen that the two-dimensional generalized Gaussian distribution describes this distribution very well.
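Assuming the standard multivariate generalized Gaussian normalization (the patent shows the fitted distribution only as a figure), the two-dimensional density can be evaluated as follows; for s = 1 it reduces to the ordinary two-dimensional Gaussian:

```python
import numpy as np
from scipy.special import gamma

def mggd_pdf(x, s, Sigma):
    """Multivariate generalized Gaussian density; x has shape (n, d)."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    Sinv = np.linalg.inv(Sigma)
    q = np.einsum('ni,ij,nj->n', x, Sinv, x)  # x^T Sigma^{-1} x per row
    norm = (s * gamma(d / 2.0)
            / (np.pi ** (d / 2.0) * gamma(d / (2.0 * s))
               * 2.0 ** (d / (2.0 * s)) * np.sqrt(np.linalg.det(Sigma))))
    return norm * np.exp(-0.5 * q ** s)
```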
In addition to the regularized sample pairs obeying a two-dimensional generalized Gaussian distribution, as shown in fig. 4, the sample pairs formed by products of adjacent video pixels and products of adjacent audio samples also obey a regular distribution. This distribution can be described by a two-dimensional generalized Gaussian distribution within each of its four quadrants, namely: adjacent video pixel product greater than zero and adjacent audio sample product greater than zero; video pixel product greater than zero and audio sample product less than zero; video pixel product less than zero and audio sample product greater than zero; and video pixel product less than zero and audio sample product less than zero.
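The quadrant-wise modeling can be sketched as below: the sign pattern of each (video-product, audio-product) pair selects one of the four quadrants, and a separate two-dimensional fit is run per quadrant. The `fit` callable and all names are placeholders, not the patent's implementation:

```python
import numpy as np

def quadrant_features(vprod, aprod, fit):
    """Split (video-product, audio-product) pairs into the four sign
    quadrants and apply a per-quadrant fitting callable `fit(v, a)`."""
    feats = []
    for sv, sa in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
        mask = (sv * vprod > 0) & (sa * aprod > 0)
        feats.append(fit(vprod[mask], aprod[mask]))
    return feats
```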
Fourthly, extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model
First, distribution parameters capable of describing audio quality need to be extracted from the generalized Gaussian distribution obeyed by the regularized audio signal $\hat{a}(t)$ obtained in the second step and the asymmetric generalized Gaussian distribution obeyed by $p(t)$. The shape parameter $\alpha$ and the variance parameter $\sigma^{2}$ of the generalized Gaussian distribution, together with the shape parameter $\nu$, left variance parameter $\sigma_{l}^{2}$, right variance parameter $\sigma_{r}^{2}$, and the following mean parameter of the asymmetric generalized Gaussian distribution, can describe the audio quality:

$$\eta=(\beta_{r}-\beta_{l})\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}$$
Secondly, distribution parameters capable of describing video quality need to be extracted from the generalized Gaussian distribution obeyed by the regularized video signal $\hat{I}(i,j)$ obtained in the first step and the asymmetric generalized Gaussian distributions obeyed by the adjacent-pixel products $H(i,j)$, $V(i,j)$, $D_{1}(i,j)$, and $D_{2}(i,j)$. The shape parameter $\alpha$ and the variance parameter $\sigma^{2}$ of the generalized Gaussian distribution, together with the shape parameter $\nu$, left variance parameter $\sigma_{l}^{2}$, right variance parameter $\sigma_{r}^{2}$, and the following mean parameter of the asymmetric generalized Gaussian distribution, can describe the video quality:

$$\eta=(\beta_{r}-\beta_{l})\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}$$
Finally, distribution parameters capable of describing audio and video quality are extracted from the two-dimensional generalized Gaussian distribution obeyed by the regularized audio and video signal sample pairs, and from the two-dimensional generalized Gaussian distributions obeyed, in each of the four quadrants, by the sample pairs formed by adjacent video pixel products and adjacent audio sample products. The shape parameter $s$ and the scale parameter $\Sigma$ of each two-dimensional generalized Gaussian distribution can describe the audio and video quality.
Fifthly, performing characteristic regression to obtain final audio and video joint quality estimation
And finally, performing regression on all audio and video quality features based on the natural audio statistical model, the natural video statistical model, and the natural audio and video joint statistical model in the fourth step to obtain a single quality score describing the audio and video joint quality, wherein the audio and video quality feature regression can be a simple machine learning feature fusion method such as a support vector machine or a random forest, or a more complex deep learning feature fusion method such as a neural network.
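As a minimal stand-in for the support-vector or random-forest regressor named above (whose configuration the patent leaves open), a least-squares fusion of the feature vector into a single quality score can be sketched as:

```python
import numpy as np

def train_quality_regressor(features, mos):
    """Least-squares stand-in for the SVR / random-forest regressor:
    fits weights mapping quality features to subjective scores (MOS)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # add bias
    w, *_ = np.linalg.lstsq(X, mos, rcond=None)
    return w

def predict_quality(features, w):
    """Apply the learned weights to produce a single quality score."""
    F = np.atleast_2d(features)
    X = np.hstack([F, np.ones((F.shape[0], 1))])
    return X @ w
```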
The implementation effect is as follows:
In order to verify the no-reference audio and video joint quality evaluation method based on natural audio and video statistics provided by the above embodiment of the present invention, the algorithm may be tested on the LIVE-SJTU Audio and Video Quality Assessment (A/V-QA) Database. The LIVE-SJTU A/V-QA Database is an audio and video quality evaluation database containing 336 distorted audio-video sequences generated from 14 high-quality reference sequences using 24 audio-video distortion types/degrees, wherein the 24 distortion conditions comprise all combinations of two video distortion types (compression, and compression plus scaling, each with four levels of distortion) and one audio distortion type (compression, with three levels of distortion).
The test utilizes 80% of the data in the LIVE-SJTU A/V-QA database for training and the remaining 20% for testing. The training-testing procedure can be randomly repeated 1000 times, and the median Spearman rank-order correlation coefficient (SRCC) over the 1000 trials can be used as the performance test result of the algorithm.
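The evaluation protocol described above (repeated random 80%/20% splits, median SRCC) can be sketched as follows; `train_fn` stands in for whatever regressor is used, and the helper name is invented:

```python
import numpy as np
from scipy.stats import spearmanr

def median_srcc(X, y, train_fn, n_trials=1000, train_frac=0.8, seed=0):
    """Repeat random train/test splits and report the median SRCC.
    `train_fn(X_tr, y_tr)` must return a predictor f(X_te) -> scores."""
    rng = np.random.default_rng(seed)
    srccs = []
    n = len(y)
    for _ in range(n_trials):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        predictor = train_fn(X[tr], y[tr])
        srccs.append(spearmanr(predictor(X[te]), y[te])[0])
    return float(np.median(srccs))
```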
The no-reference audio and video joint quality evaluation method based on the natural audio and video statistical model provided by the embodiment of the invention comprises the five steps of constructing the natural video statistical model, extending the natural video statistical model to natural audio statistics, constructing the natural audio and video joint statistical model by using the natural video statistical model and the natural audio statistical model, extracting audio and video quality features based on the natural audio statistical model, the natural video statistical model, and the natural audio and video joint statistical model, and performing feature regression to obtain the final audio and video joint quality estimation, so that the joint quality of audio and video can be effectively evaluated. By extending the related natural video statistical model to natural audio statistics and further constructing a natural audio and video joint statistical model, the method realizes no-reference audio and video joint quality evaluation based on natural audio and video statistics.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.
Claims (10)
1. A no-reference audio and video joint quality evaluation method based on natural audio and video statistics is characterized by comprising the following steps:
s1: constructing a natural video statistical model for an input video signal, wherein the natural video statistical model is used for performing statistical modeling on a video;
s2: extending the natural video statistical model obtained in S1 to natural audio statistics, and constructing a natural audio statistical model for the input audio signal, wherein the natural audio statistical model is used for carrying out statistical modeling on audio;
s3: constructing a natural audio and video joint statistical model by using the natural video statistical model obtained in the S1 and the natural audio statistical model obtained in the S2, wherein the natural audio and video joint statistical model is used for performing joint statistical modeling on videos and audios;
s4: respectively extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model;
s5: and performing characteristic regression operation on the audio and video quality characteristics obtained in the S4 to obtain the final audio and video joint quality estimation.
2. The method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in claim 1, wherein in S1, input video signals are regularized, and a natural video statistical model of a spatial domain is constructed by using the regularized video signals; the method comprises the following steps:
carrying out regularization processing on an input video signal:

$$\hat{I}(i,j)=\frac{I(i,j)-\mu(i,j)}{\sigma(i,j)+c}$$

wherein $I(i,j)$ is the original video signal, $\hat{I}(i,j)$ is the regularized video signal, $i,j$ are the pixel indices, and $c$ is a constant set according to the dynamic range of the video signal for keeping the division stable; $\mu(i,j)$ and $\sigma(i,j)$ represent the local mean and standard deviation, respectively, of the video signal:

$$\mu(i,j)=\sum_{k=-K}^{K}\sum_{l=-L}^{L}w_{k,l}\,I(i+k,j+l)$$

$$\sigma(i,j)=\sqrt{\sum_{k=-K}^{K}\sum_{l=-L}^{L}w_{k,l}\left(I(i+k,j+l)-\mu(i,j)\right)^{2}}$$

in the formulas, $w_{k,l}$, $k=-K,\ldots,K$, $l=-L,\ldots,L$, represents a two-dimensional local Gaussian window;
modeling the regularized video signal $\hat{I}(i,j)$ by utilizing natural video statistical characteristics:

$$f(x;\alpha,\sigma^{2})=\frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\!\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

wherein $f(x;\alpha,\sigma^{2})$ represents the probability density function obeyed by the pixel values of the regularized video signal; $x$ represents the regularized video signal $\hat{I}(i,j)$; $\alpha$ denotes the parameter controlling the shape of the distribution; $\sigma^{2}$ denotes the variance of the distribution; $\Gamma(\cdot)$ represents the gamma function:

$$\Gamma(z)=\int_{0}^{\infty}t^{z-1}e^{-t}\,dt,\quad z>0$$

and $\beta$ denotes the following parameter:

$$\beta=\sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}};$$
describing the product between two adjacent samples of the regularized video signal by adopting an asymmetric generalized Gaussian distribution:

$$f(x;\nu,\sigma_{l}^{2},\sigma_{r}^{2})=\begin{cases}\dfrac{\nu}{(\beta_{l}+\beta_{r})\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{-x}{\beta_{l}}\right)^{\nu}\right), & x<0\\[1ex]\dfrac{\nu}{(\beta_{l}+\beta_{r})\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{x}{\beta_{r}}\right)^{\nu}\right), & x\geq 0\end{cases}$$

in the formula, $f(x;\nu,\sigma_{l}^{2},\sigma_{r}^{2})$ represents the probability density function obeyed by the products of adjacent pixel values of the regularized video signal; $\nu$ represents the parameter controlling the shape of the distribution; $\sigma_{l}^{2}$ is used for controlling the variance of the left side of the distribution; $\sigma_{r}^{2}$ is used for controlling the variance of the right side of the distribution; $\beta_{l}$ and $\beta_{r}$ denote the following parameters:

$$\beta_{l}=\sigma_{l}\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}},\qquad\beta_{r}=\sigma_{r}\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}};$$
3. the no-reference audio-video joint quality evaluation method based on natural audio-video statistics as claimed in claim 1, wherein in S2, input audio signals are regularized, and a natural audio statistical model is constructed using the regularized audio signals; the method comprises the following steps:
regularizing an input audio signal:

$$\hat{a}(t)=\frac{a(t)-\mu(t)}{\sigma(t)+k}$$

wherein $a(t)$ is the original audio signal, $\hat{a}(t)$ is the regularized audio signal, $t$ is the time-sequence index, and $k$ is a constant set according to the dynamic range of the audio signal for keeping the division stable; $\mu(t)$ and $\sigma(t)$ represent the local mean and standard deviation, respectively, of the audio signal:

$$\mu(t)=\sum_{\tau=-T}^{T}w_{\tau}\,a(t+\tau)$$

$$\sigma(t)=\sqrt{\sum_{\tau=-T}^{T}w_{\tau}\left(a(t+\tau)-\mu(t)\right)^{2}}$$

in the formulas, $w_{\tau}$, $\tau=-T,\ldots,T$, denotes a one-dimensional local Gaussian window;
modeling the regularized audio signal $\hat{a}(t)$ by utilizing natural audio statistical properties:

$$f(x;\alpha,\sigma^{2})=\frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\!\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

wherein $f(x;\alpha,\sigma^{2})$ represents the probability density function obeyed by the sample values of the regularized audio signal; $x$ represents the regularized audio signal $\hat{a}(t)$; $\alpha$ denotes the parameter controlling the shape of the distribution; $\sigma^{2}$ denotes the variance of the distribution; $\Gamma(\cdot)$ represents the gamma function:

$$\Gamma(z)=\int_{0}^{\infty}t^{z-1}e^{-t}\,dt,\quad z>0$$

and $\beta$ denotes the following parameter:

$$\beta=\sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}};$$
describing the product between two adjacent samples $\hat{a}(t)$ and $\hat{a}(t+1)$ of the regularized audio signal by adopting an asymmetric generalized Gaussian distribution:

$$f(x;\nu,\sigma_{l}^{2},\sigma_{r}^{2})=\begin{cases}\dfrac{\nu}{(\beta_{l}+\beta_{r})\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{-x}{\beta_{l}}\right)^{\nu}\right), & x<0\\[1ex]\dfrac{\nu}{(\beta_{l}+\beta_{r})\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{x}{\beta_{r}}\right)^{\nu}\right), & x\geq 0\end{cases}$$

in the formula, $f(x;\nu,\sigma_{l}^{2},\sigma_{r}^{2})$ represents the probability density function obeyed by the products of adjacent sample values of the regularized audio signal; $\nu$ represents the parameter controlling the shape of the distribution; $\sigma_{l}^{2}$ is used for controlling the variance of the left side of the distribution; $\sigma_{r}^{2}$ is used for controlling the variance of the right side of the distribution; $\beta_{l}$ and $\beta_{r}$ denote the following parameters:

$$\beta_{l}=\sigma_{l}\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}},\qquad\beta_{r}=\sigma_{r}\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}};$$
4. the method for evaluating the joint quality of the non-reference audio and video based on the natural audio and video statistics as claimed in claim 1, wherein in the step S3, a natural audio and video joint statistical model is constructed, which comprises:
for each pixel of each frame of the regularized video signal, randomly selecting a sample from the audio segment nearest to that frame and pairing it with the pixel to form a sample pair; and carrying out regularization processing on the sample pairs, and constructing the natural audio and video joint statistical model by using the regularized sample pairs.
5. The method for evaluating the joint quality of the audio and video without reference based on the natural audio and video statistics as claimed in claim 4, wherein in S3, a two-dimensional generalized Gaussian distribution is adopted to describe the joint statistical model of the natural audio and video:
in the formula (f)x(x; s, Σ) represents a probability density function to which the regularized sample pair is subjected, x represents the regularized sample pair, s is a shape parameter, Σ is a scale parameter, and d represents the dimension of x; (. cndot.) represents the gamma function:
where parameter s is a scalar and parameter Σ is a matrix of 2 × 2;
describing a sample pair formed by a video pixel product and an audio sample product adjacent to the regularized sample pair by adopting two-dimensional generalized Gaussian distribution:
in the formula (f)x(x, s, Σ) represents a probability density function of a sample pair formed by the video pixel product and the audio sample product after the regularization processing, x represents a sample pair formed by the video pixel product and the audio sample product after the regularization processing, s is a shape parameter, Σ is a scale parameter, and d represents the dimension of x; (. cndot.) represents the gamma function:
where parameter s is a scalar and parameter Σ is a matrix of 2 × 2;
wherein: the sample pairs formed by the video pixel products and the audio sample products are distributed in four quadrants of the distribution formed by the regularized sample pairs; the four quadrants are respectively: the neighboring video pixel product is greater than zero and the neighboring audio sample product is greater than zero, the video pixel product is greater than zero and the neighboring audio sample product is less than zero, the video pixel product is less than zero and the neighboring audio sample product is greater than zero, the video pixel product is less than zero and the neighboring audio sample product is less than zero.
6. The method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in claim 3, wherein in the step S4, the extracting of the audio quality features based on the natural audio statistical model includes:
extracting distribution parameters for describing audio quality from the natural audio statistical model, wherein the shape parameter $\alpha$ and variance parameter $\sigma^{2}$ of the generalized Gaussian distribution, and the shape parameter $\nu$, left variance parameter $\sigma_{l}^{2}$, right variance parameter $\sigma_{r}^{2}$, and mean parameter $\eta$ of the asymmetric generalized Gaussian distribution are used for describing the audio quality;

wherein:

$$\eta=(\beta_{r}-\beta_{l})\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}.$$
7. the method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in claim 2, wherein in the step S4, extracting the video quality characteristics based on the natural video statistics model includes:
extracting distribution parameters for describing video quality from the natural video statistical model, wherein the shape parameter $\alpha$ and variance parameter $\sigma^{2}$ of the generalized Gaussian distribution, and the shape parameter $\nu$, left variance parameter $\sigma_{l}^{2}$, right variance parameter $\sigma_{r}^{2}$, and mean parameter $\eta$ of the asymmetric generalized Gaussian distribution are used for describing the video quality;

wherein:

$$\eta=(\beta_{r}-\beta_{l})\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}.$$
8. the method for evaluating the joint quality of the audio and video without reference based on the natural audio and video statistics as claimed in claim 5, wherein in the step S4, the extracting of the audio and video joint characteristics of the natural audio and video joint statistical model comprises the following steps:
extracting joint distribution parameters for describing audio and video quality from a natural audio and video joint statistical model; the two-dimensional generalized Gaussian-distributed shape parameter s and the scale parameter Σ are used to describe the quality of audio and video.
9. The no-reference audio and video joint quality evaluation method based on natural audio and video statistics according to any one of claims 1 to 8, wherein S4 further comprises: down-sampling the input audio signal, and then extracting the audio quality features at a plurality of scales; and/or

taking the difference between two adjacent video frames and/or two adjacent audio samples, and then extracting the corresponding video quality features and/or audio quality features, respectively.
10. The no-reference audio and video joint quality evaluation method based on natural audio and video statistics according to any one of claims 1 to 8, wherein in S5, feature regression is performed on all the audio and video quality features extracted in S4 to obtain a single quality score describing the audio and video joint quality, wherein the audio and video quality feature regression adopts a machine learning feature fusion method or a deep learning feature fusion method based on a neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010171587.8A CN111479107B (en) | 2020-03-12 | 2020-03-12 | No-reference audio and video joint quality evaluation method based on natural audio and video statistics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111479107A true CN111479107A (en) | 2020-07-31 |
CN111479107B CN111479107B (en) | 2021-06-08 |
Family
ID=71747429
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010171587.8A Active CN111479107B (en) | 2020-03-12 | 2020-03-12 | No-reference audio and video joint quality evaluation method based on natural audio and video statistics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111479107B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111968677A (en) * | 2020-08-21 | 2020-11-20 | 南京工程学院 | Voice quality self-evaluation method for fitting-free hearing aid |
CN113382232A (en) * | 2021-08-12 | 2021-09-10 | 北京微吼时代科技有限公司 | Method, device and system for monitoring audio and video quality and electronic equipment |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109302603A (en) * | 2017-07-25 | 2019-02-01 | 中国移动通信集团北京有限公司 | A kind of video speech quality appraisal procedure and device |
CN108683909B (en) * | 2018-07-12 | 2020-07-07 | 北京理工大学 | VR audio and video integral user experience quality evaluation method |
CN108933938A (en) * | 2018-08-23 | 2018-12-04 | 北京奇艺世纪科技有限公司 | A kind of video quality method of inspection, device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111479107B (en) | 2021-06-08 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |