CN111479107A - No-reference audio and video joint quality evaluation method based on natural audio and video statistics - Google Patents
- Publication number: CN111479107A (application CN202010171587.8A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- H04N17/00 — Diagnosis, testing or measuring for television systems or their details
- H04N21/233 — Processing of audio elementary streams (server side)
- H04N21/23418 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics (server side)
- H04N21/4394 — Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams (client side)
- H04N21/44008 — Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream (client side)
- H04N21/4756 — End-user interface for inputting end-user data for rating content, e.g. scoring a recommended movie
Abstract
The invention provides a no-reference audio-video joint quality evaluation method based on natural audio and video statistics. A natural video statistical model is extended to natural audio statistics, and a natural audio-video joint statistical model is then constructed, thereby realizing no-reference audio-video joint quality evaluation based on natural audio-video statistics. The method comprises the following steps: constructing a natural video statistical model; extending the natural video statistical model to natural audio statistics; constructing a natural audio-video joint statistical model from the natural video statistical model and the natural audio statistical model; extracting audio and video quality features based on the natural audio statistics, the natural video statistics, and the natural audio-video joint statistics; and performing feature regression to obtain the final joint audio-video quality estimate. The method can effectively estimate the joint quality of an audio-video signal under test when the original (reference) audio-video signal is unknown.
Description
Technical Field
The invention relates to the technical field of multimedia quality evaluation, and in particular to a no-reference audio-video joint quality evaluation method based on a natural audio and video statistical model.
Background
In recent years, multimedia quality evaluation has attracted attention from many researchers in the fields of audio processing, video processing, and the like. According to the type of signal to be evaluated, multimedia quality evaluation can be divided into image/video quality assessment and audio quality assessment. Over the past decades, researchers have proposed many objective visual quality assessment algorithms. A search of the prior art finds the following:
Lin and Kuo surveyed perceptual visual quality metrics in W. Lin and C.-C. J. Kuo, "Perceptual visual quality metrics: A survey," Journal of Visual Communication and Image Representation, vol. 22, no. 4, pp. 297–312, 2011. Wang and Bovik examined signal fidelity measures in Z. Wang and A. C. Bovik, "Mean squared error: Love it or leave it? A new look at signal fidelity measures," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009, and reviewed reduced- and no-reference approaches in Z. Wang and A. C. Bovik, "Reduced- and no-reference image quality assessment," IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 29–40, 2011.
Although the quality evaluation techniques described above achieve encouraging results, they mostly evaluate multimedia signals of a single modality, such as a single image, video, or audio signal, ignoring the interaction and fusion between audiovisual multimodal signals. Compared with the extensive research on single-modality quality evaluation, audio-video cross-modality quality evaluation has received less attention, although multimodal audio-video signals are closer to practical application scenarios. A review of audio-video quality assessment is given by You et al. in J. You, U. Reiter, M. M. Hannuksela, M. Gabbouj, and A. Perkis, "Perceptual-based quality assessment for audio-visual services: A survey," Signal Processing: Image Communication, vol. 25, no. 7, pp. 482–501, 2010. Such evaluation techniques generally require fundamental studies of multimodal perception, i.e. of the interaction between audiovisual signals and of other factors affecting audiovisual quality evaluation, and these studies are usually carried out through audiovisual experiments. In general, these techniques are not based on content analysis but estimate audio-video quality directly from parameters such as bit rate and encoder type, so their application scenarios are very limited.
Alan Bovik et al, in A.K. Moorthy and A.C. Bovik, "bland image quality assessment From natural scene status to quality," IEEETrans.image Process, vol.20, No.12, pp.3350-3364, Dec.2011, M.A.Saad, A.C. Bovik, and C.Charrrier, "bland image quality assessment, A.natural scene status assessment of processing in the DCT domain," IEEE.image Process, Bovik.21, 8, pp.3339-3352, Aumaig.2012, and A.Mittal, A.K. molecular and A.C. Bovik, "Bovik-image," De.12, IEEE.12, and "video quality assessment" 1.12, 1,2, 1, A.Mittal, A.K.C. image, and B.5, B.1, A.C. Bovik, and B.1, C.1, C.465, C.1, C. Bovik, P.4, P.1, C. Observation, 2, and C.1, C. 4, C. for evaluating the quality statistics of the video quality of the video statistics of the video quality of the video. However, the natural video statistical model in the above method is only applicable to images and videos, and the designed method is only applicable to images and videos.
At present, there is no research or method that extends a natural video statistical model to audio and further constructs a natural audio-video joint statistical model, so as to realize no-reference audio-video joint quality evaluation based on natural audio-video statistics.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a no-reference audio-video joint quality evaluation method based on a natural audio and video statistical model (natural audio-video statistics for short): a natural video statistical model is extended to natural audio statistics, and a natural audio-video joint statistical model is then constructed, thereby realizing no-reference audio-video joint quality evaluation based on natural audio-video statistics.
The invention is realized by the following technical scheme.
A no-reference audio and video joint quality evaluation method based on natural audio and video statistics comprises the following steps:
s1: constructing a natural video statistical model for an input video signal, wherein the natural video statistical model is used for performing statistical modeling on a video;
s2: popularizing the natural video statistical model obtained in the S1 into natural audio statistics, and constructing a natural audio statistical model for the input audio signal, wherein the natural audio statistical model is used for carrying out statistical modeling on audio;
s3: constructing a natural audio and video joint statistical model by using the natural video statistical model obtained in the S1 and the natural audio statistical model obtained in the S2, wherein the natural audio and video joint statistical model is used for performing joint statistical modeling on videos and audios;
s4: respectively extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model;
s5: and performing characteristic regression operation on the audio and video quality characteristics obtained in the S4 to obtain the final audio and video joint quality estimation.
Preferably, in S1, regularization processing is performed on the input video signal, and a natural video statistical model of a spatial domain is constructed by using the regularized video signal; the method comprises the following steps:
carrying out regularization processing on an input video signal:
wherein I (I, j) is an original video signal,for the video signal after regularization, i, j is the pixel index, c is a constant set according to the dynamic range of the video signal for keeping the division stable; μ (i, j) and σ (i, j) represent the local mean and standard deviation, respectively, of the video signal:
in the formula, wk,lK-K, …, K, l-L, …, L represent a two-dimensional local gaussian window;
wherein f (x; α, sigma)2) A probability density function representing values of pixels of the regularized video signal; x represents a regularized video signalα denotes a parameter for controlling the shape of the distribution, sigma denotes a parameterNumber, σ2Variance for the control distribution; (. cndot.) represents the gamma function:
β denotes the following parameters;
describing two samples adjacent to the video signal after regularization by adopting asymmetric generalized Gaussian distributionAndthe product between, i.e.
In the formula (I), the compound is shown in the specification,representing a probability density function to which products of adjacent pixel values of the regularized video signal obey, v representing a parameter for controlling the shape of the distribution; sigmalIt is indicated that one of the parameters,variance for controlling left-hand distribution; sigmarIt is indicated that one of the parameters,for controlling the variance of the right distribution βlAnd βrThe following parameters are indicated:
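A numerical sketch of this step: local mean/std regularization of a frame followed by moment-matching estimation of the generalized-Gaussian parameters. The 7×7 separable Gaussian window, the constant c = 1, and all function names are illustrative assumptions, not values specified by the patent.

```python
import numpy as np
from math import gamma

def _gauss_window(K=3, s=7.0 / 6.0):
    t = np.arange(-K, K + 1)
    w = np.exp(-t * t / (2.0 * s * s))
    return w / w.sum()

def _filter_sep(img, w):
    # Separable Gaussian filtering with edge padding, one axis at a time.
    pad = len(w) // 2
    f = lambda v: np.convolve(np.pad(v, pad, mode='edge'), w, 'valid')
    return np.apply_along_axis(f, 0, np.apply_along_axis(f, 1, img))

def regularize_frame(frame, c=1.0):
    """Divisive normalization: subtract the local Gaussian-weighted mean
    and divide by the local standard deviation plus a constant c."""
    frame = np.asarray(frame, dtype=np.float64)
    w = _gauss_window()
    mu = _filter_sep(frame, w)
    var = _filter_sep(frame * frame, w) - mu * mu
    return (frame - mu) / (np.sqrt(np.maximum(var, 0.0)) + c)

def fit_ggd(x):
    """Moment-matching GGD fit: choose the shape alpha whose theoretical
    ratio Gamma(1/a)Gamma(3/a)/Gamma(2/a)^2 matches E[x^2] / E[|x|]^2."""
    x = np.ravel(x)
    sigma_sq = float(np.mean(x * x))
    rho = sigma_sq / (np.mean(np.abs(x)) ** 2)
    alphas = np.arange(0.2, 10.0, 0.01)
    r = np.array([gamma(1 / a) * gamma(3 / a) / gamma(2 / a) ** 2 for a in alphas])
    return float(alphas[np.argmin((r - rho) ** 2)]), sigma_sq
```

For Gaussian-distributed input the fitted shape parameter should be close to α = 2, since the regularized values of pristine natural video are approximately Gaussian.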
preferably, in S2, regularization processing is performed on the input audio signal, and a natural audio statistical model is constructed by using the regularized audio signal; the method comprises the following steps:
regularizing an input audio signal:
wherein a (t) is an original audio signal,for the audio signal after regularization, t is a time sequence index, and k is a constant for keeping the division equation stable, which is set according to the dynamic range of the audio signal; μ (t) and σ (t) represent the local mean and standard deviation, respectively, of the audio signal:
in the formula, wττ ═ T, …, T representing a groupA one-dimensional local Gaussian window;
regularized audio signals using natural audio statistical propertiesModeling a natural audio statistical model:
wherein f (x; α, sigma)2) Representing a probability density function to which sample values of the regularized audio signal are subjected; x represents a regularized audio signalα represents a parameter for controlling the shape of the distribution, sigma represents a parameter, sigma2Variance for the control distribution; (. cndot.) represents the gamma function:
β denotes the following parameters:
describing two samples adjacent to the audio signal after regularization by adopting asymmetric generalized Gaussian distributionAndthe product between, i.e.
In the formula (I), the compound is shown in the specification,representing a probability density function to which products of adjacent sample values of the regularized audio signal obey, v representing a parameter for controlling the shape of the distribution; sigmalIt is indicated that one of the parameters,variance for controlling left-hand distribution; sigmarIt is indicated that one of the parameters,for controlling the variance of the right distribution βlAnd βrThe following parameters are indicated:
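A corresponding one-dimensional sketch for audio: local mean/std regularization of the waveform and a moment-matching fit of the asymmetric generalized Gaussian. In the method, `fit_aggd` would be applied to products of adjacent regularized samples; the window half-length T, constant k, and function names are assumptions.

```python
import numpy as np
from math import gamma

def regularize_audio(a, T=16, k=1e-3):
    """1-D analogue of the video regularization: local Gaussian-weighted
    mean/std normalization of the waveform (window half-length T)."""
    a = np.asarray(a, dtype=np.float64)
    t = np.arange(-T, T + 1)
    w = np.exp(-t * t / (2.0 * (T / 3.0) ** 2))
    w /= w.sum()
    pad = np.pad(a, T, mode='edge')
    mu = np.convolve(pad, w, 'valid')
    var = np.convolve(pad * pad, w, 'valid') - mu * mu
    return (a - mu) / (np.sqrt(np.maximum(var, 0.0)) + k)

def fit_aggd(x):
    """Moment-matching AGGD fit, returning (nu, sigma_l^2, sigma_r^2):
    left/right standard deviations from the negative/non-negative halves,
    then a grid search on the shape nu via the standard moment ratio."""
    x = np.ravel(x)
    sl = np.sqrt(np.mean(x[x < 0] ** 2))
    sr = np.sqrt(np.mean(x[x >= 0] ** 2))
    g = sl / sr
    rhat = np.mean(np.abs(x)) ** 2 / np.mean(x * x)
    Rhat = rhat * (g ** 3 + 1) * (g + 1) / (g ** 2 + 1) ** 2
    nus = np.arange(0.2, 10.0, 0.01)
    rho = np.array([gamma(2 / v) ** 2 / (gamma(1 / v) * gamma(3 / v)) for v in nus])
    nu = float(nus[np.argmin((rho - Rhat) ** 2)])
    return nu, float(sl ** 2), float(sr ** 2)
```

For symmetric Gaussian input the fit should return a shape close to ν = 2 with nearly equal left and right variances.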
preferably, in S3, constructing a natural audio and video joint statistical model includes:
for each pixel of each frame of the video signal after regularization, randomly selecting a sample from the most adjacent section of audio clip of the frame video to pair with each pixel in pairs to form a sample pair; and carrying out regularization treatment on the sample pairs, and constructing a natural audio and video joint statistical model by using the sample pairs subjected to regularization treatment.
Preferably, in S3, the natural audio and video joint statistical model is described using a two-dimensional generalized Gaussian distribution:

f_x(x; s, Σ) = [Γ(d/2)·s / (π^{d/2} Γ(d/(2s)) 2^{d/(2s)} |Σ|^{1/2})] · exp(−½ (xᵀ Σ⁻¹ x)^s)

wherein f_x(x; s, Σ) is the probability density function obeyed by the regularized sample pairs, x is a regularized sample pair, s is the shape parameter, Σ is the scale parameter, d is the dimension of x (here d = 2), and Γ(·) is the gamma function:

Γ(z) = ∫₀^∞ t^{z−1} e^{−t} dt,  z > 0;

the parameter s is a scalar and the parameter Σ is a 2 × 2 matrix.

The sample pairs formed by adjacent video pixel products and adjacent audio sample products after regularization are likewise described by a two-dimensional generalized Gaussian distribution of the same form, with x then denoting such a product pair.
wherein: the sample pairs formed by the video pixel products and the audio sample products are distributed in four quadrants of the distribution formed by the regularized sample pairs; the four quadrants are respectively: the neighboring video pixel product is greater than zero and the neighboring audio sample product is greater than zero, the video pixel product is greater than zero and the neighboring audio sample product is less than zero, the video pixel product is less than zero and the neighboring audio sample product is greater than zero, the video pixel product is less than zero and the neighboring audio sample product is less than zero.
Preferably, in S4, extracting the audio quality features based on the natural audio statistical model includes:

extracting, from the natural audio statistical model, distribution parameters describing the audio quality: the shape parameter α and variance parameter σ² of the generalized Gaussian distribution, and the shape parameter ν, left variance parameter σ_l², right variance parameter σ_r², and mean parameter η of the asymmetric generalized Gaussian distribution, wherein:

η = (β_r − β_l) · Γ(2/ν) / Γ(1/ν).
preferably, in S4, the extracting the video quality feature based on the natural video statistical model includes:
extracting distribution parameters for describing video quality from a natural video statistical model, wherein the shape parameter α and the variance parameter sigma in the generalized Gaussian distribution2Shape parameter v and left difference parameter in asymmetric generalized Gaussian distribution for describing video qualityRight variance parameterAnd its mean parameter η is used to describe the video quality;
wherein:
preferably, in S4, extracting an audio-video joint feature of the natural audio-video joint statistical model includes:
extracting joint distribution parameters for describing audio and video quality from a natural audio and video joint statistical model; the two-dimensional generalized Gaussian-distributed shape parameter s and the scale parameter Σ are used to describe the quality of audio and video.
Preferably, S4 further includes: down-sampling the input audio signal and then extracting the audio quality features at a plurality of scales; and/or

differencing adjacent video frames and/or adjacent audio samples and then extracting the video quality features and/or audio quality features, respectively.
Preferably, in S5, feature regression is performed on all the audio and video quality features extracted in S4 to obtain a single quality score describing the joint audio-video quality; the feature regression adopts a machine-learning feature fusion method or a deep-learning (neural network) feature fusion method.
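As a stand-in for the feature-regression step, the sketch below uses a closed-form ridge regression; the patent allows any machine-learning fusion (e.g. a support vector machine or random forest) or a deep-learning method, so this is only one hypothetical choice.

```python
import numpy as np

def train_quality_regressor(feats, scores, lam=1e-3):
    """Closed-form ridge regression as a stand-in for S5's feature
    regression; returns a weight vector whose last entry is the bias."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ scores)

def predict_quality(w, feat):
    """Map one audio-video feature vector to a single quality score."""
    return float(np.append(feat, 1.0) @ w)
```

In practice `feats` would be the matrix of per-sequence natural-statistics features and `scores` the subjective quality ratings used for training.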
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a no-reference audio and video joint quality evaluation method based on natural audio and video statistics, which is inspired by a visual quality evaluation method based on natural video statistics, and realizes no-reference audio and video joint quality evaluation based on natural audio and video statistics by popularizing a related natural video statistical model to natural audio statistics and further constructing a natural audio and video joint statistical model; the method for evaluating the joint quality of the audio and video signals without the reference based on the natural audio and video statistics can effectively estimate the joint quality of the audio and video signals to be measured under the condition that the original audio and video signals are unknown.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a general flowchart of a non-reference audio/video joint quality evaluation method based on natural audio/video statistics according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a natural audio/video joint statistical model construction provided in an embodiment of the present invention;
FIG. 3 is a distribution diagram of normalized audio and video signal sample pairs of different compression levels according to an embodiment of the present invention;
FIG. 4 is a sample pair distribution diagram of neighboring video pixel products and neighboring audio sample products according to an embodiment of the present invention.
Detailed Description
The following examples illustrate the invention in detail. The embodiments are implemented on the premise of the technical solution of the invention, and detailed implementation modes and specific operation processes are given. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention.
The invention provides a no-reference audio and video joint quality evaluation method based on a natural audio and video statistical model, which comprises the following steps:
the first step is as follows: constructing a natural video statistical model for an input video signal, wherein the natural video statistical model is used for performing statistical modeling on the video signal;
the second step is that: the natural video statistical model obtained in the first step is popularized to natural audio statistics, and a natural audio statistical model is constructed for the input audio signals and used for carrying out statistical modeling on the audio signals;
the third step: constructing a natural audio and video joint statistical model by using the natural video statistical model obtained in the first step and the natural audio statistical model obtained in the second step, wherein the natural audio and video joint statistical model is used for performing joint statistical modeling on a video signal and an audio signal;
the fourth step: respectively extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model;
the fifth step: and performing characteristic regression operation on the audio and video quality characteristics obtained in the fourth step to obtain the final audio and video joint quality estimation.
The detailed steps of the no-reference audio/video joint quality evaluation method based on natural audio/video statistics provided by the embodiment of the invention are further explained below with reference to the accompanying drawings.
As shown in fig. 1, a method provided in an embodiment of the present invention includes:
firstly, constructing a natural video statistical model
The natural video statistical model constructed by the embodiment of the invention is a natural video statistical model of a spatial domain, and the specific process comprises the following steps: regularizing the input video signal and performing natural video statistical modeling by using the regularized video signal.
The regularization processing of the input video signal is:

Î(i,j) = [I(i,j) − μ(i,j)] / [σ(i,j) + c]

wherein I(i,j) is the original video signal, Î(i,j) is the regularized video signal, (i,j) is the pixel index, c is a constant set according to the dynamic range of the video signal to keep the division stable, and μ(i,j) and σ(i,j) are the local mean and standard deviation of the video signal, respectively:

μ(i,j) = Σ_{k=−K..K} Σ_{l=−L..L} w_{k,l} I(i+k, j+l)
σ(i,j) = sqrt( Σ_{k=−K..K} Σ_{l=−L..L} w_{k,l} [I(i+k, j+l) − μ(i,j)]² )

wherein w_{k,l}, k = −K,…,K, l = −L,…,L, is a two-dimensional local Gaussian window.

The statistical modeling of the natural video statistical model using the regularized video signal Î is as follows. After the above regularization, Î of pristine natural video generally obeys a Gaussian distribution, while video distortion forces Î to deviate from this Gaussian distribution; both the Gaussian distribution of natural video and the distribution of distorted video (i.e. the video to be tested) can be described by a generalized Gaussian distribution:

f(x; α, σ²) = [α / (2βΓ(1/α))] · exp(−(|x|/β)^α)

wherein

β = σ·sqrt(Γ(1/α)/Γ(3/α))

and Γ(·) is the gamma function

Γ(z) = ∫₀^∞ t^{z−1} e^{−t} dt,  z > 0,

where α controls the shape of the distribution and σ² controls its variance.

Besides Î itself, the product of two adjacent samples of the regularized video signal, Î(i,j)·Î(i,j+1), also obeys an asymmetric generalized Gaussian distribution:

f(x; ν, σ_l², σ_r²) = [ν / ((β_l + β_r)Γ(1/ν))] · exp(−(−x/β_l)^ν)  for x < 0
f(x; ν, σ_l², σ_r²) = [ν / ((β_l + β_r)Γ(1/ν))] · exp(−(x/β_r)^ν)   for x ≥ 0

wherein

β_l = σ_l·sqrt(Γ(1/ν)/Γ(3/ν)),  β_r = σ_r·sqrt(Γ(1/ν)/Γ(3/ν)),

the shape parameter ν controls the shape of the distribution, and σ_l² and σ_r² control the variances of the left and right sides, respectively.
Secondly, extending the natural video statistical model to natural audio statistics

The specific process of extending the natural video statistical model to natural audio statistics includes regularizing the input audio signal and performing natural audio statistical modeling using the regularized audio signal.

The regularization processing of the input audio signal is:

â(t) = [a(t) − μ(t)] / [σ(t) + c]

wherein a(t) is the original audio signal, â(t) is the regularized audio signal, t is the time-sequence index, c is a constant set according to the dynamic range of the audio signal to keep the division stable, and μ(t) and σ(t) are the local mean and standard deviation of the audio signal, respectively:

μ(t) = Σ_{τ=−T..T} w_τ a(t+τ)
σ(t) = sqrt( Σ_{τ=−T..T} w_τ [a(t+τ) − μ(t)]² )

wherein w_τ, τ = −T,…,T, is a one-dimensional local Gaussian window.

The statistical modeling of the natural audio statistical model using the regularized audio signal â is as follows. After the above regularization, â of pristine natural audio generally obeys a Gaussian distribution, while audio distortion forces â to deviate from this Gaussian distribution; both the Gaussian distribution of natural audio and the distribution of distorted audio can be described by a generalized Gaussian distribution:

f(x; α, σ²) = [α / (2βΓ(1/α))] · exp(−(|x|/β)^α),  β = σ·sqrt(Γ(1/α)/Γ(3/α))

wherein Γ(·) is the gamma function

Γ(z) = ∫₀^∞ t^{z−1} e^{−t} dt,  z > 0,

α controls the shape of the distribution, and σ² controls its variance.

Besides â itself, the product of two adjacent regularized samples, â(t)·â(t+1), also obeys an asymmetric generalized Gaussian distribution:

f(x; ν, σ_l², σ_r²) = [ν / ((β_l + β_r)Γ(1/ν))] · exp(−(−x/β_l)^ν)  for x < 0
f(x; ν, σ_l², σ_r²) = [ν / ((β_l + β_r)Γ(1/ν))] · exp(−(x/β_r)^ν)   for x ≥ 0

wherein

β_l = σ_l·sqrt(Γ(1/ν)/Γ(3/ν)),  β_r = σ_r·sqrt(Γ(1/ν)/Γ(3/ν)),

the shape parameter ν controls the shape, and σ_l² and σ_r² control the variances of the left and right sides, respectively.
Thirdly, constructing a natural audio and video combined statistical model by utilizing a natural video statistical model and a natural audio statistical model
The specific process of constructing the natural audio and video joint statistical model by using the natural video statistical model and the natural audio statistical model is as follows:
as shown in fig. 2, for each pixel in each frame of video, a sample is randomly selected from the most adjacent segment of audio samples of the video frame and paired with the pixel to form a sample pair, the sample pair is regularized, and a natural audio and video joint statistical model is constructed by using the regularized sample pair.
Specifically, the regularized audio and video signal sample pairs generally follow a two-dimensional Gaussian distribution, and audio and video distortion forces the distribution of the regularized sample pairs to deviate from the two-dimensional Gaussian distribution; both the two-dimensional Gaussian distribution of natural audio and video and the distribution of distorted audio and video can be described by a two-dimensional generalized Gaussian distribution:

$$f_{\mathbf{x}}(\mathbf{x};s,\Sigma)=\frac{s\,\Gamma(d/2)}{\pi^{d/2}\,\Gamma\!\left(\frac{d}{2s}\right)2^{d/(2s)}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}\left(\mathbf{x}^{T}\Sigma^{-1}\mathbf{x}\right)^{s}\right)$$

where $s$ is a shape parameter, $\Sigma$ is a scale parameter, $d$ represents the dimension of $\mathbf{x}$, and $\Gamma(\cdot)$ represents the gamma function.
In the embodiment of the invention, the parameter $s$ of the two-dimensional generalized Gaussian distribution is a scalar and the parameter $\Sigma$ is a $2\times 2$ matrix. The distribution of the regularized audio and video signal sample pairs is shown in fig. 3; it can be seen that the two-dimensional generalized Gaussian distribution describes this distribution very well.
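Assuming the standard multivariate generalized Gaussian normalization (the patent shows the fitted distribution only as a figure), the two-dimensional density can be evaluated as follows; for s = 1 it reduces to the ordinary two-dimensional Gaussian:

```python
import numpy as np
from scipy.special import gamma

def mggd_pdf(x, s, Sigma):
    """Multivariate generalized Gaussian density; x has shape (n, d)."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    Sinv = np.linalg.inv(Sigma)
    q = np.einsum('ni,ij,nj->n', x, Sinv, x)  # x^T Sigma^{-1} x per row
    norm = (s * gamma(d / 2.0)
            / (np.pi ** (d / 2.0) * gamma(d / (2.0 * s))
               * 2.0 ** (d / (2.0 * s)) * np.sqrt(np.linalg.det(Sigma))))
    return norm * np.exp(-0.5 * q ** s)
```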
In addition to the regularized sample pairs obeying a two-dimensional generalized Gaussian distribution, as shown in fig. 4, the sample pairs formed by products of adjacent video pixels and products of adjacent audio samples also obey a regular distribution. This distribution can be described by a two-dimensional generalized Gaussian distribution within each of its four quadrants, namely: adjacent video pixel product greater than zero and adjacent audio sample product greater than zero; video pixel product greater than zero and audio sample product less than zero; video pixel product less than zero and audio sample product greater than zero; and video pixel product less than zero and audio sample product less than zero.
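The quadrant-wise modeling can be sketched as below: the sign pattern of each (video-product, audio-product) pair selects one of the four quadrants, and a separate two-dimensional fit is run per quadrant. The `fit` callable and all names are placeholders, not the patent's implementation:

```python
import numpy as np

def quadrant_features(vprod, aprod, fit):
    """Split (video-product, audio-product) pairs into the four sign
    quadrants and apply a per-quadrant fitting callable `fit(v, a)`."""
    feats = []
    for sv, sa in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
        mask = (sv * vprod > 0) & (sa * aprod > 0)
        feats.append(fit(vprod[mask], aprod[mask]))
    return feats
```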
Fourthly, extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model
First, distribution parameters capable of describing audio quality need to be extracted from the generalized Gaussian distribution obeyed by the regularized audio signal $\hat{a}(t)$ obtained in the second step and the asymmetric generalized Gaussian distribution obeyed by $p(t)$. The shape parameter $\alpha$ and the variance parameter $\sigma^{2}$ of the generalized Gaussian distribution, together with the shape parameter $\nu$, left variance parameter $\sigma_{l}^{2}$, right variance parameter $\sigma_{r}^{2}$, and the following mean parameter of the asymmetric generalized Gaussian distribution, can describe the audio quality:

$$\eta=(\beta_{r}-\beta_{l})\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}$$
Secondly, distribution parameters capable of describing video quality need to be extracted from the generalized Gaussian distribution obeyed by the regularized video signal $\hat{I}(i,j)$ obtained in the first step and the asymmetric generalized Gaussian distributions obeyed by the adjacent-pixel products $H(i,j)$, $V(i,j)$, $D_{1}(i,j)$, and $D_{2}(i,j)$. The shape parameter $\alpha$ and the variance parameter $\sigma^{2}$ of the generalized Gaussian distribution, together with the shape parameter $\nu$, left variance parameter $\sigma_{l}^{2}$, right variance parameter $\sigma_{r}^{2}$, and the following mean parameter of the asymmetric generalized Gaussian distribution, can describe the video quality:

$$\eta=(\beta_{r}-\beta_{l})\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}$$
Finally, distribution parameters capable of describing audio and video quality are extracted from the two-dimensional generalized Gaussian distribution obeyed by the regularized audio and video signal sample pairs, and from the two-dimensional generalized Gaussian distributions obeyed, in each of the four quadrants, by the sample pairs formed by adjacent video pixel products and adjacent audio sample products. The shape parameter $s$ and the scale parameter $\Sigma$ of each two-dimensional generalized Gaussian distribution can describe the audio and video quality.
Fifthly, performing characteristic regression to obtain final audio and video joint quality estimation
And finally, performing regression on all audio and video quality features based on the natural audio statistical model, the natural video statistical model, and the natural audio and video joint statistical model in the fourth step to obtain a single quality score describing the audio and video joint quality, wherein the audio and video quality feature regression can be a simple machine learning feature fusion method such as a support vector machine or a random forest, or a more complex deep learning feature fusion method such as a neural network.
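As a minimal stand-in for the support-vector or random-forest regressor named above (whose configuration the patent leaves open), a least-squares fusion of the feature vector into a single quality score can be sketched as:

```python
import numpy as np

def train_quality_regressor(features, mos):
    """Least-squares stand-in for the SVR / random-forest regressor:
    fits weights mapping quality features to subjective scores (MOS)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # add bias
    w, *_ = np.linalg.lstsq(X, mos, rcond=None)
    return w

def predict_quality(features, w):
    """Apply the learned weights to produce a single quality score."""
    F = np.atleast_2d(features)
    X = np.hstack([F, np.ones((F.shape[0], 1))])
    return X @ w
```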
The implementation effect is as follows:
In order to verify the no-reference audio and video joint quality evaluation method based on natural audio and video statistics provided by the above embodiment of the present invention, the algorithm may be tested on the LIVE-SJTU Audio and Video Quality Assessment (A/V-QA) Database. The LIVE-SJTU A/V-QA Database is an audio and video quality evaluation database containing 336 distorted audio-video sequences generated from 14 high-quality reference sequences using 24 audio-video distortion types/degrees, wherein the 24 distortion conditions comprise all combinations of two video distortion types (compression, and compression plus scaling, each with four levels of distortion) and one audio distortion type (compression, with three levels of distortion).
The test utilizes 80% of the data in the LIVE-SJTU A/V-QA database for training and the remaining 20% for testing. The training-testing procedure can be randomly repeated 1000 times, and the median Spearman rank-order correlation coefficient (SRCC) over the 1000 trials can be used as the performance test result of the algorithm.
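The evaluation protocol described above (repeated random 80%/20% splits, median SRCC) can be sketched as follows; `train_fn` stands in for whatever regressor is used, and the helper name is invented:

```python
import numpy as np
from scipy.stats import spearmanr

def median_srcc(X, y, train_fn, n_trials=1000, train_frac=0.8, seed=0):
    """Repeat random train/test splits and report the median SRCC.
    `train_fn(X_tr, y_tr)` must return a predictor f(X_te) -> scores."""
    rng = np.random.default_rng(seed)
    srccs = []
    n = len(y)
    for _ in range(n_trials):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        predictor = train_fn(X[tr], y[tr])
        srccs.append(spearmanr(predictor(X[te]), y[te])[0])
    return float(np.median(srccs))
```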
The no-reference audio and video joint quality evaluation method based on the natural audio and video statistical model provided by the embodiment of the invention comprises the five steps of constructing the natural video statistical model, extending the natural video statistical model to natural audio statistics, constructing the natural audio and video joint statistical model by using the natural video statistical model and the natural audio statistical model, extracting audio and video quality features based on the natural audio statistical model, the natural video statistical model, and the natural audio and video joint statistical model, and performing feature regression to obtain the final audio and video joint quality estimation, so that the joint quality of audio and video can be effectively evaluated. By extending the related natural video statistical model to natural audio statistics and further constructing a natural audio and video joint statistical model, the method realizes no-reference audio and video joint quality evaluation based on natural audio and video statistics.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.
Claims (10)
1. A no-reference audio and video joint quality evaluation method based on natural audio and video statistics is characterized by comprising the following steps:
s1: constructing a natural video statistical model for an input video signal, wherein the natural video statistical model is used for performing statistical modeling on a video;
s2: extending the natural video statistical model obtained in S1 to natural audio statistics, and constructing a natural audio statistical model for the input audio signal, wherein the natural audio statistical model is used for carrying out statistical modeling on audio;
s3: constructing a natural audio and video joint statistical model by using the natural video statistical model obtained in the S1 and the natural audio statistical model obtained in the S2, wherein the natural audio and video joint statistical model is used for performing joint statistical modeling on videos and audios;
s4: respectively extracting audio and video quality characteristics based on a natural audio statistical model, a natural video statistical model and a natural audio and video combined statistical model;
s5: and performing characteristic regression operation on the audio and video quality characteristics obtained in the S4 to obtain the final audio and video joint quality estimation.
2. The method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in claim 1, wherein in S1, input video signals are regularized, and a natural video statistical model of a spatial domain is constructed by using the regularized video signals; the method comprises the following steps:
carrying out regularization processing on an input video signal:

$$\hat{I}(i,j)=\frac{I(i,j)-\mu(i,j)}{\sigma(i,j)+c}$$

wherein $I(i,j)$ is the original video signal, $\hat{I}(i,j)$ is the regularized video signal, $i,j$ are the pixel indices, and $c$ is a constant set according to the dynamic range of the video signal for keeping the division stable; $\mu(i,j)$ and $\sigma(i,j)$ represent the local mean and standard deviation, respectively, of the video signal:

$$\mu(i,j)=\sum_{k=-K}^{K}\sum_{l=-L}^{L}w_{k,l}\,I(i+k,j+l)$$

$$\sigma(i,j)=\sqrt{\sum_{k=-K}^{K}\sum_{l=-L}^{L}w_{k,l}\left(I(i+k,j+l)-\mu(i,j)\right)^{2}}$$

in the formulas, $w_{k,l}$, $k=-K,\ldots,K$, $l=-L,\ldots,L$, represents a two-dimensional local Gaussian window;
modeling the regularized video signal $\hat{I}(i,j)$ by utilizing natural video statistical characteristics:

$$f(x;\alpha,\sigma^{2})=\frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\!\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

wherein $f(x;\alpha,\sigma^{2})$ represents the probability density function obeyed by the pixel values of the regularized video signal; $x$ represents the regularized video signal $\hat{I}(i,j)$; $\alpha$ denotes the parameter controlling the shape of the distribution; $\sigma^{2}$ denotes the variance of the distribution; $\Gamma(\cdot)$ represents the gamma function:

$$\Gamma(z)=\int_{0}^{\infty}t^{z-1}e^{-t}\,dt,\quad z>0$$

and $\beta$ denotes the following parameter:

$$\beta=\sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}};$$
describing the product between two adjacent samples of the regularized video signal by adopting an asymmetric generalized Gaussian distribution:

$$f(x;\nu,\sigma_{l}^{2},\sigma_{r}^{2})=\begin{cases}\dfrac{\nu}{(\beta_{l}+\beta_{r})\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{-x}{\beta_{l}}\right)^{\nu}\right), & x<0\\[1ex]\dfrac{\nu}{(\beta_{l}+\beta_{r})\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{x}{\beta_{r}}\right)^{\nu}\right), & x\geq 0\end{cases}$$

in the formula, $f(x;\nu,\sigma_{l}^{2},\sigma_{r}^{2})$ represents the probability density function obeyed by the products of adjacent pixel values of the regularized video signal; $\nu$ represents the parameter controlling the shape of the distribution; $\sigma_{l}^{2}$ is used for controlling the variance of the left side of the distribution; $\sigma_{r}^{2}$ is used for controlling the variance of the right side of the distribution; $\beta_{l}$ and $\beta_{r}$ denote the following parameters:

$$\beta_{l}=\sigma_{l}\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}},\qquad\beta_{r}=\sigma_{r}\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}};$$
3. the no-reference audio-video joint quality evaluation method based on natural audio-video statistics as claimed in claim 1, wherein in S2, input audio signals are regularized, and a natural audio statistical model is constructed using the regularized audio signals; the method comprises the following steps:
regularizing an input audio signal:

$$\hat{a}(t)=\frac{a(t)-\mu(t)}{\sigma(t)+k}$$

wherein $a(t)$ is the original audio signal, $\hat{a}(t)$ is the regularized audio signal, $t$ is the time-sequence index, and $k$ is a constant set according to the dynamic range of the audio signal for keeping the division stable; $\mu(t)$ and $\sigma(t)$ represent the local mean and standard deviation, respectively, of the audio signal:

$$\mu(t)=\sum_{\tau=-T}^{T}w_{\tau}\,a(t+\tau)$$

$$\sigma(t)=\sqrt{\sum_{\tau=-T}^{T}w_{\tau}\left(a(t+\tau)-\mu(t)\right)^{2}}$$

in the formulas, $w_{\tau}$, $\tau=-T,\ldots,T$, denotes a one-dimensional local Gaussian window;
modeling the regularized audio signal $\hat{a}(t)$ by utilizing natural audio statistical properties:

$$f(x;\alpha,\sigma^{2})=\frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\!\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right)$$

wherein $f(x;\alpha,\sigma^{2})$ represents the probability density function obeyed by the sample values of the regularized audio signal; $x$ represents the regularized audio signal $\hat{a}(t)$; $\alpha$ denotes the parameter controlling the shape of the distribution; $\sigma^{2}$ denotes the variance of the distribution; $\Gamma(\cdot)$ represents the gamma function:

$$\Gamma(z)=\int_{0}^{\infty}t^{z-1}e^{-t}\,dt,\quad z>0$$

and $\beta$ denotes the following parameter:

$$\beta=\sigma\sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}};$$
describing the product between two adjacent samples $\hat{a}(t)$ and $\hat{a}(t+1)$ of the regularized audio signal by adopting an asymmetric generalized Gaussian distribution:

$$f(x;\nu,\sigma_{l}^{2},\sigma_{r}^{2})=\begin{cases}\dfrac{\nu}{(\beta_{l}+\beta_{r})\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{-x}{\beta_{l}}\right)^{\nu}\right), & x<0\\[1ex]\dfrac{\nu}{(\beta_{l}+\beta_{r})\Gamma(1/\nu)}\exp\!\left(-\left(\dfrac{x}{\beta_{r}}\right)^{\nu}\right), & x\geq 0\end{cases}$$

in the formula, $f(x;\nu,\sigma_{l}^{2},\sigma_{r}^{2})$ represents the probability density function obeyed by the products of adjacent sample values of the regularized audio signal; $\nu$ represents the parameter controlling the shape of the distribution; $\sigma_{l}^{2}$ is used for controlling the variance of the left side of the distribution; $\sigma_{r}^{2}$ is used for controlling the variance of the right side of the distribution; $\beta_{l}$ and $\beta_{r}$ denote the following parameters:

$$\beta_{l}=\sigma_{l}\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}},\qquad\beta_{r}=\sigma_{r}\sqrt{\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}};$$
4. the method for evaluating the joint quality of the non-reference audio and video based on the natural audio and video statistics as claimed in claim 1, wherein in the step S3, a natural audio and video joint statistical model is constructed, which comprises:
for each pixel of each frame of the regularized video signal, randomly selecting a sample from the audio segment nearest to that frame and pairing it with the pixel to form a sample pair; and carrying out regularization processing on the sample pairs, and constructing the natural audio and video joint statistical model by using the regularized sample pairs.
5. The method for evaluating the joint quality of the audio and video without reference based on the natural audio and video statistics as claimed in claim 4, wherein in S3, a two-dimensional generalized Gaussian distribution is adopted to describe the joint statistical model of the natural audio and video:
in the formula (f)x(x; s, Σ) represents a probability density function to which the regularized sample pair is subjected, x represents the regularized sample pair, s is a shape parameter, Σ is a scale parameter, and d represents the dimension of x; (. cndot.) represents the gamma function:
where parameter s is a scalar and parameter Σ is a matrix of 2 × 2;
describing a sample pair formed by a video pixel product and an audio sample product adjacent to the regularized sample pair by adopting two-dimensional generalized Gaussian distribution:
in the formula (f)x(x, s, Σ) represents a probability density function of a sample pair formed by the video pixel product and the audio sample product after the regularization processing, x represents a sample pair formed by the video pixel product and the audio sample product after the regularization processing, s is a shape parameter, Σ is a scale parameter, and d represents the dimension of x; (. cndot.) represents the gamma function:
where parameter s is a scalar and parameter Σ is a matrix of 2 × 2;
wherein: the sample pairs formed by the video pixel products and the audio sample products are distributed in four quadrants of the distribution formed by the regularized sample pairs; the four quadrants are respectively: the neighboring video pixel product is greater than zero and the neighboring audio sample product is greater than zero, the video pixel product is greater than zero and the neighboring audio sample product is less than zero, the video pixel product is less than zero and the neighboring audio sample product is greater than zero, the video pixel product is less than zero and the neighboring audio sample product is less than zero.
6. The method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in claim 3, wherein in the step S4, the extracting of the audio quality features based on the natural audio statistical model includes:
extracting distribution parameters for describing audio quality from the natural audio statistical model, wherein the shape parameter $\alpha$ and variance parameter $\sigma^{2}$ of the generalized Gaussian distribution, and the shape parameter $\nu$, left variance parameter $\sigma_{l}^{2}$, right variance parameter $\sigma_{r}^{2}$, and mean parameter $\eta$ of the asymmetric generalized Gaussian distribution are used for describing the audio quality;

wherein:

$$\eta=(\beta_{r}-\beta_{l})\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}.$$
7. the method for evaluating the joint quality of the non-reference audios and videos based on the natural audio and video statistics as claimed in claim 2, wherein in the step S4, extracting the video quality characteristics based on the natural video statistics model includes:
extracting distribution parameters for describing video quality from the natural video statistical model, wherein the shape parameter $\alpha$ and variance parameter $\sigma^{2}$ of the generalized Gaussian distribution, and the shape parameter $\nu$, left variance parameter $\sigma_{l}^{2}$, right variance parameter $\sigma_{r}^{2}$, and mean parameter $\eta$ of the asymmetric generalized Gaussian distribution are used for describing the video quality;

wherein:

$$\eta=(\beta_{r}-\beta_{l})\frac{\Gamma(2/\nu)}{\Gamma(1/\nu)}.$$
8. the method for evaluating the joint quality of the audio and video without reference based on the natural audio and video statistics as claimed in claim 5, wherein in the step S4, the extracting of the audio and video joint characteristics of the natural audio and video joint statistical model comprises the following steps:
extracting joint distribution parameters for describing audio and video quality from a natural audio and video joint statistical model; the two-dimensional generalized Gaussian-distributed shape parameter s and the scale parameter Σ are used to describe the quality of audio and video.
9. The no-reference audio and video joint quality evaluation method based on natural audio and video statistics according to any one of claims 1 to 8, wherein S4 further comprises: down-sampling the input audio signal, and then extracting the audio quality features at a plurality of scales; and/or

taking the difference between two adjacent video frames and/or two adjacent audio samples, and then extracting the corresponding video quality features and/or audio quality features, respectively.
10. The no-reference audio and video joint quality evaluation method based on natural audio and video statistics according to any one of claims 1 to 8, wherein in S5, feature regression is performed on all the audio and video quality features extracted in S4 to obtain a single quality score describing the audio and video joint quality, wherein the audio and video quality feature regression adopts a machine learning feature fusion method or a deep learning feature fusion method based on a neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010171587.8A CN111479107B (en) | 2020-03-12 | 2020-03-12 | No-reference audio and video joint quality evaluation method based on natural audio and video statistics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111479107A true CN111479107A (en) | 2020-07-31 |
CN111479107B CN111479107B (en) | 2021-06-08 |
Family
ID=71747429
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010171587.8A Active CN111479107B (en) | 2020-03-12 | 2020-03-12 | No-reference audio and video joint quality evaluation method based on natural audio and video statistics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111479107B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111968677A (en) * | 2020-08-21 | 2020-11-20 | 南京工程学院 | Voice quality self-evaluation method for fitting-free hearing aid |
CN113382232A (en) * | 2021-08-12 | 2021-09-10 | 北京微吼时代科技有限公司 | Method, device and system for monitoring audio and video quality and electronic equipment |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109302603A (en) * | 2017-07-25 | 2019-02-01 | 中国移动通信集团北京有限公司 | A kind of video speech quality appraisal procedure and device |
CN108683909B (en) * | 2018-07-12 | 2020-07-07 | 北京理工大学 | VR audio and video integral user experience quality evaluation method |
CN108933938A (en) * | 2018-08-23 | 2018-12-04 | 北京奇艺世纪科技有限公司 | A kind of video quality method of inspection, device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111479107B (en) | 2021-06-08 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |