CN113593604B - Method, device and storage medium for detecting audio quality - Google Patents
Method, device and storage medium for detecting audio quality
- Publication number: CN113593604B
- Application number: CN202110831738.2A
- Authority: CN (China)
- Prior art keywords: power; audio; voice; detected; audio frame
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The application discloses a method, a device and a storage medium for detecting audio quality, and belongs to the technical field of computers. The method comprises the following steps: determining, according to the power spectrum corresponding to each audio frame to be detected of a target dry audio, a voice fundamental frequency estimate corresponding to each audio frame to be detected; for each audio frame to be detected, performing weighting processing on the power value of each frequency point in the power spectrum of the audio frame to be detected; determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum; detecting the human voice audio frames and the non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected; and determining the audio quality information of the target dry audio according to the power spectra corresponding to the human voice audio frames and to the non-human voice audio frames. With the present application, the audio quality of dry audio can be judged more accurately.
Description
Technical Field
The present application relates to the field of audio data processing, and in particular, to a method, an apparatus, and a storage medium for detecting audio quality.
Background
Karaoke applications are among the most commonly used entertainment applications. A user can sing karaoke through such an application, during which the terminal records the audio picked up by the microphone; this recording is commonly referred to as dry audio. The user can then upload the dry audio to a server for storage, and later download it when playing back their own performance, playing the dry audio and the accompaniment audio together.
During the operation of a karaoke application, hundreds of millions of dry audio recordings are stored on the server, and their number keeps growing over time, which places great pressure on the server's storage capacity. A deletion mechanism is therefore generally provided on the server: for example, the dry audio of users who have not logged in for a long time may be deleted, low-quality dry audio may be deleted, and so on. Conventionally, when evaluating the audio quality of dry audio, only the total power of the dry audio is detected and the audio quality information is determined from that total power; if the power is too low (possibly because the user did not sing towards the microphone), the dry audio is judged to be of low quality and is then deleted.
In carrying out the present application, the inventors have found that the related art has at least the following problem:
the above scheme judges audio quality only through total power; however, total power cannot reflect audio quality comprehensively and accurately, so the accuracy of the finally determined audio quality information is poor.
Disclosure of Invention
The embodiment of the application provides a method, a device and a storage medium for detecting audio quality, which can solve the problem of poor accuracy of audio quality information. The technical scheme is as follows:
In a first aspect, there is provided a method of detecting audio quality, the method comprising:
determining, according to a power spectrum corresponding to each audio frame to be detected of a target dry audio, a voice fundamental frequency estimate corresponding to each audio frame to be detected, wherein the power spectrum comprises the power values of all frequency points;
for each audio frame to be detected, performing weighting processing on the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a weighted power spectrum, wherein the weights of the frequency points at positive integer multiples of the voice fundamental frequency estimate corresponding to the audio frame to be detected are larger than the weights of the other frequency points;
determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum;
detecting the human voice audio frames and the non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
and determining the audio quality information of the target dry audio according to the power spectra corresponding to the human voice audio frames and to the non-human voice audio frames.
In one possible implementation manner, the determining, according to the power spectrum corresponding to each audio frame to be detected of the target dry audio, a voice fundamental frequency estimation value corresponding to each audio frame to be detected includes:
and determining a voice fundamental frequency estimated value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset voice frequency characteristic information.
In a possible implementation manner, the voice frequency characteristic information is a voice fundamental frequency range, and determining, according to a power spectrum corresponding to each audio frame to be detected and preset voice frequency characteristic information, a voice fundamental frequency estimated value corresponding to each audio frame to be detected includes:
smoothing the power spectrum corresponding to each audio frame to be detected with a preset window length;
and determining, for each audio frame to be detected, the lowest peak frequency of the smoothed power spectrum within the human voice fundamental frequency range as the voice fundamental frequency estimate corresponding to that audio frame to be detected.
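As a concrete illustration, the smoothing-plus-lowest-peak estimation above can be sketched as follows. This is a minimal sketch, not the patented implementation: the 80-400 Hz vocal fundamental range, the 5-bin moving-average window, and the `estimate_f0` helper name are all assumptions, since the patent leaves these presets unspecified.

```python
def estimate_f0(power_spectrum, freqs, f0_range=(80.0, 400.0), win=5):
    """Estimate the vocal fundamental of one frame from its power spectrum.

    power_spectrum holds per-bin power values, freqs the bin centres in Hz;
    f0_range and win are illustrative presets (assumptions).
    """
    n = len(power_spectrum)
    half = win // 2
    # 1. Smooth with a preset-length moving-average window.
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(power_spectrum[lo:hi]) / (hi - lo))
    # 2. Collect local peaks (bins strictly above both neighbours).
    peaks = [i for i in range(1, n - 1)
             if smoothed[i] > smoothed[i - 1] and smoothed[i] > smoothed[i + 1]]
    # 3. The lowest peak frequency inside the vocal range is the estimate.
    in_range = [freqs[i] for i in peaks if f0_range[0] <= freqs[i] <= f0_range[1]]
    return min(in_range) if in_range else None
```

For example, a spectrum with harmonic peaks near 220, 440 and 660 Hz yields an estimate of 220 Hz, since only the lowest peak falls inside the assumed vocal range.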
In one possible implementation manner, the performing, for each audio frame to be detected, weighting processing on the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a weighted power spectrum includes:
constructing a weight coefficient function corresponding to each audio frame to be detected according to the voice fundamental frequency estimate corresponding to that audio frame, wherein the weight coefficient function represents the weight values corresponding to different frequency points, the waveform of the weight coefficient function has a plurality of peaks, and the peaks respectively correspond to positive integer multiples of the voice fundamental frequency estimate;
and, for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame to be detected by the weight coefficient function to obtain the weighted power spectrum.
In one possible implementation, the weight coefficient function is a trigonometric function.
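One way to realize such a trigonometric weight coefficient function is a raised cosine in `f / f0`, which peaks at every positive integer multiple of the fundamental estimate and dips to zero halfway between harmonics. The raised-cosine shape and the helper names below are assumptions; the patent only fixes that the function is trigonometric with peaks at multiples of the estimate.

```python
import math

def weight_function(freq, f0):
    """Raised-cosine weight: 1 at every integer multiple of f0,
    0 midway between harmonics (one plausible trigonometric choice)."""
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * freq / f0))

def apply_weighting(power_spectrum, freqs, f0):
    # Bin-by-bin product of the frame's power spectrum and the weight curve.
    return [p * weight_function(f, f0) for p, f in zip(power_spectrum, freqs)]
```

With this shape, energy sitting on the harmonic grid of the estimated fundamental is preserved, while energy between harmonics is attenuated, which is exactly what makes the subsequent power ratio informative about voicing.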
In one possible implementation manner, the determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum includes:
for each audio frame to be detected, determining the ratio of the total power of the weighted power spectrum to the total power of the unweighted power spectrum, and normalizing the ratio according to a preset ratio upper limit and ratio lower limit; the normalized ratio is used as the human voice existence probability corresponding to the audio frame to be detected.
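This ratio-and-normalization step can be sketched as follows; the 0.25 / 0.75 ratio bounds are invented for illustration, since the patent does not state the preset limits.

```python
def voice_presence_probability(power, weighted_power, r_lo=0.25, r_hi=0.75):
    """Normalised ratio of weighted to unweighted total power for one frame.

    r_lo / r_hi are illustrative preset ratio bounds (assumptions).
    """
    ratio = sum(weighted_power) / sum(power)
    # Linearly map [r_lo, r_hi] onto [0, 1] and clip at both ends.
    p = (ratio - r_lo) / (r_hi - r_lo)
    return min(1.0, max(0.0, p))
```

A frame whose energy sits mostly on the harmonic grid keeps most of its power after weighting, so the ratio is high and the probability approaches 1; broadband noise loses roughly half its power, giving a mid-range ratio.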
In a possible implementation manner, the detecting, according to the human voice existence probability corresponding to each audio frame to be detected, the human voice audio frames and the non-human voice audio frames among the audio frames to be detected of the target dry audio includes:
detecting the human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a human voice detection probability threshold;
and detecting the non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a non-human voice detection probability threshold.
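A sketch of the two-threshold classification; the 0.6 / 0.3 threshold values are invented for illustration, as the patent only requires two separate detection probability thresholds.

```python
def classify_frames(probs, voice_thr=0.6, nonvoice_thr=0.3):
    """Split frames into voice / non-voice index lists using two thresholds.

    Thresholds are illustrative (assumptions). Frames whose probability lies
    between the two thresholds are treated as undecided and ignored.
    """
    voiced = [i for i, p in enumerate(probs) if p >= voice_thr]
    nonvoiced = [i for i, p in enumerate(probs) if p <= nonvoice_thr]
    return voiced, nonvoiced
```

Using two thresholds rather than one leaves an ambiguous band in the middle, so borderline frames contaminate neither the signal estimate nor the noise estimate.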
In one possible implementation, the method further includes: according to the power spectrum corresponding to the non-human voice audio frame, determining a noise penalty parameter of the target dry voice frequency; determining a power penalty parameter of the target dry sound frequency according to the power spectrum corresponding to each audio frame to be detected;
the determining the audio quality information of the target dry audio according to the power spectrum corresponding to the voice audio frame and the power spectrum corresponding to the non-voice audio frame includes:
determining human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
And determining the audio quality information of the target dry audio according to the voice quality information, the noise penalty parameter and the power penalty parameter.
In a possible implementation manner, the determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames includes:
determining a signal-to-noise ratio estimate of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
determining the human voice clarity of the target dry audio according to the human voice existence probabilities corresponding to the human voice audio frames;
and determining the product of the signal-to-noise ratio estimate and the human voice clarity as the human voice quality information of the target dry audio.
In a possible implementation manner, the determining the signal-to-noise ratio estimate of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames includes:
determining a power mean value corresponding to each human voice audio frame and a power mean value corresponding to each non-human voice audio frame, wherein the power mean value is determined as the mean of the power values of all frequency points;
determining a first median value of the power mean values corresponding to all the human voice audio frames, and a second median value of the power mean values corresponding to all the non-human voice audio frames;
and calculating the signal-to-noise ratio estimate according to the ratio of the first median value to the second median value.
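The first and second median values can be turned into an SNR estimate as sketched below. Expressing the ratio in decibels is one plausible reading, since the text only says the estimate is calculated "according to the ratio"; the helper name is an assumption.

```python
import math
import statistics

def snr_estimate(voiced_frame_spectra, nonvoiced_frame_spectra):
    """SNR estimate from the median frame-power means of each class.

    Each spectrum is a list of per-bin power values; the frame power mean is
    the mean over bins, and the estimate is the ratio of the two class
    medians, expressed here in dB (an assumption).
    """
    mean = lambda s: sum(s) / len(s)
    first_median = statistics.median(mean(s) for s in voiced_frame_spectra)
    second_median = statistics.median(mean(s) for s in nonvoiced_frame_spectra)
    return 10.0 * math.log10(first_median / second_median)
```

Medians rather than means make the estimate robust to a few outlier frames (e.g. a cough or a clipped transient) in either class.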
In a possible implementation manner, the determining the human voice clarity of the target dry audio according to the human voice existence probabilities corresponding to the human voice audio frames includes:
determining the human voice clarity of the target dry audio from the median value of the human voice existence probabilities corresponding to all the human voice audio frames.
In one possible implementation, the method further includes:
if no human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the average value of the human voice existence probabilities corresponding to all the audio frames to be detected;
and determining the audio quality information of the target dry audio according to the average value of the human voice existence probabilities and the power penalty parameter.
In one possible implementation, the method further includes:
if no non-human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the median value of the human voice existence probabilities corresponding to all the audio frames to be detected;
and determining the audio quality information of the target dry audio according to the median value of the human voice existence probabilities and the power penalty parameter.
In a possible implementation manner, the determining the noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frame includes:
determining a power mean value corresponding to each non-human voice audio frame, wherein the power mean value is determined as the mean of the power values of all frequency points;
determining the median value of the power mean values corresponding to all the non-human voice audio frames;
And determining a noise penalty parameter of the target dry audio according to the median value of the power mean value, wherein the noise penalty parameter is inversely related to the median value of the power mean value.
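A minimal sketch of such a noise penalty: the patent only fixes that the penalty is inversely related to the median noise power, so the `1 / (1 + scale * median)` curve and the `scale` constant below are assumptions.

```python
import statistics

def noise_penalty(nonvoiced_frame_spectra, scale=1.0):
    """Noise penalty in (0, 1], decreasing as the median noise power grows.

    The 1/(1 + scale*median) form and 'scale' are illustrative assumptions;
    only the inverse relation is required by the text.
    """
    mean = lambda s: sum(s) / len(s)
    med = statistics.median(mean(s) for s in nonvoiced_frame_spectra)
    return 1.0 / (1.0 + scale * med)
```

The penalty then multiplies the voice quality information, so noisier recordings score lower without any hard cutoff.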
In a possible implementation manner, the audio frame to be detected is an audio frame with a power average value greater than a mute power threshold value in the target dry audio, wherein the power average value is determined according to an average value of power values of all frequency points;
the determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected comprises the following steps:
determining a power average value corresponding to each audio frame to be detected, wherein the power average value is determined according to the average value of the power values of all the frequency points;
determining the average value of the power average values corresponding to all the audio frames to be detected, and obtaining the total power average value of the target dry audio;
and determining the power penalty parameter of the target dry audio according to the total power average value and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio.
In one possible implementation manner, the determining the power penalty parameter of the target dry audio according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and the total power average value includes:
determining a first power penalty subparameter according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold, wherein, when the ratio is smaller than or equal to the ratio threshold, the first power penalty subparameter is inversely related to the difference obtained by subtracting the ratio from the ratio threshold, and when the ratio is larger than the ratio threshold, the first power penalty subparameter is a fixed value;
determining a second power penalty subparameter and a third power penalty subparameter according to the total power average value and a preset power upper limit and power lower limit, wherein, when the total power average value is larger than or equal to the power upper limit, the second power penalty subparameter is inversely related to the difference obtained by subtracting the power upper limit from the total power average value; when the total power average value is smaller than or equal to the power lower limit, the third power penalty subparameter is inversely related to the difference obtained by subtracting the total power average value from the power lower limit; and when the total power average value is smaller than the power upper limit and larger than the power lower limit, the second power penalty subparameter and the third power penalty subparameter are both fixed values;
and determining the product of the first power penalty subparameter, the second power penalty subparameter and the third power penalty subparameter as the power penalty parameter of the target dry audio.
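The three sub-parameters above can be sketched as follows. The ratio threshold of 0.3, the power limits, and the `1 / (1 + x)` inverse-relation curves are all illustrative assumptions; the text only fixes the direction of each relation and that the penalty is their product.

```python
def power_penalty(frame_means, total_frames, ratio_thr=0.3,
                  p_hi=-10.0, p_lo=-40.0):
    """Product of the three power-penalty sub-parameters.

    frame_means: power mean of each frame above the mute threshold;
    ratio_thr, p_hi, p_lo (dB-like levels) and the 1/(1+x) curves are
    illustrative assumptions.
    """
    n = len(frame_means)                  # frames above the mute threshold
    total_mean = sum(frame_means) / n     # total power average value
    ratio = n / total_frames

    # Sub-parameter 1: penalise a small share of non-silent frames.
    s1 = 1.0 if ratio > ratio_thr else 1.0 / (1.0 + (ratio_thr - ratio))
    # Sub-parameter 2: penalise power above the upper limit (too loud).
    s2 = 1.0 if total_mean < p_hi else 1.0 / (1.0 + (total_mean - p_hi))
    # Sub-parameter 3: penalise power below the lower limit (too quiet).
    s3 = 1.0 if total_mean > p_lo else 1.0 / (1.0 + (p_lo - total_mean))
    return s1 * s2 * s3
```

A recording whose average level sits between the two limits and that is mostly non-silent incurs no penalty (product 1), while one that is too loud, too quiet, or mostly silent is scaled down.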
In a second aspect, there is provided an apparatus for detecting audio quality, the apparatus comprising:
a determining module, configured to determine, according to a power spectrum corresponding to each audio frame to be detected of a target dry audio, a voice fundamental frequency estimate corresponding to each audio frame to be detected, wherein the power spectrum comprises the power values of all frequency points;
a weighting module, configured to perform, for each audio frame to be detected, weighting processing on the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a weighted power spectrum, wherein the weights of the frequency points at positive integer multiples of the voice fundamental frequency estimate corresponding to the audio frame to be detected are larger than the weights of the other frequency points;
a probability module, configured to determine the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum;
a detection module, configured to detect the human voice audio frames and the non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
and a quality module, configured to determine the audio quality information of the target dry audio according to the power spectra corresponding to the human voice audio frames and to the non-human voice audio frames.
In one possible implementation manner, the determining module is configured to:
and determining a voice fundamental frequency estimated value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset voice frequency characteristic information.
In one possible implementation manner, the voice frequency characteristic information is a voice fundamental frequency range, and the determining module is configured to:
smoothing the power spectrum corresponding to each audio frame to be detected by the preset window length;
and respectively determining the minimum peak frequency of the smoothed power spectrum corresponding to each audio frame to be detected in the voice fundamental frequency range as a voice fundamental frequency estimated value corresponding to each audio frame to be detected.
In one possible implementation manner, the weighting module is configured to:
construct a weight coefficient function corresponding to each audio frame to be detected according to the voice fundamental frequency estimate corresponding to that audio frame, wherein the weight coefficient function represents the weight values corresponding to different frequency points, the waveform of the weight coefficient function has a plurality of peaks, and the peaks respectively correspond to positive integer multiples of the voice fundamental frequency estimate;
and, for each audio frame to be detected, multiply the power spectrum corresponding to the audio frame to be detected by the weight coefficient function to obtain the weighted power spectrum.
In one possible implementation, the weight coefficient function is a trigonometric function.
In one possible implementation, the probability module is configured to:
for each audio frame to be detected, determine the ratio of the total power of the weighted power spectrum to the total power of the unweighted power spectrum, and normalize the ratio according to a preset ratio upper limit and ratio lower limit; the normalized ratio is used as the human voice existence probability corresponding to the audio frame to be detected.
In one possible implementation manner, the detection module is configured to:
detect the human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a human voice detection probability threshold;
and detect the non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a non-human voice detection probability threshold.
In one possible implementation, the quality module is further configured to: according to the power spectrum corresponding to the non-human voice audio frame, determining a noise penalty parameter of the target dry voice frequency; determining a power penalty parameter of the target dry sound frequency according to the power spectrum corresponding to each audio frame to be detected;
the quality module is used for:
determine human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
And determining the audio quality information of the target dry audio according to the voice quality information, the noise penalty parameter and the power penalty parameter.
In one possible implementation, the quality module is configured to:
determine a signal-to-noise ratio estimate of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
determine the human voice clarity of the target dry audio according to the human voice existence probabilities corresponding to the human voice audio frames;
and determine the product of the signal-to-noise ratio estimate and the human voice clarity as the human voice quality information of the target dry audio.
In one possible implementation, the quality module is configured to:
determine a power mean value corresponding to each human voice audio frame and a power mean value corresponding to each non-human voice audio frame, wherein the power mean value is determined as the mean of the power values of all frequency points;
determine a first median value of the power mean values corresponding to all the human voice audio frames, and a second median value of the power mean values corresponding to all the non-human voice audio frames;
and calculate the signal-to-noise ratio estimate according to the ratio of the first median value to the second median value.
In one possible implementation, the quality module is configured to:
determine the human voice clarity of the target dry audio from the median value of the human voice existence probabilities corresponding to all the human voice audio frames.
In one possible implementation, the quality module is further configured to:
if no human voice audio frame is detected, determine a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determine the average value of the human voice existence probabilities corresponding to all the audio frames to be detected;
and determine the audio quality information of the target dry audio according to the average value of the human voice existence probabilities and the power penalty parameter.
In one possible implementation, the quality module is further configured to:
if no non-human voice audio frame is detected, determine a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determine the median value of the human voice existence probabilities corresponding to all the audio frames to be detected;
and determine the audio quality information of the target dry audio according to the median value of the human voice existence probabilities and the power penalty parameter.
In one possible implementation, the quality module is configured to:
determine a power mean value corresponding to each non-human voice audio frame, wherein the power mean value is determined as the mean of the power values of all frequency points;
determining the median value of the power mean values corresponding to all the non-human voice audio frames;
And determining a noise penalty parameter of the target dry audio according to the median value of the power mean value, wherein the noise penalty parameter is inversely related to the median value of the power mean value.
In a possible implementation manner, the audio frame to be detected is an audio frame with a power average value greater than a mute power threshold value in the target dry audio, wherein the power average value is determined according to an average value of power values of all frequency points;
the quality module is used for:
determining a power average value corresponding to each audio frame to be detected, wherein the power average value is determined according to the average value of the power values of all the frequency points;
determining the average value of the power average values corresponding to all the audio frames to be detected, and obtaining the total power average value of the target dry audio;
and determine the power penalty parameter of the target dry audio according to the total power average value and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio.
In one possible implementation, the quality module is configured to:
Determining a first power penalty subparameter according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold, wherein the first power penalty subparameter is inversely related to the difference of the ratio threshold minus the ratio when the ratio is smaller than or equal to the ratio threshold, and the first power penalty subparameter is a fixed value when the ratio is larger than the ratio threshold;
Determining a second power penalty subparameter and a third power penalty subparameter according to the total power average value and a preset power upper limit and power lower limit, wherein the second power penalty subparameter is inversely related to the difference of the total power average value minus the power upper limit when the total power average value is larger than or equal to the power upper limit, the third power penalty subparameter is inversely related to the difference of the power lower limit minus the total power average value when the total power average value is smaller than or equal to the power lower limit, and the second power penalty subparameter and the third power penalty subparameter are both fixed values when the total power average value is smaller than the power upper limit and larger than the power lower limit;
And determining the product of the first power penalty subparameter, the second power penalty subparameter and the third power penalty subparameter as the power penalty parameter of the target dry audio.
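As a sketch of how the three sub-parameters described above could combine, the function below multiplies them together; the ratio threshold, the dB limits, and the linear roll-off slope are illustrative assumptions, not values from the patent:

```python
def power_penalty(ratio, mean_power_db, ratio_thr=0.3,
                  upper_db=-6.0, lower_db=-40.0, slope_db=10.0):
    """Product of three sub-penalties (illustrative constants).

    s1 penalizes a low share of frames above the mute threshold,
    s2 penalizes total power above the upper limit (too loud),
    s3 penalizes total power below the lower limit (too quiet).
    """
    # First sub-parameter: inversely related to (ratio_thr - ratio)
    # when ratio <= ratio_thr, fixed (1.0) otherwise.
    s1 = 1.0 if ratio > ratio_thr else \
        max(0.0, 1.0 - (ratio_thr - ratio) / ratio_thr)
    # Second sub-parameter: inversely related to (power - upper limit).
    s2 = 1.0 if mean_power_db < upper_db else \
        max(0.0, 1.0 - (mean_power_db - upper_db) / slope_db)
    # Third sub-parameter: inversely related to (lower limit - power).
    s3 = 1.0 if mean_power_db > lower_db else \
        max(0.0, 1.0 - (lower_db - mean_power_db) / slope_db)
    return s1 * s2 * s3
```

A total power well inside the limits with an adequate frame ratio yields a penalty of 1 (no deduction), and each violated condition scales the score down.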
In a third aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to implement the operations performed by the method of detecting audio quality of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the operations performed by the method of detecting audio quality of the first aspect.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
According to the embodiment of the application, the human voice existence probability of each audio frame is identified through the power spectrum of the audio frames in the dry audio, so that human voice audio frames and non-human voice audio frames are distinguished, and the audio quality information of the dry audio is determined based on the power spectrums of the human voice audio frames and the non-human voice audio frames. Because high-quality dry audio is near silence in its non-human voice parts, the audio quality of the dry audio can be judged more accurately based on the power conditions of the human voice audio frames and the non-human voice audio frames.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for detecting audio quality according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for detecting audio quality according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for determining the existence probability of a voice according to an embodiment of the present application;
FIG. 4 is a schematic waveform diagram of a power spectrum provided by an embodiment of the present application;
FIG. 5 is a schematic waveform diagram of a power spectrum provided by an embodiment of the present application;
FIG. 6 is a schematic waveform diagram of a power spectrum provided by an embodiment of the present application;
FIG. 7 is a schematic waveform diagram of a power spectrum provided by an embodiment of the present application;
FIG. 8 is a flow chart of a method for detecting audio quality according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an apparatus for detecting audio quality according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The method for detecting audio quality provided by the embodiment of the application can be implemented by a server. The server may be a background server of an application program with an audio recording function, such as a karaoke application, a video application, or a recording application. The server may be a single server or a server group. If it is a single server, it may be responsible for all the processes in the following schemes; if it is a server group, different servers in the group may be responsible for different processes, and the specific process allocation may be set by technicians according to actual requirements, which will not be described herein.
The server may include a processor, memory, and communication components. The processor is respectively connected with the memory and the communication component.
The processor may be a CPU (Central Processing Unit).
The memory may include ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic disks, optical data storage devices, and the like. The memory may be used to store data that needs to be pre-stored, intermediate data, result data, and the like in the audio quality detection process, such as dry audio, various penalty parameters, and audio quality information.
The communication component may be a wired network connector, a WiFi (Wireless Fidelity) module, a Bluetooth module, a cellular network communication module, etc. The communication component may be used for data transmission with other devices, which may be other servers, terminals, etc. For example, the communication component may receive dry audio transmitted by a terminal.
Fig. 1 is a flowchart of a method for detecting audio quality according to an embodiment of the present application. Referring to fig. 1, this embodiment includes:
101, according to the power spectrum corresponding to each audio frame to be detected of the target dry audio, determining the voice fundamental frequency estimated value corresponding to each audio frame to be detected.
The power spectrum comprises power values of all frequency points.
102, For each audio frame to be detected, weighting the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a weighted power spectrum.
The weight of the frequency points at positive integer multiples of the voice fundamental frequency estimated value corresponding to the audio frame to be detected is larger than that of other frequency points.
And 103, determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum.
104, Detecting the human voice audio frames and the non-human voice audio frames in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected.
And 105, determining the audio quality information of the target dry audio according to the power spectrum corresponding to the voice audio frame and the power spectrum corresponding to the non-voice audio frame.
According to the embodiment of the application, the human voice existence probability of each audio frame is identified through the power spectrum of the audio frames in the dry audio, so that human voice audio frames and non-human voice audio frames are distinguished, and the audio quality information of the dry audio is determined based on the power spectrums of the human voice audio frames and the non-human voice audio frames. Because high-quality dry audio is near silence in its non-human voice parts, the audio quality of the dry audio can be judged more accurately based on the power conditions of the human voice audio frames and the non-human voice audio frames.
Fig. 2 is a flowchart of a method for detecting audio quality according to an embodiment of the present application. Referring to fig. 2, this embodiment includes:
and 201, acquiring the power spectrum corresponding to each audio frame to be detected of the target dry audio.
The power spectrum comprises power values of all frequency points. The selection method of the audio frame to be detected can be various, for example, selecting according to fixed intervals, or selecting the audio frame meeting certain power requirements, etc.
In practice, during audio recording by a terminal, the typical audio data sampling rate is 44.1 kHz (Android) or 48 kHz (iOS). The collected audio data is generally downsampled to 16 kHz so that subsequent processing requires less computation. Downsampling may be performed using the open source tool libresample, which is the most friendly to human voice audio. The corresponding dry audio is obtained after downsampling.
The server stores a large amount of dry audio, and for any dry audio (i.e., target dry audio), the power spectrum of each audio frame can be calculated as follows:
(1) Framing
The audio data of each audio frame may be represented as x_n(i) = x(n·M + i).
Wherein n denotes the n-th frame of audio data, M denotes the frame shift, i.e. the number of samples by which the next frame is shifted relative to the previous frame, and i denotes the index of the sample points within the n-th frame, with i ranging over 0, 1, 2, …, L−1, where L denotes the frame length, i.e. the number of samples in one audio frame. Here, the duration corresponding to M may be t_frmhop = 0.01 s (seconds), where t_frmhop may be referred to as the frame interval duration, and the duration corresponding to L may be 0.03 s.
(2) Windowing:
the calculation formula of the windowing process can be xw_n(i) = x_n(i)·w(i).
Where w(i) denotes a window function; a hanning window may be used, expressed as w(i) = 0.5 − 0.5·cos(2πi/(L−1)), i = 0, 1, …, L−1.
(3) Discrete fourier transform:
The fourier transform result of the n-th frame audio data xw_n(i) is X(n,k) = Σ_{i=0}^{N−1} xw_n(i)·e^(−j2πki/N), k = 0, 1, …, N−1.
Where N denotes the number of points of the fourier transform; L and N may be set equal.
(4) Calculating a power spectrum:
P(n,k) = |X(n,k)|², n = 0, 1, …, N_raw − 1, where N_raw denotes the total number of frames of the current signal after the STFT (Short Time Fourier Transform).
Where k denotes the frequency point index and P(n,k) denotes the power spectrum value of the k-th frequency point of the n-th frame.
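The framing, windowing, and power-spectrum steps above can be sketched in Python; a naive O(N²) DFT is used for clarity, where a real implementation would use an FFT:

```python
import math

def frame_signal(x, frame_len, hop):
    """Framing: x_n(i) = x(n*M + i), with M = hop and L = frame_len."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return [x[n * hop : n * hop + frame_len] for n in range(n_frames)]

def hanning(L):
    """Hanning window w(i) = 0.5 - 0.5*cos(2*pi*i / (L - 1))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (L - 1)) for i in range(L)]

def power_spectrum(frame):
    """P(k) = |X(k)|^2 via a naive DFT (illustration only)."""
    N = len(frame)
    ps = []
    for k in range(N):
        re = sum(frame[i] * math.cos(2 * math.pi * k * i / N) for i in range(N))
        im = -sum(frame[i] * math.sin(2 * math.pi * k * i / N) for i in range(N))
        ps.append(re * re + im * im)
    return ps

def frame_power_spectra(x, frame_len, hop):
    """Power spectrum of every Hanning-windowed frame."""
    w = hanning(frame_len)
    return [power_spectrum([s * wi for s, wi in zip(f, w)])
            for f in frame_signal(x, frame_len, hop)]
```

With 16 kHz audio, frame_len = 480 (0.03 s) and hop = 160 (0.01 s) match the durations given above.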
After determining the power spectrum of each audio frame, the audio frames to be detected may be filtered out based on the power spectrum of each audio frame. The audio frame to be detected may be an audio frame in the target dry audio whose power average value is greater than a mute power threshold, where the power average value is determined according to the average value of the power values of all frequency points. The selection process of the audio frames to be detected is described in detail below.
First, the average power sequence Pwr_avg(n) = (1/N_bins)·Σ_k P(n,k) is calculated, where N_bins denotes the number of frequency points. The 1/N_bins term in this formula may also be eliminated.
The minimum power P_min = 1e−10 is set as the mute decision threshold, and the effective power sequence Pwr(n) is obtained from the frames with Pwr_avg(n) > P_min.
The audio frame corresponding to each effective power value in Pwr(n) is an audio frame to be detected.
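The mute-threshold filtering of frames to be detected can be sketched as:

```python
def frames_to_detect(power_spectra, p_min=1e-10):
    """Return indices of frames whose mean power over all frequency
    points exceeds the mute decision threshold p_min."""
    return [n for n, ps in enumerate(power_spectra)
            if sum(ps) / len(ps) > p_min]
```

Frames at or below the threshold are treated as silence and excluded from all subsequent probability and penalty calculations.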
202, According to the power spectrum corresponding to each audio frame to be detected, determining the existence probability of the voice corresponding to each audio frame to be detected.
Specifically, a voice fundamental frequency estimated value corresponding to each audio frame to be detected is determined according to the power spectrum corresponding to each audio frame to be detected of the target dry audio. For each audio frame to be detected, the power value of each frequency point in its power spectrum is weighted to obtain a weighted power spectrum; the weight of the frequency points at positive integer multiples of the voice fundamental frequency estimated value is larger than that of other frequency points. The human voice existence probability of each audio frame to be detected is then determined according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum.
The process of determining the existence probability of the human voice may be performed as follows in the steps shown in fig. 3.
2021, Determining a voice fundamental frequency estimated value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset voice frequency characteristic information.
The voice frequency characteristic information is the human voice fundamental frequency range. The fundamental frequency of the human voice ranges from 40 Hz to 1500 Hz, so the minimum value of the fundamental frequency may be set as f_min = 40 Hz and the maximum value as f_max = 1500 Hz. The corresponding frequency points are expressed as k_min = f_min·N/f_s and k_max = f_max·N/f_s, wherein f_s is the bandwidth.
Firstly, smoothing the power spectrum corresponding to each audio frame to be detected by a preset window length.
The subsequent step detects the peak at the fundamental frequency; the smoothing filters out small noise peaks distributed over the main peaks at the fundamental frequency and its multiples.
The power spectrum may be smoothed using a triangular window convolution operation. A triangular smoothing kernel of length m + 1 points is constructed and normalized so that its coefficients sum to 1; the smoothed power spectrum Ps(n,k) is then obtained by convolving the power spectrum with this kernel along the frequency axis.
and then, respectively determining the minimum peak frequency of the smoothed power spectrum corresponding to each audio frame to be detected in the voice fundamental frequency range as a voice fundamental frequency estimated value corresponding to each audio frame to be detected.
The peak positions k_peak of the smoothed power spectrum corresponding to each audio frame to be detected are then searched for, where a peak is a local maximum of Ps(n,k).
Among all k_peak, those falling within the range [k_min, k_max] are found; if there are a plurality of k_peak in the range, the smallest one is denoted as k_f0, and the corresponding frequency value is taken as the estimated value of the human voice fundamental frequency. The fundamental frequency can be obtained using the frequency resolution, i.e. the fundamental frequency can be expressed as f_0 = k_f0·f_s/N.
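The smoothing-and-peak-search procedure might look as follows; the triangular kernel length is left as a parameter, since the patent's exact choice of window length is not reproduced above:

```python
def triangular_smooth(ps, m):
    """Convolve the power spectrum with a normalized triangular
    window of m + 1 points (m even)."""
    kernel = [min(i + 1, m + 1 - i) for i in range(m + 1)]
    total = sum(kernel)
    kernel = [c / total for c in kernel]  # normalize to sum 1
    half = m // 2
    out = []
    for k in range(len(ps)):
        acc = 0.0
        for j, c in enumerate(kernel):
            idx = k + j - half
            if 0 <= idx < len(ps):
                acc += c * ps[idx]
        out.append(acc)
    return out

def estimate_f0(ps_smooth, fs, n_fft, f_min=40.0, f_max=1500.0):
    """Smallest local-maximum bin k_f0 inside the voice fundamental
    range; f0 = k_f0 * fs / n_fft (frequency resolution fs / n_fft)."""
    k_lo = max(1, int(f_min * n_fft / fs))
    k_hi = min(len(ps_smooth) - 2, int(f_max * n_fft / fs))
    for k in range(k_lo, k_hi + 1):
        if ps_smooth[k] > ps_smooth[k - 1] and ps_smooth[k] > ps_smooth[k + 1]:
            return k * fs / n_fft
    return None
```

Returning the smallest qualifying peak matches the rule above of taking the smallest k_peak in the range as k_f0.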
2022, constructing a weight coefficient function corresponding to each audio frame to be detected according to the voice fundamental frequency estimated value corresponding to each audio frame to be detected.
The weight coefficient function is used for representing weights corresponding to different frequency points, and the waveform of the weight coefficient function has a plurality of wave crests which respectively correspond to positive integer multiples of the estimated value of the fundamental frequency of the human voice.
Alternatively, the weight coefficient function may be a trigonometric function.
One possible continuous form, and the corresponding discrete expression over the frequency points, can be constructed so that the function attains its maxima exactly at positive integer multiples of the estimated fundamental frequency.
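Since the exact expression is not reproduced above, the sketch below uses one hypothetical raised-cosine form with the stated property: weight 1 at integer multiples of the fundamental bin k_f0 and minimum weight halfway between them.

```python
import math

def weight_coeff(k, k_f0):
    """Hypothetical weight: peaks of 1 at k = m * k_f0, troughs of 0
    halfway between multiples of the fundamental bin k_f0."""
    return 0.5 + 0.5 * math.cos(2 * math.pi * k / k_f0)

def weight_function(n_bins, k_f0):
    """Discrete weight coefficient sequence W(k), k = 0 .. n_bins - 1."""
    return [weight_coeff(k, k_f0) for k in range(n_bins)]
```

Any periodic trigonometric function with period k_f0 and maxima at the multiples of k_f0 would satisfy the description in the text.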
2023, determining the existence probability of the voice of each audio frame to be detected according to the power spectrum and the weight coefficient function corresponding to each audio frame to be detected.
In implementation, for each audio frame to be detected, the power spectrum corresponding to the audio frame is multiplied by the weight coefficient function to obtain a weighted power spectrum. The ratio of the total power of the weighted power spectrum to the total power of the unweighted power spectrum is determined, and this ratio is normalized according to a preset ratio upper limit and ratio lower limit. The normalized ratio is used as the human voice existence probability corresponding to the audio frame to be detected.
The original power spectrum is defined as P_0(n,k) = Ps(n,k), and the weighted power spectrum is expressed as P_1(n,k) = Ps(n,k)·W_sin(k). The initial human voice presence parameter is calculated as prob_v(n) = Σ_{k=0}^{K} P_1(n,k) / Σ_{k=0}^{K} P_0(n,k), wherein K represents the frequency index corresponding to the maximum bandwidth.
The lower limit of the ratio may be p_L = 0.2 and the upper limit p_U = 0.8; the human voice existence probability is obtained by normalizing the presence parameter as prob(n) = (p(n) − p_L)/(p_U − p_L),
where p(n) = max(p_L, min(p_U, prob_v(n))).
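Combining the ratio, clipping, and normalization steps (with p_L = 0.2 and p_U = 0.8 as above):

```python
def voice_presence_prob(ps, weights, p_l=0.2, p_u=0.8):
    """Weighted-to-raw total power ratio, clipped to [p_l, p_u] and
    rescaled to [0, 1] as the human voice existence probability."""
    p0 = sum(ps)                                 # total raw power
    if p0 <= 0:
        return 0.0
    p1 = sum(p * w for p, w in zip(ps, weights))  # total weighted power
    clipped = max(p_l, min(p_u, p1 / p0))
    return (clipped - p_l) / (p_u - p_l)
```

A ratio at or above the upper limit maps to probability 1 (voice), at or below the lower limit to 0 (non-voice).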
Based on the weight coefficient function constructed above, the following can be observed. If the audio frame is a human voice audio frame, Ps(n,k) and W_sin(k) can be as shown in fig. 4. In a human voice audio frame, the peaks appear at the fundamental frequency and at its multiples (integer multiples of the fundamental frequency), so the peaks are uniformly distributed; with the above construction of the weight coefficient function, the peaks of W_sin(k) also appear at the fundamental frequency and its multiples, so the peak and trough positions of Ps(n,k) and W_sin(k) correspond. The product P_1(n,k) can be as shown in fig. 5: the effect of W_sin(k) is to amplify the peaks and reduce the valleys of Ps(n,k), so although the total power is reduced, the reduction is small. If the audio frame is a non-human voice audio frame, Ps(n,k) and W_sin(k) can be as shown in fig. 6. In a non-human voice audio frame the peaks are not uniformly distributed, while the peaks of W_sin(k) are, so many peaks and valleys of Ps(n,k) and W_sin(k) do not correspond; the product P_1(n,k) can be as shown in fig. 7, and the reduction of the total power of Ps(n,k) by W_sin(k) is relatively large. Therefore, based on these characteristics, the normalized ratio above can reflect the human voice existence probability of the audio frame. For ease of viewing, the four function diagrams are drawn as continuous rather than discrete functions.
And 203, detecting the human voice audio frames and the non-human voice audio frames in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected.
There are various methods for detecting human voice audio frames and non-human voice audio frames based on the human voice existence probability. For example, a threshold may be set: an audio frame to be detected is determined to be a human voice audio frame if its human voice existence probability is greater than the threshold, and a non-human voice audio frame if it is less than the threshold.
Or may also be detected as follows: detecting a voice audio frame in the to-be-detected audio frames of the target dry voice frequency according to the voice existence probability and the voice detection probability threshold value corresponding to each to-be-detected audio frame; and detecting the non-human voice audio frame in the audio frames to be detected of the target dry audio according to the human voice existence probability and the non-human voice detection probability threshold value corresponding to each audio frame to be detected.
The method divides the detection of the voice audio frames and the detection of the non-voice audio frames into two processes, wherein each detection process can adopt one threshold value or two threshold values, and the detection method adopting the two threshold values is given below.
The process of detecting the voice audio frame: (wherein the threshold of the human voice detection probability comprises a first threshold and a second threshold, the first threshold being greater than the second threshold)
And acquiring the existence probability of the voice corresponding to the audio frames to be detected one by one according to the time sequence from first to last.
When the acquired first human voice existence probability is larger than a first threshold value and the human voice existence probability which is smaller than a second threshold value or larger than the first threshold value and is acquired before the first human voice existence probability does not exist, determining an audio frame to be detected corresponding to the first human voice existence probability as a human voice starting audio frame.
After the acquired second voice existence probability is smaller than a second threshold value and the voice existence probability which is smaller than the second threshold value or larger than the first threshold value and is acquired before the second voice existence probability does not exist, when the first preset number of voice existence probabilities which are continuously acquired are all larger than the first threshold value, determining the audio frame to be detected which corresponds to the second voice existence probability as a voice starting audio frame, wherein the second voice existence probability is the voice existence probability which is acquired first in the first preset number of voice existence probabilities.
And each time after the voice starting audio frame is determined, when the existence probabilities of the second preset number of voice which are continuously acquired are smaller than a second threshold value, determining the audio frame to be detected corresponding to the third voice existence probability as a voice ending audio frame, wherein the third voice existence probability is the voice existence probability acquired before the voice existence probability acquired first in the second preset number of voice existence probabilities.
And each time after a voice ending audio frame is determined, when the first preset number of continuously acquired voice existence probabilities are all larger than the first threshold value, determining the audio frame to be detected corresponding to the fourth voice existence probability as a voice starting audio frame, wherein the fourth voice existence probability is the voice existence probability acquired first among the first preset number of voice existence probabilities.
And determining the voice audio frames in all the audio frames to be detected according to the determined voice start audio frame and voice end audio frame.
The process of detecting the non-human voice audio frame: (wherein the non-human voice detection probability threshold includes a third threshold and a fourth threshold, the third threshold being less than the fourth threshold)
And acquiring the existence probability of the voice corresponding to the audio frames to be detected one by one according to the time sequence from first to last.
And when the obtained existence probability of the fifth voice is smaller than the third threshold value and the existence probability of the voice which is larger than the fourth threshold value or smaller than the third threshold value and is obtained before the existence probability of the fifth voice does not exist, determining the audio frame to be detected corresponding to the existence probability of the fifth voice as a non-voice starting audio frame.
After the obtained sixth human voice existence probability is greater than the fourth threshold value and the human voice existence probability which is greater than the fourth threshold value or smaller than the third threshold value and is obtained before the sixth human voice existence probability does not exist, determining an audio frame to be detected corresponding to the sixth human voice existence probability as a non-human voice starting audio frame when the third preset number of human voice existence probabilities which are continuously obtained are smaller than the third threshold value, wherein the sixth human voice existence probability is the human voice existence probability which is obtained first in the third preset number of human voice existence probabilities.
And each time after the non-human voice starting audio frame is determined, when the existence probabilities of the fourth preset number of human voices which are continuously acquired are all larger than a fourth threshold value, determining the audio frame to be detected corresponding to the seventh human voice existence probability as a non-human voice ending audio frame, wherein the seventh human voice existence probability is the human voice existence probability acquired before the human voice existence probability acquired first in the fourth preset number of human voice existence probabilities.
And each time after a non-human voice ending audio frame is determined, when the third preset number of continuously acquired human voice existence probabilities are all smaller than the third threshold value, determining the audio frame to be detected corresponding to the eighth human voice existence probability as a non-human voice starting audio frame, wherein the eighth human voice existence probability is the human voice existence probability acquired first among the third preset number of human voice existence probabilities.
And determining the non-human voice audio frames in all the audio frames to be detected according to the determined non-human voice starting audio frames and the non-human voice ending audio frames.
The first threshold and the fourth threshold may be equal, and may be set to 0.6, and the second threshold and the third threshold may be equal, and may be set to 0.5.
The shortest mute duration within a voice segment can be set, and its corresponding frame number is the second preset number. The shortest human voice duration (anything shorter is considered short-term noise) can be set, and its corresponding frame number is the first preset number. The longest duration of occasional abnormal sounds in silence can be set, and its corresponding frame number is the fourth preset number. The shortest mute duration (anything longer is considered entering the mute region) can be set, and its corresponding frame number is the third preset number.
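The two detection processes above can be condensed into a simplified hysteresis sketch: a single state machine with the two thresholds (0.6 and 0.5 as above) and minimum run lengths. Unlike the patent's procedure, this sketch flags frames only once a run completes rather than backdating the segment start to the first frame of the run:

```python
def detect_voice_frames(probs, hi=0.6, lo=0.5, min_voice=3, min_sil=3):
    """Double-threshold hysteresis: enter the voice state after
    min_voice consecutive probabilities > hi, leave it after min_sil
    consecutive probabilities < lo. Returns one bool per frame."""
    out = []
    in_voice = False
    run = 0  # length of the current qualifying run
    for p in probs:
        if in_voice:
            run = run + 1 if p < lo else 0
            if run >= min_sil:      # sustained silence ends the segment
                in_voice, run = False, 0
        else:
            run = run + 1 if p > hi else 0
            if run >= min_voice:    # sustained voice starts a segment
                in_voice, run = True, 0
        out.append(in_voice)
    return out
```

Using two thresholds with minimum run lengths prevents brief probability spikes or dips from toggling the voice/non-voice decision.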
Before the detection processing, the human voice existence probability sequence can be smoothed to realize denoising, and the detection is then performed. A spline curve S_b(m) may be used for the smoothing to obtain a smoothed probability sequence, where M is the smoothing window length, which may be 30.
The set of detected human voice audio frames described above may be referred to as Seg voc and the set of detected non-human voice audio frames may be referred to as Seg sil.
204, Determining the signal-to-noise ratio estimated value of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames.
Firstly, determining a power average value corresponding to each voice audio frame and a power average value corresponding to each non-voice audio frame, wherein the power average value is determined according to the average value of the power values of all the frequency points.
Then, a first median of the power averages corresponding to all the voice audio frames is determined, and a second median of the power averages corresponding to all the non-voice audio frames is determined.
And finally, calculating a signal-to-noise ratio estimated value according to the ratio of the first median value to the second median value.
Specifically, the log power spectrum statistic of the voice segment and that of the non-human voice segment can be calculated as the medians of the per-frame log power values, where the logarithm base x may be set as required and may be 10 or e, and Q_1/2(·) denotes the median operation, i.e. after the current sequence is sorted, the value ranked in the middle is taken as the final output.
The signal-to-noise ratio parameter is calculated as the difference between the two statistics. Upper and lower limits of the signal-to-noise ratio are set, and the normalized signal-to-noise ratio parameter is further calculated using a correction function g_1(·) with a sharpening effect; r_snr can be taken as the signal-to-noise ratio estimated value.
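A sketch of the median-log-power SNR estimate follows; the SNR limits and the replacement of the sharpening function g_1 by plain linear scaling are assumptions for illustration:

```python
import math
from statistics import median

def snr_estimate(voice_powers, nonvoice_powers, snr_lo=0.0, snr_hi=40.0):
    """Median log-power of voice frames minus that of non-voice frames
    (base 10 here), clipped to [snr_lo, snr_hi] and scaled to [0, 1].
    The patent's sharpening function g1 is not reproduced; linear
    scaling is used instead."""
    mv = median(10 * math.log10(p) for p in voice_powers)     # voice, dB
    ms = median(10 * math.log10(p) for p in nonvoice_powers)  # noise, dB
    snr = mv - ms
    return min(1.0, max(0.0, (snr - snr_lo) / (snr_hi - snr_lo)))
```

The medians make the estimate robust to a few outlier frames in either segment.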
Because the sound in the non-human voice audio frames can be considered noise in the dry audio, the signal-to-noise ratio of the dry audio can be approximated by the ratio of the power statistics of the human voice audio frames and the non-human voice audio frames described above.
205, According to the human voice existence probabilities corresponding to the human voice audio frames, determining the voice clarity of the target dry audio.
Specifically, the voice clarity of the target dry audio can be determined as the median value of the human voice existence probabilities corresponding to all human voice audio frames; that is, the greater the existence probabilities of the human voice audio frames, the higher the voice clarity.
The corresponding voice clarity can thus be expressed as the median Q_1/2(prob(n)) over the frames n in Seg_voc.
206, determining the product of the signal-to-noise ratio estimated value and the voice clarity as the voice quality information of the target dry audio.
Wherein the voice quality information may be regarded as the subject score in the audio quality information.
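Steps 204 to 206 combine as follows, with the voice clarity taken as the median presence probability over the detected voice frames:

```python
from statistics import median

def voice_quality(snr_est, voice_frame_probs):
    """Subject score: the SNR estimate times the voice clarity, where
    clarity is the median human voice existence probability over the
    detected voice frames (Seg_voc)."""
    clarity = median(voice_frame_probs)
    return snr_est * clarity
```

Both factors lie in [0, 1], so the subject score does as well.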
207, Determining the noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames.
The noise penalty parameter is determined by the noise intensity: the larger the noise, the heavier the penalty. The parameter may be a value in the range of 0 to 1. The noise referred to here may be of many kinds, such as common environmental noise.
The noise penalty calculation process may be as follows: first, the power average value corresponding to each non-human voice audio frame is determined, where the power average value is determined according to the average value of the power values of all frequency points. Then, the median value of the power average values corresponding to all non-human voice audio frames is determined. Finally, the noise penalty parameter of the target dry audio is determined according to the median of the power mean, wherein the noise penalty parameter is inversely related to the median of the power mean.
Upper and lower logarithmic power limits can be defined for the ambient noise.
When the non-human voice energy is too large (greater than the lower limit, i.e., non-negligible), the noise penalty parameter is calculated.
Here g(·) is a correction function with a sharpening effect.
When the noise penalty parameters are calculated, all the non-human voice audio frames can be divided into a plurality of sections, the noise penalty parameters are calculated for each section according to the mode, and then the largest noise penalty parameter is selected as the noise penalty parameter of the target dry voice audio from the noise penalty parameters corresponding to each section.
The segmentation can be based on a preset number of frames, or each run of consecutive non-human voice audio frames can be treated as one segment.
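The segment-and-take-maximum strategy above can be sketched as follows, with the per-segment penalty computation abstracted into a caller-supplied function (the names and the fixed-length segmentation variant are assumptions for illustration):

```python
# Hypothetical sketch: split the non-voice frames into fixed-length
# segments, score each segment with a caller-supplied penalty function,
# and keep the largest per-segment penalty as the track's noise penalty.
def segmented_noise_penalty(frames, segment_len, penalty_fn):
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames), segment_len)]
    return max(penalty_fn(seg) for seg in segments)

# Toy usage: with max() as the per-segment score, the loudest segment wins.
print(segmented_noise_penalty([1, 5, 2, 9], 2, max))
```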
And 208, determining a power penalty parameter of the target dry sound audio according to the power spectrum corresponding to each audio frame to be detected.
The specific process may be as follows:
first, a first power penalty subparameter is determined.
The audio frames to be detected may be the audio frames in the target dry audio whose power average value is greater than the mute power threshold. The first power penalty subparameter is determined according to the duty ratio of the number of audio frames to be detected in the total number of audio frames of the target dry audio and a preset duty ratio threshold: when the duty ratio is smaller than or equal to the duty ratio threshold, the first power penalty subparameter is inversely related to the difference of the duty ratio threshold minus the duty ratio; when the duty ratio is greater than the duty ratio threshold, the first power penalty subparameter is a fixed value.
The minimum effective audio length may be defined as Tmin = 1 s, and the corresponding minimum frame number Nfrmmin is calculated. The number of frames Na of the effective power sequence Pwr(n) is then obtained; if Na < Nfrmmin, the whole audio is regarded as having too low input energy (insufficient effective high-energy audio data), so the audio quality information can be directly set to 0.
If Na ≥ Nfrmmin, the effective-power frame duty ratio ra = Na/N can be further calculated, where N is the total number of audio frames.
If the duty ratio is too small, i.e., below a duty ratio threshold such as 0.1, the first power penalty subparameter is determined as βa = ra + 0.9; otherwise there is no penalty here, i.e., the first power penalty subparameter is βa = 1.
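The βa rule above (βa = ra + 0.9 below the 0.1 threshold, 1 otherwise) can be sketched directly; note that at ra = 0.1 the two branches meet at 1.0, so the penalty ramps smoothly. Function and parameter names are assumptions.

```python
# Hypothetical sketch of the first power penalty subparameter beta_a.
def first_power_penalty(n_effective, n_total, ratio_threshold=0.1):
    r_a = n_effective / n_total       # duty ratio of effective frames
    if r_a < ratio_threshold:
        return r_a + 0.9              # beta_a = r_a + 0.9, per the text
    return 1.0                        # no penalty above the threshold

print(first_power_penalty(5, 100))    # sparse effective frames -> penalized
print(first_power_penalty(50, 100))   # plenty of effective frames -> 1.0
```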
Then, a second power penalty sub-parameter and a third power penalty sub-parameter are determined.
The power average value corresponding to each audio frame to be detected is determined, where the power average value is the mean of the power values over all frequency points. The average of the power average values over all audio frames to be detected is then determined, giving the total power average value of the target dry audio. The second and third power penalty subparameters are determined according to the total power average value and preset power upper and lower limits: when the total power average value is greater than or equal to the power upper limit, the second power penalty subparameter is inversely related to the difference of the total power average value minus the power upper limit; when the total power average value is smaller than or equal to the power lower limit, the third power penalty subparameter is inversely related to the difference of the power lower limit minus the total power average value; and when the total power average value is smaller than the power upper limit and greater than the power lower limit, the second and third power penalty subparameters are both fixed numerical values.
The calculation process may be as follows:
the average power value for the entire audio is calculated as a logarithmic mean; the logarithm base x may be set as needed, for example 10 or e.
A power lower limit and a power upper limit are then set.
(1) Determination of excessive power
If the total power average value is greater than or equal to the power upper limit, the overall energy is considered too large and the following is done: a highest threshold is set, the probability of excessive power is computed and denoted rUextrem, and the second power penalty subparameter βU = 1 − rUextrem is calculated.
Otherwise, the second power penalty subparameter is βU = 1.
(2) Determination of insufficient power
If the total power average value is smaller than or equal to the power lower limit, the overall energy is considered too small and the following is done: a lowest threshold is set, the probability of insufficient power is computed and denoted rLextrem, and the third power penalty subparameter βL = 1 − rLextrem is calculated.
Otherwise, the third power penalty subparameter is βL = 1.
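The upper/lower-limit branching can be sketched as below. The patent does not give the exact form of the over-power and under-power probabilities rUextrem and rLextrem, so a simple clamped linear ramp between the limit and an extreme threshold is assumed here purely for illustration; all names and the `extreme_margin` parameter are assumptions.

```python
# Hypothetical sketch of the second/third power penalty subparameters.
def power_limit_penalties(total_mean, lower, upper, extreme_margin=10.0):
    beta_u = beta_l = 1.0
    if total_mean >= upper:
        # Assumed ramp: 0 at the upper limit, 1 at upper + extreme_margin.
        r_u_extrem = min((total_mean - upper) / extreme_margin, 1.0)
        beta_u = 1.0 - r_u_extrem     # beta_U = 1 - r_U_extrem
    elif total_mean <= lower:
        r_l_extrem = min((lower - total_mean) / extreme_margin, 1.0)
        beta_l = 1.0 - r_l_extrem     # beta_L = 1 - r_L_extrem
    return beta_u, beta_l

print(power_limit_penalties(-30.0, -60.0, -10.0))  # in range: no penalty
print(power_limit_penalties(-5.0, -60.0, -10.0))   # too loud: beta_U < 1
```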
And finally, determining the product of the first power penalty subparameter, the second power penalty subparameter and the third power penalty subparameter as a power penalty parameter beta W=βa·βU·βL of the target dry sound audio.
When calculating the power penalty parameter, a single one of the first, second and third power penalty subparameters, or the product of any two of them, may also be selected. The power penalty parameter is a value in the range of 0 to 1, and the first, second and third power penalty subparameters are all values in the range of 0 to 1.
Alternatively, power penalty subparameters other than the three described above may also be used when determining the power penalty parameter.
And 209, determining the audio quality information of the target dry audio according to the voice quality information, the noise penalty parameter and the power penalty parameter.
After the noise penalty parameter and the power penalty parameter are determined, they may be multiplied to obtain the final penalty parameter: β = βW·βbkn.
The audio quality information of the target dry audio may then be expressed as: Pclean = β·rsnr·rc.
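The final combination β = βW·βbkn followed by Pclean = β·rsnr·rc is a pair of products and can be sketched in a few lines (function and argument names are assumptions):

```python
# Hypothetical sketch of the final score combination from the text:
# beta = beta_W * beta_bkn, then P_clean = beta * r_snr * r_c.
def audio_quality(beta_w, beta_bkn, r_snr, r_c):
    beta = beta_w * beta_bkn          # power penalty x noise penalty
    return beta * r_snr * r_c         # penalized main score

print(audio_quality(1.0, 0.8, 0.9, 0.85))
```

Because every factor lies in [0, 1] (assuming rsnr is normalized), any single weak aspect — noise, power, SNR, or clarity — pulls the final score down multiplicatively.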
Alternatively, the scheme may directly use the human voice quality information as the audio quality information of the dry audio without considering the noise penalty parameter and the power penalty parameter.
Fig. 8 is a schematic diagram of the above-described audio quality information detection process.
In addition, in the detection process of the voice audio frame and the non-voice audio frame, two possible situations exist in the embodiment of the application, and the calculation process of the corresponding audio quality information can be as follows:
In case one, no vocal audio frame is detected
And determining a power penalty parameter of the target dry sound audio according to the power spectrum corresponding to each audio frame to be detected. And determining the average value of the existence probabilities of the voice corresponding to all the audio frames to be detected. And determining the audio quality information of the target dry audio according to the average value of the voice existence probability and the power penalty parameter. Since the audio distribution of the human voice is relatively stable (has a short-time stationary characteristic), a mean value with higher processing efficiency can be adopted. Of course, a median value may also be employed.
This situation may be taken to indicate poor overall sound quality: the user may not have sung, and the obtained audio data may be accompaniment or other noise. In this case, the power penalty parameter may be calculated in the above manner; in addition, the main score may be calculated based on the voice existence probabilities of the non-human voice audio frames (here all audio frames to be detected are non-human voice audio frames), and the power penalty parameter may be multiplied by the main score to obtain the audio quality information. The audio quality information calculation formula may be as follows:
in case two, no non-human voice audio frame is detected
The power penalty parameter of the target dry audio is determined according to the power spectrum corresponding to each audio frame to be detected. The median of the voice existence probabilities corresponding to all audio frames to be detected is determined, and the audio quality information of the target dry audio is determined according to this median and the power penalty parameter. Some frames may carry abnormally varying probability values; a median is used here to prevent individual extreme probability values from having too much influence on the final result.
This may be considered the case where the audio frames to be detected are all human voice audio frames; however, non-human voice components usually exist in an actual singing process, so there may be false detection in practice. In this case, the power penalty parameter may be calculated in the above manner; in addition, the main score may be calculated based on the voice existence probabilities of the audio frames to be detected (which are all treated as human voice audio frames), and the power penalty parameter may be multiplied by the main score to obtain the audio quality information. The audio quality information calculation formula may be as follows:
Pclean = βW·0.9·Q1/2(probs(n)).
Based on the above flow, audio quality information can be calculated for the user-uploaded dry audio stored in the database, and whether each dry audio needs to be deleted can then be determined based on its audio quality information. The specific deletion mechanism may be set arbitrarily according to requirements. For example, dry audio whose audio quality information is lower than a preset threshold may be deleted. As another example, for an account that has not logged in for more than a first preset time period and has not been accessed for more than a second preset time period, its dry audio may be further acquired and deleted if the audio quality information is lower than the preset threshold. As yet another example, the dry audio may be given a weighted score based on dimensions such as audio quality information, access amount and account activity, and dry audio whose score is lower than a preset score threshold may be deleted.
In addition, the audio quality information may be stored and used as a reference attribute when recommending dry audio. Specifically, the audio quality information and other attribute information of a dry audio may be input into a first feature extraction model to obtain feature information of the dry audio, and the account attributes of a target user account may be input into a second feature extraction model to obtain feature information of the user account. The feature information of the dry audio and that of the user account are then input into a scoring model to obtain a matching degree score between the dry audio and the user account, and the dry audio recommended to the user account is determined based on the matching degree scores of the dry audios.
According to the embodiment of the application, the voice existence probability of each audio frame in the dry audio is identified through its power spectrum, so that human voice audio frames and non-human voice audio frames are identified, and the audio quality information of the dry audio is determined based on the power spectrums of the human voice and non-human voice audio frames. Because high-quality dry audio is near silence in its non-human voice parts, the audio quality of the dry audio can be judged more accurately based on the power conditions of the human voice and non-human voice audio frames.
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
The embodiment of the application also provides a device for detecting audio quality, which can be applied to the server in the above embodiment, as shown in fig. 9, and includes:
A determining module 910, configured to determine, according to a power spectrum corresponding to each audio frame to be detected of the target dry audio, a voice fundamental frequency estimated value corresponding to each audio frame to be detected, where the power spectrum includes power values of frequency points;
the multiplying module 920 is configured to, for each audio frame to be detected, perform multiplying processing on the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a power spectrum after the multiplying processing, where the weight value of frequency points at positive integer multiples of the voice fundamental frequency estimated value corresponding to the audio frame to be detected is greater than the weight values of other frequency points;
The probability module 930 is configured to determine a human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the power spectrum after the multiplication processing;
The detection module 940 is configured to detect a human voice audio frame and a non-human voice audio frame in the audio frame to be detected of the target dry voice audio according to the human voice existence probability corresponding to each audio frame to be detected;
and the quality module 950 is configured to determine audio quality information of the target dry audio according to the power spectrum corresponding to the voice audio frame and the power spectrum corresponding to the non-voice audio frame.
In one possible implementation, the determining module 910 is configured to:
and determining a voice fundamental frequency estimated value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset voice frequency characteristic information.
In one possible implementation manner, the voice frequency characteristic information is a voice fundamental frequency range, and the determining module 910 is configured to:
smoothing the power spectrum corresponding to each audio frame to be detected by the preset window length;
and respectively determining the minimum peak frequency of the smoothed power spectrum corresponding to each audio frame to be detected in the voice fundamental frequency range as a voice fundamental frequency estimated value corresponding to each audio frame to be detected.
In one possible implementation, the multiplying module 920 is configured to:
Constructing a weight coefficient function corresponding to each audio frame to be detected according to the voice fundamental frequency estimated value corresponding to each audio frame to be detected, wherein the weight coefficient function is used for representing weight values corresponding to different frequency points, a plurality of wave peaks exist in the waveform of the weight coefficient function, and the wave peaks respectively correspond to positive integer multiples of the voice fundamental frequency estimated value;
And multiplying the power spectrum corresponding to the audio frame to be detected by a weight coefficient function for each audio frame to be detected to obtain the power spectrum after the weight processing.
In one possible implementation, the weight coefficient function is a trigonometric function.
In one possible implementation, the probability module 930 is configured to:
For each audio frame to be detected, determining the ratio of the total power of the power spectrum after the multiplication processing to the total power of the power spectrum without the multiplication processing, and carrying out normalization processing on the ratio according to the preset ratio upper limit and ratio lower limit to obtain a normalized ratio which is used as the voice existence probability corresponding to the audio frame to be detected.
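The ratio-then-normalize step above can be sketched as follows; the specific upper and lower ratio limits (here 2.0 and 1.0) are illustrative assumptions, as the patent only says they are preset. All names are hypothetical.

```python
# Hypothetical sketch of the voice existence probability: the ratio of
# weighted to unweighted total power, clamped to preset bounds and then
# rescaled into [0, 1].
def voice_probability(weighted_power, raw_power, lo=1.0, hi=2.0):
    ratio = weighted_power / raw_power
    ratio = min(max(ratio, lo), hi)       # clamp to [lo, hi]
    return (ratio - lo) / (hi - lo)       # normalize into [0, 1]

print(voice_probability(1.5, 1.0))        # mid-range ratio
print(voice_probability(3.0, 1.0))        # strongly harmonic frame
```

A frame whose power concentrates at multiples of the fundamental gains more from the weighting, so its ratio, and hence its probability, is higher.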
In one possible implementation, the detecting module 940 is configured to:
Detecting a voice audio frame in the to-be-detected audio frames of the target dry voice audio according to the voice existence probability and the voice detection probability threshold value corresponding to each to-be-detected audio frame;
And detecting a non-human voice audio frame in the audio frames to be detected of the target dry voice audio according to the human voice existence probability and the non-human voice detection probability threshold value corresponding to each audio frame to be detected.
In one possible implementation, the quality module 950 is further configured to: according to the power spectrum corresponding to the non-human voice audio frame, determining a noise penalty parameter of the target dry voice frequency; determining a power penalty parameter of the target dry sound frequency according to the power spectrum corresponding to each audio frame to be detected;
The quality module 950 is configured to:
According to the power spectrum corresponding to the voice audio frame and the power spectrum corresponding to the non-voice audio frame, determining voice quality information of the target dry voice frequency;
And determining the audio quality information of the target dry audio according to the voice quality information, the noise penalty parameter and the power penalty parameter.
In one possible implementation, the quality module 950 is configured to:
Determining a signal-to-noise ratio estimated value of the target dry sound audio according to the power spectrum corresponding to the human sound audio frame and the power spectrum corresponding to the non-human sound audio frame;
according to the voice existence probability corresponding to the voice audio frame, determining the voice definition of the target dry audio;
and determining the product of the signal-to-noise ratio estimated value and the voice definition as voice quality information of the target dry voice frequency.
In one possible implementation, the quality module 950 is configured to:
Determining a power average value corresponding to each human voice frequency frame and a power average value corresponding to each non-human voice frequency frame, wherein the power average value is determined according to the average value of the power values of all the frequency points;
Determining a first median value of the power mean values corresponding to all the voice audio frames, and determining a second median value of the power mean values corresponding to all the non-voice audio frames;
And calculating a signal-to-noise ratio estimated value according to the ratio of the first median value to the second median value.
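The median-ratio SNR estimate described by the three steps above can be sketched as follows; the patent says the estimate is calculated "according to" the ratio without fixing the exact mapping, so the raw ratio is returned here, and all names are assumptions.

```python
# Hypothetical sketch of the SNR estimate: median per-frame power mean
# of human-voice frames divided by that of non-human-voice frames.
from statistics import median

def snr_estimate(voice_spectra, nonvoice_spectra):
    voice_med = median(sum(s) / len(s) for s in voice_spectra)      # first median
    noise_med = median(sum(s) / len(s) for s in nonvoice_spectra)   # second median
    return voice_med / noise_med

print(snr_estimate([[4.0, 4.0]], [[2.0, 2.0]]))
```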
In one possible implementation, the quality module 950 is configured to:
And determining the voice definition of the target dry voice frequency by using the median value of the voice existence probabilities corresponding to all voice audio frames.
In one possible implementation, the quality module 950 is further configured to:
If no voice audio frame is detected, determining a power penalty parameter of the target dry audio according to a power spectrum corresponding to each audio frame to be detected; determining an average value of the existence probabilities of the voice corresponding to all the audio frames to be detected;
and determining the audio quality information of the target dry sound frequency according to the average value of the voice existence probability and the power penalty parameter.
In one possible implementation, the quality module 950 is further configured to:
If the non-human voice audio frames are not detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected; determining the median value of the existence probabilities of the voice corresponding to all the audio frames to be detected;
and determining the audio quality information of the target dry sound frequency according to the median value of the voice existence probability and the power penalty parameter.
In one possible implementation, the quality module 950 is configured to:
determining a power average value corresponding to each non-human-derived audio frame, wherein the power average value is determined according to the average value of the power values of all the frequency points;
determining the median value of the power mean values corresponding to all the non-human voice audio frames;
And determining a noise penalty parameter of the target dry audio according to the median value of the power mean value, wherein the noise penalty parameter is inversely related to the median value of the power mean value.
In a possible implementation manner, the audio frame to be detected is an audio frame with a power average value greater than a mute power threshold value in the target dry audio, wherein the power average value is determined according to an average value of power values of all frequency points;
The quality module 950 is configured to:
determining a power average value corresponding to each audio frame to be detected, wherein the power average value is determined according to the average value of the power values of all the frequency points;
determining the average value of the power average values corresponding to all the audio frames to be detected, and obtaining the total power average value of the target dry audio;
and determining a power penalty parameter of the target dry sound frequency according to the total power average value and the ratio of the number of the audio frames to be detected in the total number of the audio frames of the target dry sound frequency.
In one possible implementation, the quality module 950 is configured to:
Determining a first power penalty subparameter according to the ratio of the number of the audio frames to be detected in the total number of the audio frames of the target dry audio and a preset ratio threshold, wherein the first power penalty subparameter is inversely related to the difference value of the ratio threshold and the ratio subtracted from the ratio when the ratio is smaller than or equal to the ratio threshold, and the first power penalty subparameter is a fixed value when the ratio is larger than the ratio threshold;
Determining a second power penalty subparameter and a third power penalty subparameter according to the total power average value and a preset power upper limit and power lower limit, wherein when the total power average value is greater than or equal to the power upper limit, the second power penalty subparameter is inversely related to the difference of the total power average value minus the power upper limit; when the total power average value is smaller than or equal to the power lower limit, the third power penalty subparameter is inversely related to the difference of the power lower limit minus the total power average value; and when the total power average value is smaller than the power upper limit and greater than the power lower limit, the second power penalty subparameter and the third power penalty subparameter are both fixed numerical values;
And determining the product of the first power penalty subparameter, the second power penalty subparameter and the third power penalty subparameter as the power penalty parameter of the target dry audio.
According to the embodiment of the application, the voice existence probability of each audio frame in the dry audio is identified through its power spectrum, so that human voice audio frames and non-human voice audio frames are identified, and the audio quality information of the dry audio is determined based on the power spectrums of the human voice and non-human voice audio frames. Because high-quality dry audio is near silence in its non-human voice parts, the audio quality of the dry audio can be judged more accurately based on the power conditions of the human voice and non-human voice audio frames.
It should be noted that: the device for detecting audio quality provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the device is divided into different functional modules to perform all or part of the functions described above. In addition, the device for detecting audio quality and the method embodiment for detecting audio quality provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the device for detecting audio quality are detailed in the method embodiment, which is not described herein again.
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 1000 may have a relatively large difference due to different configurations or performances, and may include one or more processors 1001 and one or more memories 1002, where at least one instruction is stored in the memories 1002, and the at least one instruction is loaded and executed by the processors 1001 to implement the methods provided in the foregoing method embodiments. Of course, the server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer-readable storage medium is also provided, e.g., a memory comprising instructions executable by a processor in a terminal to perform the method of detecting audio quality in the above embodiment. The computer-readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM, magnetic tape, floppy disk, or optical data storage device.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.
Claims (14)
1. A method of detecting audio quality, the method comprising:
According to a power spectrum corresponding to each audio frame to be detected of the target dry audio, determining a voice fundamental frequency estimated value corresponding to each audio frame to be detected, wherein the power spectrum comprises power values of all frequency points;
for each audio frame to be detected, carrying out multiplying processing on the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a power spectrum after the multiplying processing, wherein the weight of the frequency point of positive integer multiple of the voice fundamental frequency estimated value corresponding to the audio frame to be detected is larger than the weight of other frequency points;
according to the power spectrum corresponding to each audio frame to be detected and the power spectrum after the multiplying power processing, determining the existence probability of the voice of each audio frame to be detected;
Detecting a human voice audio frame and a non-human voice audio frame in the audio frames to be detected of the target dry voice audio according to the human voice existence probability corresponding to each audio frame to be detected;
according to the power spectrum corresponding to the non-human voice audio frame, determining a noise penalty parameter of the target dry voice frequency;
Determining a power penalty parameter of the target dry sound frequency according to the power spectrum corresponding to each audio frame to be detected;
According to the power spectrum corresponding to the voice audio frame and the power spectrum corresponding to the non-voice audio frame, determining voice quality information of the target dry voice frequency;
Determining the audio quality information of the target dry audio according to the voice quality information, the noise penalty parameter and the power penalty parameter;
wherein the determining the noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frame includes:
determining a power average value corresponding to each non-human-derived audio frame, wherein the power average value is determined according to the average value of the power values of all the frequency points;
determining the median value of the power mean values corresponding to all the non-human voice audio frames;
Determining a noise penalty parameter of the target dry audio according to the median value of the power mean value, wherein the noise penalty parameter is inversely related to the median value of the power mean value;
the determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected comprises the following steps:
Determining a product of at least two of a first power penalty subparameter, a second power penalty subparameter and a third power penalty subparameter as the power penalty parameter of the target dry audio, or determining the first power penalty subparameter, the second power penalty subparameter or the third power penalty subparameter alone as the power penalty parameter of the target dry audio; wherein the first power penalty subparameter is determined according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold: when the ratio is smaller than or equal to the ratio threshold, the first power penalty subparameter is negatively correlated with the difference of the ratio threshold minus the ratio, and when the ratio is larger than the ratio threshold, the first power penalty subparameter is a fixed value; the second power penalty subparameter and the third power penalty subparameter are determined according to a total power average value, a preset power upper limit and a preset power lower limit: when the total power average value is larger than or equal to the power upper limit, the second power penalty subparameter is negatively correlated with the difference of the total power average value minus the power upper limit; when the total power average value is smaller than or equal to the power lower limit, the third power penalty subparameter is negatively correlated with the difference of the power lower limit minus the total power average value; when the total power average value is smaller than the power upper limit and larger than the power lower limit, the second power penalty subparameter and the third power penalty subparameter are both fixed values; the total power average value is the average of the power average values corresponding to all audio frames to be detected, the power average value of each audio frame to be detected is determined according to the average of the power values of all frequency points, and the audio frames to be detected are the audio frames of the target dry audio whose power average value is larger than a mute power threshold.
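The penalty logic of claim 1 can be sketched as follows. This is a non-authoritative illustration: the threshold values (`ratio_threshold`, `power_upper`, `power_lower`) and the exponential-decay form chosen for each "negatively correlated" mapping are assumptions, since the claim fixes only the qualitative behavior, not the functional form or constants.

```python
import math

def power_penalty(num_detected, num_total, frame_power_means,
                  ratio_threshold=0.5, power_upper=-10.0, power_lower=-60.0):
    """Sketch of the power penalty parameter of claim 1.

    An exponential decay stands in for each 'negatively correlated'
    mapping; the patent does not fix the exact functional form.
    """
    # First subparameter: penalize a low share of valid (non-silent) frames.
    ratio = num_detected / num_total
    if ratio <= ratio_threshold:
        p1 = math.exp(-(ratio_threshold - ratio))  # shrinks as the shortfall grows
    else:
        p1 = 1.0  # fixed value above the threshold

    # Total power mean over all frames to be detected.
    total_mean = sum(frame_power_means) / len(frame_power_means)

    # Second subparameter: penalize audio that is too loud overall.
    p2 = math.exp(-(total_mean - power_upper)) if total_mean >= power_upper else 1.0
    # Third subparameter: penalize audio that is too quiet overall.
    p3 = math.exp(-(power_lower - total_mean)) if total_mean <= power_lower else 1.0

    # Claim 1 allows a product of at least two subparameters, or any one alone;
    # the full product is used here for illustration.
    return p1 * p2 * p3
```

With these assumed constants, audio whose valid-frame share and mean power both sit inside the allowed band incurs no penalty (value 1.0), and the penalty shrinks toward 0 as either quantity leaves the band.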
2. The method according to claim 1, wherein the determining the voice fundamental frequency estimated value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected of the target dry audio comprises:
determining the voice fundamental frequency estimated value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset voice frequency characteristic information.
3. The method according to claim 2, wherein the voice frequency characteristic information is a voice fundamental frequency range, and the determining the voice fundamental frequency estimated value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the preset voice frequency characteristic information comprises:
smoothing the power spectrum corresponding to each audio frame to be detected with a preset window length;
and determining, for each audio frame to be detected, the minimum peak frequency of the smoothed power spectrum within the voice fundamental frequency range as the voice fundamental frequency estimated value corresponding to that audio frame.
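A minimal sketch of claims 2-3: smooth the power spectrum, then return the lowest-frequency local peak inside the fundamental range. The moving-average kernel, the 80-400 Hz range, and the window length of 5 bins are illustrative assumptions; the claims specify only "a preset window length" and "a voice fundamental frequency range".

```python
def estimate_f0(power_spectrum, bin_hz, f0_min=80.0, f0_max=400.0, window=5):
    """Sketch of claims 2-3: smooth the power spectrum, then return the
    lowest-frequency local peak inside the assumed fundamental range.

    A moving average stands in for the unspecified smoothing; 80-400 Hz
    is an assumed range, not one stated in the patent.
    """
    n = len(power_spectrum)
    half = window // 2
    # Moving-average smoothing with the preset window length.
    smoothed = [
        sum(power_spectrum[max(0, i - half):min(n, i + half + 1)])
        / (min(n, i + half + 1) - max(0, i - half))
        for i in range(n)
    ]
    # Scan bins whose center frequency lies in the fundamental range and
    # return the first (lowest-frequency) local maximum found.
    for i in range(1, n - 1):
        freq = i * bin_hz
        if f0_min <= freq <= f0_max:
            if smoothed[i] > smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]:
                return freq
    return None  # no peak found in range
```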
4. The method of claim 1, wherein, for each audio frame to be detected, performing multiplication processing on the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a power spectrum after the multiplication processing comprises:
constructing a weight coefficient function corresponding to each audio frame to be detected according to the voice fundamental frequency estimated value corresponding to that audio frame, wherein the weight coefficient function represents the weight values corresponding to different frequency points, and the waveform of the weight coefficient function has a plurality of peaks located respectively at positive integer multiples of the voice fundamental frequency estimated value;
and, for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame by the weight coefficient function to obtain the power spectrum after the multiplication processing.
5. The method of claim 4, wherein the weight coefficient function is a trigonometric function.
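Claims 4-5 can be sketched with a raised-cosine weight, one trigonometric function whose maxima of 1.0 fall exactly at integer multiples of the estimated fundamental. The raised-cosine choice is an assumption; the claims require only a trigonometric weight function with peaks at positive integer multiples of the fundamental estimate.

```python
import math

def harmonic_weight(freq_hz, f0_hz):
    """Trigonometric weight (claims 4-5): value 1.0 at every integer
    multiple of the estimated fundamental f0, falling to 0.0 midway
    between harmonics."""
    return 0.5 + 0.5 * math.cos(2.0 * math.pi * freq_hz / f0_hz)

def weight_power_spectrum(power_spectrum, bin_hz, f0_hz):
    """Multiply each frequency bin's power by the weight at that bin
    (the 'multiplication processing' of claim 4)."""
    return [p * harmonic_weight(i * bin_hz, f0_hz)
            for i, p in enumerate(power_spectrum)]
```

Because the weight peaks ride on the harmonic grid, a voiced frame (power concentrated at harmonics of the fundamental) retains most of its power after weighting, while broadband noise loses roughly half of its power.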
6. The method according to claim 1, wherein the determining the voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the power spectrum after the multiplication processing comprises:
for each audio frame to be detected, determining the ratio of the total power of the power spectrum after the multiplication processing to the total power of the power spectrum without the multiplication processing, and normalizing the ratio according to a preset ratio upper limit and a preset ratio lower limit to obtain a normalized ratio, which is used as the voice existence probability corresponding to that audio frame.
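Claim 6's ratio-and-normalize step can be sketched as below. The bounds 0.1 and 0.6 and the clipped linear normalization are illustrative assumptions; the claim states only that a preset upper limit and lower limit are used.

```python
def voice_presence_probability(power_spectrum, weighted_spectrum,
                               ratio_lower=0.1, ratio_upper=0.6):
    """Sketch of claim 6. A harmonic (voiced) frame concentrates power at
    the weight peaks, so its weighted/unweighted power ratio is higher
    than that of a noise frame."""
    ratio = sum(weighted_spectrum) / max(sum(power_spectrum), 1e-12)
    # Clipped linear normalization of the ratio into [0, 1].
    prob = (ratio - ratio_lower) / (ratio_upper - ratio_lower)
    return min(1.0, max(0.0, prob))
```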
7. The method according to claim 1, wherein the detecting the human voice audio frame and the non-human voice audio frame in the audio frames to be detected of the target dry voice audio according to the human voice existence probability corresponding to each audio frame to be detected comprises:
Detecting a voice audio frame in the to-be-detected audio frames of the target dry voice audio according to the voice existence probability and the voice detection probability threshold value corresponding to each to-be-detected audio frame;
And detecting a non-human voice audio frame in the audio frames to be detected of the target dry voice audio according to the human voice existence probability and the non-human voice detection probability threshold value corresponding to each audio frame to be detected.
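Claim 7 uses two separate thresholds, so the classification can be sketched as below; the threshold values 0.8 and 0.2 are assumptions, and note that frames falling between the two thresholds belong to neither class.

```python
def classify_frames(probs, voice_threshold=0.8, nonvoice_threshold=0.2):
    """Sketch of claim 7: two separate probability thresholds; frames
    with probabilities between them are left unclassified."""
    voice = [i for i, p in enumerate(probs) if p >= voice_threshold]
    nonvoice = [i for i, p in enumerate(probs) if p <= nonvoice_threshold]
    return voice, nonvoice
```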
8. The method of claim 1, wherein the determining the voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a signal-to-noise ratio estimated value of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
determining the voice clarity of the target dry audio according to the voice existence probability corresponding to the human voice audio frames;
and determining the product of the signal-to-noise ratio estimated value and the voice clarity as the voice quality information of the target dry audio.
9. The method of claim 8, wherein the determining the signal-to-noise ratio estimated value of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a power average value corresponding to each human voice audio frame and a power average value corresponding to each non-human voice audio frame, wherein the power average value is determined according to the average of the power values of all frequency points;
determining a first median value of the power average values corresponding to all human voice audio frames, and a second median value of the power average values corresponding to all non-human voice audio frames;
and calculating the signal-to-noise ratio estimated value according to the ratio of the first median value to the second median value.
10. The method of claim 8, wherein the determining the voice clarity of the target dry audio according to the voice existence probability corresponding to the human voice audio frames comprises:
determining the median value of the voice existence probabilities corresponding to all human voice audio frames as the voice clarity of the target dry audio.
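Claims 8-10 combine into the sketch below: a median-based SNR estimate times a median-based clarity score. Expressing the SNR in dB is an assumption; the claims say only that the estimate is calculated "according to the ratio" of the two medians.

```python
import math

def median(values):
    """Median of a non-empty list (midpoint of the two central values
    for even length)."""
    s = sorted(values)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def voice_quality(voice_power_means, nonvoice_power_means, voice_probs):
    """Sketch of claims 8-10: SNR estimate from the ratio of the median
    voice-frame power to the median non-voice-frame power (in dB, an
    assumption), times the median voice presence probability (clarity)."""
    snr_db = 10.0 * math.log10(median(voice_power_means)
                               / median(nonvoice_power_means))
    clarity = median(voice_probs)
    return snr_db * clarity
```

Medians rather than means make both factors robust to a few outlier frames, e.g. a single clipped frame or one misclassified noise burst.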
11. The method according to claim 1, further comprising:
if no human voice audio frame is detected, determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the average value of the voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the average value of the voice existence probabilities and the power penalty parameter.
12. The method according to claim 1, further comprising:
if no non-human voice audio frame is detected, determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the median value of the voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the median value of the voice existence probabilities and the power penalty parameter.
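The fallback paths of claims 11-12 can be sketched together: when one frame class is missing, score the audio from the presence probabilities alone, scaled by the power penalty. Combining the two quantities by multiplication is an assumption; the claims say only that the quality information is determined "according to" both.

```python
def fallback_quality(voice_probs, power_penalty, use_median=False):
    """Sketch of claims 11-12. Claim 11 (no human voice frames detected)
    uses the mean of the presence probabilities; claim 12 (no non-human
    voice frames detected) uses the median. The product with the power
    penalty parameter is an assumed combination."""
    if use_median:
        s = sorted(voice_probs)
        n = len(s)
        stat = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    else:
        stat = sum(voice_probs) / len(voice_probs)
    return stat * power_penalty
```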
13. A computer device comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to perform the operations performed by the method of detecting audio quality of any of claims 1 to 12.
14. A computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement operations performed by the method of detecting audio quality of any one of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110831738.2A CN113593604B (en) | 2021-07-22 | 2021-07-22 | Method, device and storage medium for detecting audio quality |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113593604A CN113593604A (en) | 2021-11-02 |
CN113593604B true CN113593604B (en) | 2024-07-19 |
Family
ID=78249051
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110831738.2A Active CN113593604B (en) | 2021-07-22 | 2021-07-22 | Method, device and storage medium for detecting audio quality |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113593604B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115295024A (en) * | 2022-04-11 | 2022-11-04 | 维沃移动通信有限公司 | Signal processing method, signal processing device, electronic apparatus, and medium |
CN117476040B (en) * | 2023-12-25 | 2024-03-29 | 深圳市鑫闻达电子有限公司 | Audio identification method and identification system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104269180A (en) * | 2014-09-29 | 2015-01-07 | 华南理工大学 | Quasi-clean voice construction method for voice quality objective evaluation |
CN112967738A (en) * | 2021-02-01 | 2021-06-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Human voice detection method and device, electronic equipment and computer readable storage medium |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3451146B2 (en) * | 1995-02-17 | 2003-09-29 | 株式会社日立製作所 | Denoising system and method using spectral subtraction |
EP2830061A1 (en) * | 2013-07-22 | 2015-01-28 | Fraunhofer Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for encoding and decoding an encoded audio signal using temporal noise/patch shaping |
CN105656931B (en) * | 2016-03-01 | 2018-10-30 | 邦彦技术股份有限公司 | Method and device for objectively evaluating and processing voice quality of network telephone |
CN106653048B (en) * | 2016-12-28 | 2019-10-15 | 云知声(上海)智能科技有限公司 | Single channel sound separation method based on voice model |
US10939202B2 (en) * | 2018-04-05 | 2021-03-02 | Holger Stoltze | Controlling the direction of a microphone array beam in a video conferencing system |
CN110619885B (en) * | 2019-08-15 | 2022-02-11 | 西北工业大学 | Method for generating confrontation network voice enhancement based on deep complete convolution neural network |
CN110867194B (en) * | 2019-11-05 | 2022-05-17 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio scoring method, device, equipment and storage medium |
CN112233689B (en) * | 2020-09-24 | 2022-04-08 | 北京声智科技有限公司 | Audio noise reduction method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
RU2743315C1 (en) | Method of music classification and a method of detecting music beat parts, a data medium and a computer device | |
KR101266894B1 (en) | Apparatus and method for processing an audio signal for speech emhancement using a feature extraxtion | |
US10504539B2 (en) | Voice activity detection systems and methods | |
CN111149370B (en) | Howling detection in a conferencing system | |
CN110265064B (en) | Audio frequency crackle detection method, device and storage medium | |
CN103325386B (en) | The method and system controlled for signal transmission | |
CN111128213B (en) | Noise suppression method and system for processing in different frequency bands | |
US10014005B2 (en) | Harmonicity estimation, audio classification, pitch determination and noise estimation | |
CN113593604B (en) | Method, device and storage medium for detecting audio quality | |
CN105118522B (en) | Noise detection method and device | |
WO2022012195A1 (en) | Audio signal processing method and related apparatus | |
CN112151055B (en) | Audio processing method and device | |
CN110111811B (en) | Audio signal detection method, device and storage medium | |
CN111640451B (en) | Maturity evaluation method and device, and storage medium | |
CN111415653A (en) | Method and apparatus for recognizing speech | |
CN112712816A (en) | Training method and device of voice processing model and voice processing method and device | |
CN112669797B (en) | Audio processing method, device, electronic equipment and storage medium | |
CN111755025B (en) | State detection method, device and equipment based on audio features | |
WO2016173675A1 (en) | Suitability score based on attribute scores | |
JP2016180915A (en) | Voice recognition system, client device, voice recognition method, and program | |
JP6724290B2 (en) | Sound processing device, sound processing method, and program | |
CN112233693B (en) | Sound quality evaluation method, device and equipment | |
CN112614512B (en) | Noise detection method and device | |
CN115206345B (en) | Music and human voice separation method, device, equipment and medium based on time-frequency combination | |
EP3689002B1 (en) | Howl detection in conference systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||