CN100483509C - Aural signal classification method and device - Google Patents


Info

Publication number
CN100483509C
CN100483509C · CN200610164456 · CN200610164456A
Authority
CN
China
Prior art keywords: parameter, signal, parameters, sub, type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 200610164456
Other languages
Chinese (zh)
Other versions
CN101197135A (en)
Inventor
严勤 (Yan Qin)
邓浩江 (Deng Haojiang)
王珺 (Wang Jun)
许剑峰 (Xu Jianfeng)
许丽净 (Xu Lijing)
李伟 (Li Wei)
张清 (Zhang Qing)
桑盛虎 (Sang Shenghu)
杜正中 (Du Zhengzhong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Institute of Acoustics CAS
Original Assignee
Huawei Technologies Co Ltd
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd and Institute of Acoustics CAS
Priority to CN200610164456
Priority to PCT/CN2007/003798 (WO2008067735A1)
Priority to EP07855800A (EP2096629B1)
Publication of CN101197135A
Application granted
Publication of CN100483509C
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00
    • G10L25/78 — Detection of presence or absence of voice signals
    • G10L19/00 — Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 — Speech or audio signal analysis-synthesis techniques using predictive techniques
    • G10L19/16 — Vocoder architecture
    • G10L19/18 — Vocoders using multiple modes
    • G10L19/20 — Vocoders using multiple modes using sound-class-specific coding, hybrid encoders or object-based coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a sound signal classification method comprising the following steps: receiving a sound signal and determining the update rate of the background noise according to the spectrum distribution parameter of the background noise and that of the sound signal; updating the noise parameters according to the update rate; and classifying the sound signal according to the sub-band energy parameters and the updated noise parameters. The invention further discloses a sound signal classification device comprising: a background noise parameter updating module, which determines the update rate of the background noise from the spectrum distribution parameters of the background noise and of the current sound signal and outputs the determined update rate; and a signal initial classification (PSC) module, which receives the update rate from the background noise parameter updating module, updates the noise parameters, classifies the current sound signal according to the sub-band energy parameters and the updated noise parameters, and outputs the signal type determined by the classification.

Description

Sound signal classification method and device
Technical Field
The present invention relates to the field of speech coding technology, and in particular, to a sound signal classification method and a sound signal classification device.
Background
To save transmission bandwidth, speech coding in the speech signal processing field uses a Voice Activity Detection (VAD) technique so that the encoder can encode background noise and active speech at different rates, i.e., background noise at a lower rate and active speech at a higher rate. This reduces the average bit rate and has greatly promoted the development of variable-rate speech coding.
Existing voice activity detectors (VAD) were developed for speech signals and divide the input audio signal into only two types: noise and non-noise. Newer encoders such as AMR-WB+ and SMV incorporate detection of music signals as a modification and complement to the VAD decision. An important characteristic of the AMR-WB+ encoder is that, after VAD detection, the input audio signal is coded in different modes according to whether it is speech or music, so as to reduce the bit rate as far as possible while ensuring coding quality.
The two coding modes in AMR-WB+ are: Algebraic Code-Excited Linear Prediction (ACELP) and Transform Coded Excitation (TCX). ACELP fully exploits the characteristics of speech by building a speech production model; its coding efficiency for speech signals is very high and the technology is quite mature, so extending ACELP to a general audio coder greatly improves that coder's speech coding quality. Similarly, extending TCX coding to low-bit-rate speech coders improves their coding quality for wideband music.
Depending on complexity, the ACELP/TCX mode selection algorithms of the AMR-WB+ coding algorithm are of two types: an open-loop selection algorithm and a closed-loop selection algorithm. Closed-loop selection has high complexity, is the default option, and performs an exhaustive search based on the perceptually weighted signal-to-noise ratio.
The open loop selection comprises the following steps:
First, in step 101, the VAD module determines whether the signal is a non-useful signal or a useful signal according to a tone flag (Tone_flag) and the sub-band energy parameter (Level[n]).
Then, in step 102, a preliminary mode selection (EC) is performed.
In step 103, the mode preliminarily determined in step 102 is modified and a refined mode selection (ESC) is performed, based on the open-loop pitch parameter and the ISF parameter, to determine the selected coding mode.
In step 104, TCXS processing is performed: when the speech signal coding mode has been selected fewer than three consecutive times, a small-scale closed-loop traversal search is performed and the coding mode is finally determined, where the speech signal coding mode is ACELP and the music signal coding mode is TCX.
The AMR-WB+ signal selection algorithm has the following disadvantages:
1. When classifying signals, the existing VAD module does not distinguish noise from certain kinds of music signals well, which reduces the accuracy of sound signal classification.
2. Calculating the open-loop pitch parameters is necessary for the ACELP coding mode but not for the TCX coding mode. In the AMR-WB+ structure, the VAD and the open-loop mode selection algorithm both use the open-loop pitch parameter, so the open-loop pitch must be calculated for all frames. For non-ACELP coding modes (such as TCX) this is redundant complexity: it increases the computation of coding mode selection and reduces efficiency.
3. Although its VAD detection algorithm outperforms current encoders in speech detection and noise immunity, the tail of some special music signals may be misjudged as noise, so the music tail is cut off and sounds unnatural.
4. The mode selection algorithm of AMR-WB+ does not consider the signal-to-noise-ratio (SNR) environment of the signal, and its ability to distinguish speech from music deteriorates further under low-SNR conditions.
Disclosure of Invention
In view of the above, the present invention provides a sound signal classification method and a sound signal classification apparatus, which can improve the accuracy of sound signal classification detection.
The invention provides a sound signal classification detection method, which comprises the following steps:
receiving a sound signal and determining the update rate of the background noise according to the background noise spectrum distribution parameter and the spectrum distribution parameter of the sound signal; and updating the noise parameters according to the update rate and classifying the sound signal, according to the sub-band energy parameters and the updated noise parameters, into useful signals and non-useful signals.
The invention provides a sound signal classification device, comprising: a background noise parameter updating module and a signal initial classification PSC module;
the background noise parameter updating module is used for determining the updating rate of the background noise according to the background noise spectrum distribution parameters and the spectrum distribution parameters of the current sound signal and sending the determined updating rate;
the PSC module is used for receiving the updating rate from the background noise parameter updating module, updating the noise parameters, classifying the current sound signals according to the sub-band energy parameters and the updated noise parameters, and sending the sound signal types determined by classification.
According to this scheme, the update rate of the background noise is determined, the noise parameters are updated according to that rate, and the signals are initially classified according to the sub-band energy parameters and the updated noise parameters to determine the non-useful and useful signals in the received sound signal. This reduces the misjudgment of useful signals as noise and improves the accuracy of sound signal classification.
Drawings
FIG. 1 is a schematic diagram of an AMR-WB + encoding algorithm open loop selection in the prior art;
FIG. 2 is a general flowchart of the classification detection method of sound signals according to the present invention;
FIG. 3 is a schematic diagram of the sound signal classifying apparatus according to the present invention;
FIG. 4 is a schematic diagram of a system on which an embodiment of the present invention is based;
FIG. 5 is a flow chart of an encoder parameter extraction module calculating various parameters according to an embodiment of the present invention;
FIG. 6 is a flow chart of another encoder parameter extraction module for calculating various parameters according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the components of a PSC module according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a signal classification decision module determining a feature parameter according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating a signal classification decision module performing speech decision according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating the signal classification decision module performing a music decision according to an embodiment of the present invention;
FIG. 11 is a schematic diagram illustrating the signal classification decision module correcting an initial decision result according to an embodiment of the present invention;
FIG. 12 is a schematic diagram illustrating the signal classification decision module performing preliminary correction classification on an uncertain signal according to an embodiment of the present invention;
FIG. 13 is a schematic diagram illustrating the final classification and correction of a signal by the signal classification decision module according to an embodiment of the present invention;
FIG. 14 is a schematic diagram illustrating parameter updating performed by the signal classification decision module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
The main idea of the method is to determine the update rate of the background noise according to the spectrum distribution parameters of the current sound signal and of the background noise, and to update the noise parameters according to that rate. When the useful and non-useful signals in the received sound signal are determined, the determination is therefore based on the updated noise parameters, so the noise parameters are more accurate at decision time and the accuracy of sound signal classification is improved.
As shown in fig. 2, the present invention firstly provides a sound signal classification detection method, which includes:
step 201, receiving a sound signal, and determining an update rate of a background noise according to a background noise spectrum distribution parameter and a spectrum distribution parameter of the sound signal;
and step 202, updating the noise parameters according to the updating rate, and classifying the sound signals according to the sub-band energy parameters and the updated noise parameters.
In step 202, the sound signals are classified into useful-signal types and non-useful-signal types. Thereafter, the type of the useful signal, which includes speech signals and music signals, may be further determined. In that determination, depending on whether the noise estimate has converged, the decision is based either on the open-loop pitch parameter, the ISF parameter, and the sub-band energy parameter, or on the ISF parameter and the sub-band energy parameter alone.
In addition, to prevent the tail of a music signal from being misjudged as a non-useful signal, which degrades the sound quality, the method also obtains the determined useful-signal type, determines the hangover length according to that type, and then determines the useful and non-useful signals in the received sound signal according to the hangover length. Here, the hangover for music signals can be set long, improving the rendered quality of the music.
When a useful signal is decided to be a speech signal or a music signal, a signal that cannot be determined with confidence can be set to an uncertain type; the uncertain type is then corrected according to other parameters, and the type of the useful signal is finally determined.
Because not all coding modes for non-useful signals require the ISF parameters, the ISF parameters are not calculated for a determined non-useful signal whose coding mode does not need them; this reduces the computation in the classification process and improves classification efficiency.
As shown in fig. 3, the present invention further provides a sound signal classification apparatus, which includes a background noise parameter updating module and a signal initial classification (PSC) module. The background noise parameter updating module determines the update rate of the background noise according to the spectrum distribution parameter of the current sound signal and the background noise spectrum distribution parameter, and transmits the determined update rate to the PSC module. The PSC module updates the noise parameters according to the update rate from the background noise parameter updating module, initially classifies the signal according to the sub-band energy parameters and the updated noise parameters, and determines the received sound signal to be of a useful-signal type or a non-useful-signal type.
The sound signal classification apparatus may further include a signal classification decision module. The PSC module also transmits the determined signal type to the signal classification decision module. The signal classification decision module determines the type of the useful signal, which includes speech signals and music signals, based on the open-loop pitch parameter, the ISF parameter, and the sub-band energy parameter, or based on the ISF parameter and the sub-band energy parameter.
The sound signal classification apparatus may further include a classification parameter extraction module. The PSC module transmits the determined signal type to the signal classification decision module through the classification parameter extraction module. The classification parameter extraction module also acquires the ISF parameter and the sub-band energy parameter, and optionally the open-loop pitch parameter, processes the acquired parameters into signal classification characteristic parameters, and transmits them to the signal classification decision module; it further processes the acquired parameters into the spectrum distribution parameter of the sound signal and the background noise spectrum distribution parameter, and transmits these to the background noise parameter updating module. The signal classification decision module determines the type of the useful signal, which includes speech signals and music signals, according to the signal classification characteristic parameters and the signal type determined by the PSC module.
The PSC module may further transmit the signal-to-noise ratio of the sound signal, calculated while determining the signal type, to the signal classification decision module; the signal classification decision module then further determines the useful signal to be a speech signal or a music signal according to the signal-to-noise ratio.
The sound signal classification apparatus may further include: an encoder mode and rate selection module; the signal classification decision module transmits the determined signal type to the encoder mode and rate selection module; the coder mode and rate selection module determines a coding mode and rate of a sound signal according to the received signal type.
The sound signal classification apparatus may further include an encoder parameter extraction module, which extracts the ISF parameter and the sub-band energy parameter, and optionally the open-loop pitch parameter, transmits the extracted parameters to the classification parameter extraction module, and transmits the extracted sub-band energy parameter to the PSC module.
The following describes a sound signal classification detection method and a sound signal classification apparatus according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a system according to an embodiment of the present invention. The digital audio encoder comprises a sound signal classification detector (SAD) that divides the input digital audio signal into different classes according to the requirements of the encoder; here the classes are non-useful signal, speech, and music, providing the encoder with a basis for coding mode and rate selection.
As can be seen in fig. 4, the SAD module internally comprises four sub-modules: the background noise estimation control module, the signal initial classification module, the classification parameter extraction module, and the signal classification decision module. As the signal classifier used in the encoder, the SAD makes full use of the encoder's parameters to reduce resource consumption and computational complexity: the sub-band energy parameters and encoder parameters are calculated by an encoder parameter extraction module inside the encoder, and the calculated parameters are provided to the SAD module. The final output of the SAD module is the signal decision type, one of the three classes non-useful signal, speech, and music, and is provided to the encoder mode and rate selection module for selecting the coding mode and rate.
The modules related to SAD in the encoder, the sub-modules in SAD, and the interaction process between the modules are described in detail below.
An encoder parameter extraction module in the encoder calculates the sub-band energy parameters and encoder parameters and provides them to the SAD module. The sub-band energy parameter may be calculated by filter-bank filtering; the number of sub-bands is determined by the computational complexity and classification accuracy requirements. In this embodiment the signal is divided into 12 sub-bands, as described below.
In this embodiment, the process by which the encoder parameter extraction module calculates the parameters needed by the SAD module can be as shown in fig. 5 or fig. 6.
the process shown in fig. 5 includes the following steps:
In step 501, the encoder parameter extraction module first calculates the sub-band energy parameter.
Step 502, the encoder parameter extraction module determines whether to perform the immittance spectral frequency (ISF) operation according to the signal initial decision result (Vad_flag) from the PSC module; if so, step 503 is performed; otherwise, step 504 is performed.
Determining whether the ISF operation is required proceeds as follows. If the current frame is a non-useful signal, the decision follows the encoder's mechanism: if the encoder needs the ISF parameters to encode non-useful signals, the ISF operation is performed; if not, the encoder parameter extraction module ends. If the current frame is a useful signal, the ISF operation is performed. Calculating the ISF parameters of a useful signal is required by most coding modes, so this introduces no redundant complexity to the encoder. For the calculation of the ISF parameters, refer to the documentation of the various encoders; it is not described here.
Step 503, the encoder parameter extraction module calculates the ISF parameter, and then executes step 504.
Step 504, the encoder parameter extraction module calculates the open-loop pitch parameter.
The subband energy parameters calculated by the above-described procedure of fig. 5 are provided to the PSC module and the classification parameter extraction module in the SAD, and the remaining parameters are provided to the classification parameter extraction module in the SAD.
The flowchart shown in fig. 6 adds to the flowchart of fig. 5 a step that decides whether to calculate the open-loop pitch parameter according to whether the initial noise estimate has converged. Steps 601 to 603 are substantially the same as steps 501 to 503 in fig. 5. In step 604, it is determined whether the initialized noise parameter, i.e. the noise estimate, has converged; if it has not converged, the open-loop pitch parameter is calculated in step 605; otherwise, the open-loop pitch parameter is not calculated.
Because the open-loop pitch parameter is redundant for some coding modes, such as the TCX coding mode, the computational complexity can be reduced: after the noise estimate has converged, it can essentially be determined that the coding mode for the signal does not require the open-loop pitch parameter, so it is no longer calculated.
Before the noise estimate converges, the open-loop pitch parameters must be calculated to ensure the convergence of the noise estimate and its convergence speed; but this belongs to the start-up phase, and its complexity is negligible. For the calculation of the open-loop pitch parameter, refer to ACELP-based coding; it is not described in detail here. The criterion for deciding whether the noise estimate has converged may be that the number of consecutively decided noise frames exceeds a noise convergence threshold (THR1); in one example of this embodiment, THR1 is 20.
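As an illustrative sketch (not the patent's reference code), the convergence rule and the resulting decision to skip the open-loop pitch calculation can be written as follows; the function names are hypothetical:

```python
THR1 = 20  # noise convergence threshold, from the example above

def noise_converged(consecutive_noise_frames):
    """Noise estimate is considered converged once more than THR1
    consecutive frames have been decided to be noise."""
    return consecutive_noise_frames > THR1

def parameters_to_extract(consecutive_noise_frames):
    """Which parameters the extraction module computes for the next frame:
    the open-loop pitch is only needed until the noise estimate converges."""
    params = ["subband_energy", "isf"]
    if not noise_converged(consecutive_noise_frames):
        params.append("open_loop_pitch")
    return params
```

During start-up every frame pays the pitch-search cost; once the counter passes THR1, non-ACELP modes avoid it entirely.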
The extracted sub-band energy parameter is Level[i], where i is the member index of the vector; in this embodiment i = 1..12, corresponding to the sub-bands 0–200 Hz, 200–… Hz.
The extracted ISF parameter is isf_n[i], where n denotes the frame index and i = 1..16 denotes the member index in the vector.
The extracted open-loop pitch parameters include the open-loop pitch gain (ol_gain), the open-loop pitch lag (ol_lag), and the tone flag (tone_flag). If the value of ol_gain is greater than the tone threshold (TONE_THR), tone_flag is set to 1.
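The tone-flag rule above is a single threshold test; a minimal sketch, in which the value of TONE_THR is an assumed placeholder (the text does not give it):

```python
TONE_THR = 0.6  # illustrative value; the patent does not specify TONE_THR

def compute_tone_flag(ol_gain):
    """Set tone_flag to 1 when the open-loop pitch gain exceeds TONE_THR."""
    return 1 if ol_gain > TONE_THR else 0
```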
The signal initial classification (PSC) module may be implemented with any of various existing VAD algorithm schemes, and specifically includes a background noise estimation sub-module, a signal-to-noise ratio calculation sub-module, a useful signal estimation sub-module, a decision threshold adjustment sub-module, a comparison sub-module, and a hangover protection sub-module for useful signals. In this embodiment, as shown in fig. 7, the implementation of the PSC module may differ from a conventional VAD algorithm module in the following three points:
I. The signal-to-noise ratio calculation sub-module calculates the signal-to-noise ratio from the background noise estimate and the sub-band energy parameter; the calculated signal-to-noise ratio parameter (snr) is used inside the PSC module and is also transmitted to the signal classification decision module, so that the latter can distinguish speech from music more accurately under low signal-to-noise conditions.
II. Since existing VADs do not distinguish noise from certain kinds of music well, this embodiment improves the VAD as follows. First, the calculation of the background noise parameters is controlled by the update rate acc provided by the background noise parameter updating module: the background noise estimation sub-module receives the update rate from the background noise parameter updating module, updates the noise parameters, and transmits the background noise sub-band energy estimate calculated from the updated noise parameters to the signal-to-noise ratio calculation sub-module. For the calculation of the update rate, see the description of the background noise parameter updating module below. In one example of this embodiment, the update rate may take 4 steps: acc1, acc2, acc3, acc4. For different update rates, different upward update parameters (update_up) and downward update parameters (update_down) are determined, corresponding to the upward and downward update rates of the background noise, respectively.
The scheme for updating the noise parameters may specifically adopt the scheme in AMR-WB+:

If (bckr_est_m[n] < level_{m-1}[n])
    update = update_up
else
    update = update_down

The formula for the noise estimate update is:

bckr_est_{m+1}[n] = (1 - update) * bckr_est_m[n] + update * level_{m-1}[n]

The formula for updating the noise spectrum distribution parameter vector is:

P̃_{m+1}[i] = (1 - update) * P̃_m[i] + update * P_m[i]
where:
m: frame index
n: sub-band index
i: element index of the spectral distribution parameter vector, i = 1, 2, 3, 4
bckr_est: background noise sub-band energy estimate
P̃: background noise spectral distribution parameter vector estimate
P: current signal spectral distribution parameter vector
III. In existing VADs, a hangover is generally used to protect the useful signal from being misjudged as noise, and the hangover length must trade off signal protection against transmission efficiency. For a conventional speech coder, the hangover length can be a learned constant. A multi-rate encoder, however, targets audio signals that include music, and such signals often exhibit long, low-energy tails that are hard for a conventional VAD to detect and therefore require a long hangover to protect. In this embodiment, the hangover length in the hangover protection sub-module is adapted according to the SAD decision result: if the decision is a music signal (SAD_flag = MUSIC), a longer hangover parameter is set (hang_len = HANG_LONG); if the decision is a speech signal (SAD_flag = SPEECH), a shorter hangover parameter is set (hang_len = HANG_SHORT). Specifically:
If(SAD_flag=MUSIC)
hang_len=HANG_LONG
else if(SAD_flag=SPEECH)
hang_len=HANG_SHORT
else
hang_len=0
where:
SAD_flag: SAD decision flag
hang_len: hangover guard length
In one example of this embodiment, the unit is the number of frames, with HANG_LONG = 100 and HANG_SHORT = 20.
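The adaptive hangover selection above is directly executable; a minimal sketch using the example frame counts from the text:

```python
HANG_LONG = 100   # frames, example value from the text
HANG_SHORT = 20

def hangover_length(sad_flag):
    """Adaptive hangover: long for music tails, short for speech, none otherwise."""
    if sad_flag == "MUSIC":
        return HANG_LONG
    if sad_flag == "SPEECH":
        return HANG_SHORT
    return 0
```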
The classification parameter extraction module calculates the parameters required by the signal classification decision module and the background noise parameter updating module from the Vad_flag parameter determined by the signal initial classification module and the sub-band energy parameter, ISF parameter, and open-loop pitch parameter provided by the encoder parameter extraction module, and correspondingly provides the sub-band energy parameter, ISF parameter, open-loop pitch parameter, and the calculated parameters to the signal classification decision module and the background noise parameter updating module. The parameters calculated by the classification parameter extraction module include:
1. pitch parameter (pitch)
The differences of consecutive open-loop pitch lags are compared; if the increment of the open-loop pitch lag is smaller than a set threshold, a lag count is accumulated. If the sum of the lag counts of two consecutive frames is large enough, pitch is set to 1; otherwise pitch is set to 0. For the calculation of the open-loop pitch lag, see the AMR-WB+/AMR-WB standard documents.
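One way to read the rule above is the following sketch; the thresholds are illustrative placeholders, not values from the patent:

```python
def pitch_flag(ol_lags_prev, ol_lags_cur, lag_diff_thr=2, count_thr=4):
    """Sketch of the pitch-parameter rule above: count lag increments
    below lag_diff_thr within each frame, then set pitch = 1 when the
    counts of two consecutive frames sum high enough."""
    def stable_count(lags):
        return sum(1 for a, b in zip(lags, lags[1:]) if abs(b - a) < lag_diff_thr)
    return 1 if stable_count(ol_lags_prev) + stable_count(ol_lags_cur) >= count_thr else 0
```

A steady lag trajectory (voiced speech) produces many small increments and sets the flag; erratic lags (noise, percussive music) leave it at 0.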
2. Long-term signal correlation parameter (meangain)
meangain is the running average of tone over three adjacent frames, where tone_fig = 1000; the definition of tone_fig is the same as in AMR-WB+.
3. Zero-crossing rate (zcr)
zcr = (1/T) · Σ_{i=1}^{T-1} II{ x(i)·x(i-1) < 0 }
where II{A} is 1 when A is true and 0 when A is false.
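The zero-crossing rate can be computed directly from this definition:

```python
def zero_crossing_rate(x):
    """zcr = (1/T) * number of sign changes between adjacent samples,
    i.e. the count of indices i with x(i)*x(i-1) < 0, T = frame length."""
    T = len(x)
    return sum(1 for i in range(1, T) if x[i] * x[i - 1] < 0) / T
```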
4. Sub-band energy time-domain fluctuation (t_flux)
t_flux = ( Σ_{i=1}^{12} |level_m(i) - level_{m-1}(i)| ) / short_mean_level_energy
where short_mean_level_energy denotes the short-time average energy.
5. High/low sub-band energy ratio (ra)
ra = sublevel_high_energy / sublevel_low_energy
where, in one example of this embodiment:
sublevel_high_energy=level[10]+level[11];
sublevel_low_energy=level[0]+level[1]+level[2]+level[3]+level[4]+level[5]+level[6]+level[7]+level[8]+level[9];
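With the 12 sub-band energies of a frame held in a list `level`, the ratio of this example is:

```python
def band_energy_ratio(level):
    """High/low sub-band energy ratio ra for the example split above:
    sub-bands 10-11 form the high band, sub-bands 0-9 the low band."""
    sublevel_high_energy = level[10] + level[11]
    sublevel_low_energy = sum(level[0:10])
    return sublevel_high_energy / sublevel_low_energy
```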
6. Sub-band energy frequency-domain fluctuation (f_flux)
f_flux = ( Σ_{i=2}^{12} |level_m(i) - level_m(i-1)| ) / short_mean_level_energy
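Both fluctuation measures (items 4 and 6) follow directly from their formulas; a sketch with the per-frame sub-band energies as lists:

```python
def t_flux(level_m, level_prev, short_mean_level_energy):
    """Time-domain fluctuation: summed absolute change of each sub-band
    energy between frames m-1 and m, normalised by short_mean_level_energy."""
    return sum(abs(a - b) for a, b in zip(level_m, level_prev)) \
        / short_mean_level_energy

def f_flux(level_m, short_mean_level_energy):
    """Frequency-domain fluctuation: summed absolute difference between
    adjacent sub-bands within frame m, with the same normalisation."""
    return sum(abs(level_m[i] - level_m[i - 1])
               for i in range(1, len(level_m))) / short_mean_level_energy
```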
7. Short-term average of the ISF spectral distance (isf_meanSD): the average of the spectral distances Isf_SD of five adjacent frames, where
Isf_SD = Σ_{i=1}^{16} |Isf_m(i) - Isf_{m-1}(i)|
8. Sub-band energy standard-deviation average parameter (level_meanSD): the average of the sub-band energy standard deviations (level_SD) of two adjacent frames; the level_SD parameter is calculated analogously to Isf_SD.
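The spectral-distance parameters of items 7 and 8 reduce to an L1 distance between consecutive frames plus a short-term mean; a sketch:

```python
def isf_sd(isf_m, isf_prev):
    """Isf_SD: sum over the 16 ISF coefficients of |Isf_m(i) - Isf_{m-1}(i)|.
    level_SD is computed the same way from the sub-band energies."""
    return sum(abs(a - b) for a, b in zip(isf_m, isf_prev))

def short_term_mean(history):
    """Short-term average over recent frames, e.g. five Isf_SD values for
    isf_meanSD or two level_SD values for level_meanSD."""
    return sum(history) / len(history)
```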
Of the above 8 parameters, those provided to the background noise parameter updating module are zcr, ra, f_flux and t_flux; those provided to the signal classification decision module are pitch, meangain, isf_meanSD and level_meanSD.
The signal classification decision module distinguishes the signal into the following final categories according to snr and Vad_flag from the signal initial classification module PSC, together with the sub-band energy parameters, pitch, meangain, Isf_meanSD and level_meanSD from the classification parameter extraction module: non-useful signal (NOISE), speech signal (SPEECH) and music signal (MUSIC). The signal classification decision module may include a parameter updating sub-module and a decision sub-module. The parameter updating sub-module updates the thresholds used in the signal classification decision according to the signal-to-noise ratio and provides the updated thresholds to the decision sub-module. The decision sub-module receives the sound signal type from the PSC module, determines the type of the useful signal based on the open-loop pitch parameter, the ISF parameter, the sub-band energy parameter and the updated thresholds (or based on the ISF parameter, the sub-band energy parameter and the updated thresholds only), and sends the determined type of the useful signal to the encoder mode and rate selection module.
Determining whether the useful signal is a speech signal or a music signal comprises: first setting the speech flag bit and the music flag bit to 0; then preliminarily classifying the signal as speech, music or uncertain according to the pitch parameter flag, the long-term signal correlation value, the ISF spectral distance short-term average parameter and the sub-band energy standard-deviation average parameter, and modifying the speech or music flag bit according to the preliminary classification; then correcting the preliminary speech, music or uncertain classification according to the sub-band energy, the long-term signal correlation value, the sub-band energy standard-deviation average parameter, speech_flag, music_flag, whether the number of consecutive frames with pitch equal to 1 exceeds a preset trailing frame-number threshold, the number of consecutive music frames, the number of consecutive speech frames and the type of the previous frame, and finally determining the type of the useful signal, which comprises the speech signal and the music signal.
The following is a detailed description of the process of determining the useful signal as a speech signal or a music signal:
To ensure the stability of the signal decision and avoid frequent switching of the decision result, the present embodiment provides a flag trailing mechanism for the parameters, including determining the characteristic flag values pitch_flag, level_meanSD_high_flag, ISF_meanSD_low_flag, level_meanSD_low_flag and meangain_flag according to the trailing mechanism; the specific determination of these flag values is shown in fig. 8.
The length of the trailing period in fig. 8 is determined by the trailing parameter flag value. This embodiment provides two trailing settings, i.e., two schemes for determining the trailing parameter flag value:
In the first trailing setting scheme, when a parameter value is above (or below) a given threshold, the corresponding parameter trailing counter is incremented by one; otherwise the counter is reset to 0, and the different parameter trailing flags are set according to the counter values. The larger the counter value, the longer the trailing flag is held; how the flag is set from the counter is determined by the actual situation and is not detailed here.
In the second trailing setting scheme, the trailing length is controlled by the error rate ER of each internal node of the decision tree for the corresponding training parameter: a small error rate gives a short trailing, a large error rate a long trailing.
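The first trailing scheme can be sketched per parameter and per frame as follows; the minimum count needed to raise the flag is an assumed value, since the embodiment leaves it to the actual situation.

```python
def update_trailing_flag(value, threshold, counter, min_count=3):
    """One frame of the first trailing scheme for one parameter: the counter
    grows while the parameter exceeds its threshold and resets to 0 otherwise;
    the trailing flag is raised once the counter reaches min_count (assumed).
    Returns (flag, counter)."""
    counter = counter + 1 if value > threshold else 0
    flag = 1 if counter >= min_count else 0
    return flag, counter
```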
Thereafter, if the current signal is classified as a useful signal, an initial classification into speech and music is made:
First, the speech initial decision is made, as shown in fig. 9. In step 901 the speech flag bit is set to 0. Then, in step 902, it is determined whether Isf_meanSD is greater than a preset first ISF spectral speech threshold (for example, 1500); if so, the speech flag bit is set to 1; if not,
in step 903, it is determined whether pitch is 1 and the pitch delay value t_top_mean obtained by the open-loop pitch search is smaller than the pitch speech threshold (e.g., 40); if so, the speech flag bit is set to 1; if not,
in step 904, it is determined whether the number of consecutive frames with pitch equal to 1 exceeds a preset trailing frame-number threshold (for example, 2 frames); if so, the speech flag bit is set to 1; if not,
in step 905, it is determined whether meangain is greater than a preset long-term correlation speech threshold (e.g., 8000); if so, the speech flag bit is set to 1; if not,
in step 906, it is determined whether one or both of level_meanSD_high_flag and ISF_meanSD_high_flag have the value 1; if so, the speech flag bit is set to 1; otherwise the speech flag bit is left unchanged.
Then, the music initial decision is made, as shown in fig. 10:
In step 1001 the music flag bit is first set to 0; then, in step 1002, it is determined whether the signal simultaneously satisfies ISF_meanSD_low_flag == 1 and level_meanSD_low_flag == 1; if so, the music signal flag music_flag is set; otherwise the music flag bit is left unchanged.
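Combining figs. 9 and 10, the two initial decisions can be sketched compactly. The thresholds are the example values quoted above; the parameter and flag names, passed here in a dict, are normalized forms assumed for the sketch (`pitch_run` stands for the number of consecutive frames with pitch equal to 1).

```python
# Sketch of the initial speech decision (fig. 9) and music decision (fig. 10).
ISF_SPEECH_TH1 = 1500      # first ISF spectral speech threshold (example)
PITCH_SPEECH_TH = 40       # pitch speech threshold for t_top_mean (example)
PITCH_HANG_TH = 2          # trailing frame-number threshold (example)
MEANGAIN_SPEECH_TH = 8000  # long-term correlation speech threshold (example)

def initial_decision(p):
    """p: per-frame parameters and trailing flags.
    Returns (speech_flag, music_flag)."""
    speech_flag = 0
    if (p["isf_meanSD"] > ISF_SPEECH_TH1                        # step 902
            or (p["pitch"] == 1
                and p["t_top_mean"] < PITCH_SPEECH_TH)          # step 903
            or p["pitch_run"] > PITCH_HANG_TH                   # step 904
            or p["meangain"] > MEANGAIN_SPEECH_TH               # step 905
            or p["level_meanSD_high_flag"] == 1
            or p["ISF_meanSD_high_flag"] == 1):                 # step 906
        speech_flag = 1
    music_flag = 0
    if (p["ISF_meanSD_low_flag"] == 1
            and p["level_meanSD_low_flag"] == 1):               # step 1002
        music_flag = 1
    return speech_flag, music_flag
```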
Thereafter, as shown in fig. 11, the initial decision result is corrected:
First, in step 1101, it is determined whether the instantaneous sub-band energy is smaller than a sub-band energy threshold (for example, 5000); if so, step 1102 is executed; otherwise the signal is determined to be of the uncertain class (UNCERTAIN);
in step 1102, it is determined whether meangain_flag is 1 and the music duration counter is less than the music-duration speech decision threshold (e.g., 3); if so, the signal is determined to be a speech signal; if not,
in step 1103, it is determined whether the value of ISF_meanSD is greater than a preset second ISF spectral speech threshold (e.g., 2000); if so, the signal is determined to be a speech signal; if not,
in step 1104, it is determined whether level_energy is less than 10000 and the number of noise frames exceeds five; if so, the current signal class is set to the uncertain class, in order to reduce misjudging noise as music; if not,
in step 1105, it is determined whether the music flag bit and the speech flag bit are both 1; if so, the current signal class is determined to be the uncertain class; if not,
in step 1106, it is determined whether the music flag bit and the speech flag bit are both 0; if so, the current signal class is determined to be the uncertain class; if not,
in step 1107, it is determined whether the music flag bit is 0 and the speech flag bit is 1; if so, the current signal class is determined to be speech; if not,
in step 1108, since the music flag bit is 1 and the speech flag bit is 0, the current signal class is determined to be music.
After the signal is determined to be of the uncertain class in steps 1104 to 1106, step 1109 is executed: it is determined whether pitch_flag is 1, ISF_meanSD is smaller than the ISF spectral music threshold (for example, 900) and the number of consecutive speech frames is less than 3; if so, the signal is determined to be music; otherwise it remains of the uncertain class;
after the signal is determined to be speech in steps 1103 and 1107, step 1110 is executed: it is determined whether the number of consecutive music frames is greater than 3 and ISF_meanSD is less than the ISF spectral music threshold; if so, the signal is determined to be a music signal; otherwise it is determined to be a speech signal.
After speech and music signals have been determined by the above process, the procedure shown in fig. 12 is executed for signals still in the uncertain class, performing a preliminary correction of the classification: first, in step 1201, it is determined whether level_energy is smaller than a sub-band energy uncertain-class threshold (e.g., 5000); if so, the signal type remains uncertain; otherwise, in step 1202, it is determined whether the number of sustained music frames is greater than 1 and ISF_meanSD is less than the ISF spectral music threshold; if so, the signal is determined to be music; otherwise:
the speech and music trailing flags are cleared; if the preceding frames were consistently of the speech class with strong continuity, a speech decision is made from the speech characteristic parameters, and if the speech condition is met the speech trailing flag speech_flag is set to 1 (steps 1203 to 1206 in fig. 12); if the preceding frames were consistently of the music class with strong continuity, a music decision is made from the music characteristic parameters, and if the music condition is met the music trailing flag music_flag is set to 1 (steps 1207 to 1210 in fig. 12).
Thereafter, as shown in steps 1211 to 1216 in fig. 12: if the speech trailing flag is 1 and the music trailing flag is 0, the current signal class is set to speech; if the music trailing flag is 1 and the speech trailing flag is 0, it is set to music; if the two trailing flags are both 1 or both 0, the signal class is set to uncertain, and in that case, if the preceding music continuity exceeds 20 frames the signal is determined to be music, and if the preceding speech continuity exceeds 20 frames it is determined to be speech.
After the above preliminary correction, the final correction of the useful signal type is performed as in fig. 13, continuing the class correction according to the current context. In step 1301, if the current context is music with strong persistence exceeding 3 seconds, i.e., the number of current consecutive music frames exceeds 150, a forced correction according to the value of ISF_meanSD can determine the music signal. In step 1302, if the current context is speech with strong persistence exceeding 3 seconds, i.e., the number of current consecutive speech frames exceeds 150, a forced correction according to the value of ISF_meanSD can determine the speech signal type. Thereafter, if the signal class is still uncertain, it is corrected in step 1303 according to the previous context, i.e., the current uncertain signal class is resolved to the previous signal class.
After the class of the useful signal has been determined by the above procedure, the three class counters and the thresholds in the signal classification decision module are updated. For the class counters: if the current class is music (signal_sort == MUSIC), the music counter music_count is incremented by 1; otherwise it is cleared. The other class counters are handled similarly, as shown in fig. 14, and are not detailed here. The thresholds are updated according to the signal-to-noise ratio output by the signal initial classification module; each threshold example listed in this embodiment is a value trained under a 20 dB signal-to-noise ratio.
The background noise parameter updating module uses some of the spectral distribution parameters calculated in the classification parameter extraction module of the SAD to control the update rate of the background noise. In a practical application environment the energy level of the background noise may rise suddenly, so the signal can be continuously judged as useful and the background noise estimate then never gets updated; the background noise parameter updating module is provided to solve this problem.
The background noise parameter updating module calculates a spectral-distribution parameter vector from the parameters provided by the classification parameter extraction module; the vector comprises the following elements:
short-time average of the zero-crossing rate zcr
short-time average of the high/low sub-band energy ratio ra
short-time average of the sub-band energy frequency-domain fluctuation f_flux
short-time average of the sub-band energy time-domain fluctuation t_flux
The short-time average zcr_mean is calculated as follows (the other averages are analogous):
zcr_mean_m = ALPHA · zcr_mean_{m-1} + (1 - ALPHA) · zcr_m
where ALPHA is 0.96 and m denotes a frame index.
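One smoothing step of this recursion, applied identically to zcr, ra, f_flux and t_flux:

```python
ALPHA = 0.96  # smoothing factor given above

def short_time_average(prev_mean, current):
    """mean_m = ALPHA * mean_{m-1} + (1 - ALPHA) * current_m."""
    return ALPHA * prev_mean + (1 - ALPHA) * current
```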
This embodiment exploits the fact that the spectral characteristics of the background noise are stable; the members of the spectral distribution parameter vector need not be limited to the 4 listed above. The update rate of the current background noise is controlled by the difference d_cb between the current spectral distribution parameters and the background noise spectral distribution parameters. The difference can be computed with algorithms such as the Euclidean distance or the Manhattan distance; an example of this invention uses the Manhattan distance, that is:
d_cb = Σ_{i=1}^{4} |p(i) - p̃(i)|
where p is the spectral distribution parameter vector of the current signal and p̃ is the estimated background noise spectral distribution parameter vector.
In one example of this embodiment, when d_cb < TH1 the module outputs the update rate acc1, representing the fastest update; otherwise, when d_cb < TH2, it outputs acc2; otherwise, when d_cb < TH3, it outputs acc3; otherwise it outputs acc4. Here TH1, TH2 and TH3 are update thresholds, determined according to the actual environmental conditions.
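A sketch of the distance computation and the threshold ladder; the threshold values used here are placeholders, since the embodiment leaves the update thresholds environment-dependent.

```python
def manhattan_distance(p, p_bg):
    """d_cb: sum over the 4 vector elements of |p(i) - p_bg(i)|."""
    return sum(abs(a - b) for a, b in zip(p, p_bg))

def update_rate(d_cb, th=(1.0, 2.0, 3.0)):
    """Map d_cb to an update rate identifier; th holds assumed placeholder
    values for the update thresholds TH1 < TH2 < TH3."""
    th1, th2, th3 = th
    if d_cb < th1:
        return "acc1"   # fastest update
    if d_cb < th2:
        return "acc2"
    if d_cb < th3:
        return "acc3"
    return "acc4"       # slowest update
```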
The foregoing describes specific embodiments of the invention; the method of the invention may be modified as appropriate during implementation to suit the needs of particular situations. The described embodiments are therefore illustrative only and are not intended to limit the scope of the invention.

Claims (17)

1. A method of classifying a sound signal, the method comprising:
A. receiving a sound signal, and determining the update rate of background noise according to the background noise spectrum distribution parameter and the sound signal spectrum distribution parameter;
B. updating the noise parameters according to the update rate, and classifying the sound signal according to the sub-band energy parameter and the updated noise parameters to obtain a useful signal and a non-useful signal.
2. The method of claim 1, wherein step B is followed by further comprising:
C. determining the type of the useful signal obtained by the classification based on the open-loop pitch parameter, the ISF parameter and the sub-band energy parameter, the type comprising a speech signal and a music signal.
3. The method of claim 2, wherein step C is preceded by the further step of:
C0. detecting whether the noise estimate has converged; if so, executing step C1; otherwise, executing step C;
C1. determining the type of the useful signal obtained by the classification based on the ISF parameter and the sub-band energy parameter, the type comprising a speech signal and a music signal.
4. The method according to claim 3, wherein detecting in step C0 whether the initial noise has converged comprises: judging whether the number of consecutive noise frames before the received sound signal exceeds a preset noise convergence threshold; if so, determining that the noise estimate has converged; otherwise, determining that the noise estimate has not converged.
5. The method according to claim 2, wherein step B further obtains the determined type of the useful signal, determines a signal trailing length according to the type of the useful signal, and further classifies the sound signal according to the signal trailing length.
6. The method of claim 2, wherein step C comprises:
initializing a speech flag bit and a music flag bit; preliminarily determining the type of the useful signal as speech, music or uncertain according to the pitch parameter flag, the long-term signal correlation parameter, the ISF spectral distance short-term average parameter and the sub-band energy standard-deviation average parameter, and modifying the speech flag bit or the music flag bit according to the preliminarily determined speech or music type;
and correcting the preliminarily determined speech, music or uncertain type according to the sub-band energy, the long-term signal correlation parameter, the sub-band energy standard-deviation average parameter, the speech flag bit, the music flag bit, whether the number of consecutive frames with pitch parameter flag equal to 1 exceeds a preset trailing frame-number threshold, the number of consecutive music frames, the number of consecutive speech frames and the type of the previous frame, and finally determining the type of the useful signal, including the speech signal and the music signal.
7. The method of claim 6, wherein the trailing frame-number threshold is adjusted according to the signal-to-noise ratio of the sound signal.
8. The method of claim 1, wherein after step B, further comprising:
D. determining the coding mode corresponding to the classified non-useful signal, and determining according to the determined coding mode whether the ISF parameter needs to be calculated.
9. The method of claim 1, wherein the noise parameters in step B comprise: a noise estimation parameter and a noise spectral distribution parameter.
10. The method according to claim 1 or 9, wherein step A comprises: calculating a difference parameter between the sound signal spectral distribution parameters and the background noise spectral distribution parameters, and then determining the update rate according to the difference parameter.
11. The method of claim 10, wherein the spectral distribution parameters involved in calculating the difference parameter comprise: a zero-crossing-rate short-time average parameter, a high/low sub-band energy ratio short-time average parameter, a sub-band energy frequency-domain fluctuation short-time average parameter and a sub-band energy time-domain fluctuation short-time average parameter.
12. An apparatus for classifying a sound signal, the apparatus comprising: a background noise parameter updating module and a signal initial classification PSC module;
the background noise parameter updating module is used for determining the updating rate of the background noise according to the background noise spectrum distribution parameters and the spectrum distribution parameters of the current sound signal and sending the determined updating rate;
the PSC module is used for receiving the updating rate from the background noise parameter updating module, updating the noise parameters, classifying the current sound signals according to the sub-band energy parameters and the updated noise parameters, and sending the sound signal types determined by classification.
13. The apparatus of claim 12, further comprising: a signal classification decision module, configured to receive the sound signal type from the PSC module, determine the type of the useful signal based on the open-loop pitch parameter, the ISF parameter and the sub-band energy parameter, or based on the ISF parameter and the sub-band energy parameter, and transmit the determined type of the useful signal.
14. The apparatus of claim 13, further comprising: a classification parameter extraction module, configured to receive the sound signal type from the PSC module and transmit it to the signal classification decision module; acquire an ISF parameter and a sub-band energy parameter, or an open-loop pitch parameter, an ISF parameter and a sub-band energy parameter, process the acquired parameters into signal classification characteristic parameters, and transmit them to the signal classification decision module; and process the acquired parameters into the spectral distribution parameters of the sound signal and of the background noise, and transmit them to the background noise parameter updating module;
the signal classification decision module determines the type of the useful signal according to the signal classification characteristic parameters and the type of the sound signal determined by the PSC module, wherein the type of the useful signal comprises a speech signal and a music signal.
15. The apparatus of claim 13 or 14, wherein the signal classification decision module comprises: a parameter updating sub-module and a decision sub-module; the parameter updating sub-module is configured to update the thresholds used in the signal classification decision according to the signal-to-noise ratio and provide the updated thresholds to the decision sub-module;
the decision sub-module is configured to receive the sound signal type from the PSC module, determine the type of the useful signal based on the open-loop pitch parameter, the ISF parameter, the sub-band energy parameter and the updated thresholds, or based on the ISF parameter, the sub-band energy parameter and the updated thresholds, and send the determined type of the useful signal.
16. The apparatus of claim 13, further comprising: and the coder mode and rate selection module is used for receiving the type of the useful signal from the signal classification judgment module and determining the coding mode and the rate of the sound signal according to the type of the received useful signal.
17. The apparatus of claim 14, further comprising: an encoder parameter extraction module, configured to extract a sub-band energy parameter and transmit it to the classification parameter extraction module, or to extract a sub-band energy parameter and encoder parameters and transmit them to the classification parameter extraction module, and to extract a sub-band energy parameter and transmit it to the PSC module, wherein the encoder parameters comprise an ISF parameter and an open-loop pitch parameter.
CN 200610164456 2006-12-05 2006-12-05 Aural signal classification method and device Active CN100483509C (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN 200610164456 CN100483509C (en) 2006-12-05 2006-12-05 Aural signal classification method and device
PCT/CN2007/003798 WO2008067735A1 (en) 2006-12-05 2007-12-26 A classing method and device for sound signal
EP07855800A EP2096629B1 (en) 2006-12-05 2007-12-26 Method and apparatus for classifying sound signals


Publications (2)

Publication Number Publication Date
CN101197135A CN101197135A (en) 2008-06-11
CN100483509C true CN100483509C (en) 2009-04-29

Family

ID=39491665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200610164456 Active CN100483509C (en) 2006-12-05 2006-12-05 Aural signal classification method and device

Country Status (3)

Country Link
EP (1) EP2096629B1 (en)
CN (1) CN100483509C (en)
WO (1) WO2008067735A1 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1331826A (en) * 1998-12-21 2002-01-16 Qualcomm Incorporated Variable rate speech coding
CN1354455A (en) * 2000-11-18 2002-06-19 ZTE Corporation Voice activity detection method for distinguishing speech and music in a noisy environment
CN1430778A (en) * 2001-03-28 2003-07-16 Mitsubishi Electric Corporation Noise suppressor
CN1624766A (en) * 2000-08-21 2005-06-08 Conexant Systems, Inc. Method for noise-robust classification in speech coding

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5742734A (en) * 1994-08-10 1998-04-21 Qualcomm Incorporated Encoding rate selection in a variable rate vocoder
US6456964B2 (en) * 1998-12-21 2002-09-24 Qualcomm Incorporated Encoding of periodic speech using prototype waveforms
JP3454206B2 (en) * 1999-11-10 2003-10-06 三菱電機株式会社 Noise suppression device and noise suppression method
US6694293B2 (en) * 2001-02-13 2004-02-17 Mindspeed Technologies, Inc. Speech coding system with a music classifier

Also Published As

Publication number Publication date
CN101197135A (en) 2008-06-11
WO2008067735A1 (en) 2008-06-12
EP2096629A1 (en) 2009-09-02
EP2096629B1 (en) 2012-10-24
EP2096629A4 (en) 2011-01-26

Similar Documents

Publication Publication Date Title
CN100483509C (en) Aural signal classification method and device
CN101197130B (en) Voice activity detection method and detector therefor
KR101452014B1 (en) Improved voice activity detector
US6424938B1 (en) Complex signal activity detection for improved speech/noise classification of an audio signal
JP3197155B2 (en) Method and apparatus for estimating and classifying a speech signal pitch period in a digital speech coder
US6993481B2 (en) Detection of speech activity using feature model adaptation
EP1982324B1 (en) A voice detector and a method for suppressing sub-bands in a voice detector
KR100964402B1 (en) Method and Apparatus for determining encoding mode of audio signal, and method and appartus for encoding/decoding audio signal using it
KR100883656B1 (en) Method and apparatus for discriminating audio signal, and method and apparatus for encoding/decoding audio signal using it
RU2609133C2 (en) Method and device to detect voice activity
CN101399039B (en) Method and device for determining non-noise audio signal classification
EP3252771B1 (en) A method and an apparatus for performing a voice activity detection
KR20170060108A (en) Neural network voice activity detection employing running range normalization
WO2008058842A1 (en) Voice activity detection system and method
US6047253A (en) Method and apparatus for encoding/decoding voiced speech based on pitch intensity of input speech signal
WO2006019556A2 (en) Low-complexity music detection algorithm and system
WO2008148321A1 (en) An encoding or decoding apparatus and method for background noise, and a communication device using the same
CN101393741A (en) Audio signal classification apparatus and method used in wideband audio encoder and decoder
RU2127912C1 (en) Method for detection and encoding and/or decoding of stationary background sounds and device for detection and encoding and/or decoding of stationary background sounds
JPH10254476A (en) Voice interval detecting method
JP3331297B2 (en) Background sound / speech classification method and apparatus, and speech coding method and apparatus
CN101393744B (en) Method and device for adjusting the voice activation threshold
JPH08305388A (en) Voice range detection device
Liu et al. Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability
Zhou et al. Real-time endpoint detection algorithm combining time-frequency domain

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant