WO2008067735A1

WO2008067735A1 - A classing method and device for sound signal

Info

Publication number: WO2008067735A1
Application number: PCT/CN2007/003798
Authority: WO
Inventors: Wei Li; Lijing Xu; Qing Zhang; Jianfeng Xu; Shenghu Sang; Zhengzhong Du; Qin Yan; Haojiang Deng; Jun Wang
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2006-12-05
Filing date: 2007-12-26
Publication date: 2008-06-12
Also published as: CN100483509C; EP2096629A1; EP2096629B1; EP2096629A4; CN101197135A

Abstract

A classing method for sound signal includes: receiving the sound signal; determining an updating rate of background noise according to a spectral distribution parameter of the background noise and a spectral distribution parameter of the sound signal; updating the noise parameter according to the updating rate; and classing the sound signal according to a sub-band energy parameter and the updated noise parameter. A classing device for sound signal applies the above method.

Description

Sound signal classification method and device

The present invention relates to the field of speech coding technologies, and in particular, to a sound signal classification method and a sound signal classification device.

Background

In voice communication, only about 40% of the signal contains speech; the rest is silence or background noise. To save transmission bandwidth, speech coding uses voice activity detection (VAD), which allows the encoder to encode background noise and active speech at different rates: background noise at a lower rate and active speech at a higher rate. This reduces the average bit rate and has greatly promoted the development of variable-rate speech coding.

Existing voice activity detectors (VADs) were developed for speech signals and divide the input audio signal into only two types: noise and non-noise. Newer encoders such as AMR-WB+ and SMV also detect music signals, as a correction and supplement to the VAD decision.

An important feature of the AMR-WB+ encoder is that, after VAD detection, it encodes in different modes depending on whether the input audio signal is speech or music, to minimize the bit rate while ensuring encoding quality.

The two coding modes in AMR-WB+ correspond to two core coding algorithms: Algebraic Code Excited Linear Prediction (ACELP) and Transform Coded Excitation (TCX). ACELP is based on a speech production model and makes full use of the characteristics of speech. It encodes speech signals very efficiently and the technology is mature, so extending ACELP to a general audio encoder greatly improves speech coding quality. Similarly, extending TCX to a low-bit-rate speech coder improves the coding quality of wideband music.

The ACELP and TCX mode selection algorithms of AMR-WB+ are of two types according to complexity: an open-loop selection algorithm and a closed-loop selection algorithm. Closed-loop selection corresponds to high complexity and is the default option. It is an exhaustive search based on the perceptually weighted SNR. This selection method is very accurate, but its computational complexity is very high and its code size is large.

The open loop selection includes the following steps:

First, in step 101, the VAD module determines whether the signal is a non-useful signal or a useful signal, based on the tone flag (Tone_flag) and the sub-band energy parameter (Level[n]).

Then, in step 102, a preliminary mode selection (EC) is performed.

In step 103, the mode initially determined in step 102 is corrected by a refined mode selection (ESC), based on the open-loop pitch parameters and the ISF parameters, to determine the selected coding mode.

In step 104, TCXS processing is performed: when the speech coding mode has been selected fewer than three times consecutively, a small-scale closed-loop traversal search is performed to finally determine the coding mode. The speech coding mode is ACELP and the music coding mode is TCX.

In carrying out the invention, the inventors have found that the above AMR-WB+ speech signal selection algorithm has the following disadvantages:

1. The existing VAD module does not distinguish well between noise and some kinds of music signals, which reduces the accuracy of sound signal classification.

2. Calculating the open-loop pitch parameters is necessary for the ACELP coding mode but not for the TCX coding mode. In the structural design of AMR-WB+, the VAD and the open-loop mode selection algorithm both use the open-loop pitch parameters, so the open-loop pitch is calculated for every frame. For non-ACELP coding modes (e.g. TCX) this is redundant complexity, which increases the amount of computation for coding mode selection and reduces efficiency.

3. Although the VAD detection algorithms in current encoders perform well in speech detection and noise immunity, they may mistake the trailing part of some music signals for noise, so that the ending of the music is truncated and sounds unnatural.

4. The AMR-WB+ mode selection algorithm does not consider the signal-to-noise ratio environment of the signal, and its ability to distinguish between speech and music deteriorates further under low-SNR conditions.

Summary of the invention

In view of this, the embodiments of the present invention provide a sound signal classification method and a sound signal classification device, which can improve the accuracy of classification and detection of sound signals.

A sound signal classification method provided by an embodiment of the present invention includes: receiving a sound signal; determining an update rate of the background noise according to a spectrum distribution parameter of the background noise and a spectrum distribution parameter of the sound signal; updating the noise parameter according to the update rate; and classifying the sound signal according to a sub-band energy parameter and the updated noise parameter.

A sound signal classification device provided by an embodiment of the present invention includes: a background noise parameter update module and a signal initial classification (PSC) module;

The background noise parameter update module is configured to determine an update rate of the background noise according to the background noise spectrum distribution parameter and the spectrum distribution parameter of the current sound signal, and to send the determined update rate;

The PSC module is configured to receive the update rate from the background noise parameter update module, update the noise parameter, classify the current sound signal according to the sub-band energy parameter and the updated noise parameter, and send the resulting sound signal type.

In the embodiments of the present invention, the update rate of the background noise is determined and the noise parameter is updated according to the update rate. The signal is then initially classified according to the sub-band energy parameter and the updated noise parameter, to determine the non-useful signals and the useful signals in the received sound signal. This reduces the misjudgment of useful signals as noise and improves the accuracy of sound signal classification.

Drawings

FIG. 1 is a schematic diagram of the open-loop selection of the AMR-WB+ encoding algorithm in the prior art;

FIG. 2 is a general flowchart of a sound signal classification method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the composition of a sound signal classification device according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a system composition based on a specific embodiment of the present invention;

FIG. 5 is a flowchart of an encoder parameter extraction module calculating various parameters according to an embodiment of the present invention;

FIG. 6 is a flowchart of another encoder parameter extraction module calculating various parameters according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a PSC module according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a signal classification decision module determining feature parameters according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of a signal classification decision module performing the speech decision according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of a signal classification decision module performing the music decision according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of a signal classification decision module correcting an initial decision result according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of a signal classification decision module performing a preliminary classification of an uncertain signal according to an embodiment of the present invention;

FIG. 13 is a schematic diagram of a signal classification decision module performing the final classification and correction of a signal according to an embodiment of the present invention;

FIG. 14 is a schematic diagram of the parameter update of the signal classification decision module according to an embodiment of the present invention.

Detailed description

The embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.

In the embodiments of the present invention, the update rate of the background noise is determined according to the spectrum distribution parameter of the current sound signal and the background noise spectrum distribution parameter, and the noise parameter is updated according to the update rate. The useful and non-useful signals in the received sound signal are then determined using the updated noise parameter, so the noise parameter is more accurate when the decision is made and the accuracy of sound signal classification improves.

As shown in FIG. 2, an embodiment of the present invention first provides a sound signal classification method, which includes:

Step 201: Receive a sound signal, and determine an update rate of the background noise according to the background noise spectrum distribution parameter and a spectrum distribution parameter of the sound signal.

Step 202: Update a noise parameter according to the update rate, and classify the sound signal according to the subband energy parameter and the updated noise parameter.

In step 202, the sound signal is classified mainly into a useful signal type or a non-useful signal type. The type of the useful signal, speech or music, may then be further determined. Depending on whether the noise estimate has converged, this determination is based either on the open-loop pitch parameter, the immittance spectral frequency (ISF) parameter and the sub-band energy parameter, or on the ISF parameter and the sub-band energy parameter only.

In addition, to prevent the trailing part of a music signal from being judged a non-useful signal, which degrades the sound quality, the embodiment also obtains the determined useful signal type, sets the signal hangover (trailing protection) length according to that type, and uses the hangover length when determining the useful and non-useful signals in the received sound signal. The hangover of a music signal can be set longer, which improves the sound quality of music.

When the useful signal is determined to be a speech signal or a music signal, a signal that cannot be classified reliably can first be set to an uncertain type. The uncertain type is then corrected according to other parameters, and the type of the useful signal is finally determined.

Since encoding a non-useful signal does not require the ISF parameter, the ISF parameter is not calculated when the signal is determined to be non-useful and the corresponding coding mode does not need it. This reduces the amount of computation in the classification process and improves classification efficiency.

As shown in FIG. 3, an embodiment of the present invention further provides a sound signal classification device, including a background noise parameter update module and a signal initial classification (PSC) module. The background noise parameter update module is configured to determine an update rate of the background noise according to the spectrum distribution parameter of the current sound signal and the background noise spectrum distribution parameter, and to transmit the determined update rate to the PSC module. The PSC module is configured to update the noise parameter according to the update rate from the background noise parameter update module, to initially classify the signal according to the sub-band energy parameter and the updated noise parameter, and to determine whether the received sound signal is of a useful signal type or a non-useful signal type.

The sound signal classification device may further include: a signal classification decision module;

The PSC module also transmits the determined signal type to the signal classification decision module. The signal classification decision module determines the type of the useful signal based on the open-loop pitch parameter, the ISF parameter and the sub-band energy parameter, or based on the ISF parameter and the sub-band energy parameter. The type is either a speech signal or a music signal.

The sound signal classification device may further include: a classification parameter extraction module;

The PSC module transmits the determined signal type to the signal classification decision module through the classification parameter extraction module. The classification parameter extraction module is further configured to obtain the ISF parameter and the sub-band energy parameter, or additionally the open-loop pitch parameter; to process the obtained parameters into signal classification feature parameters and transmit them to the signal classification decision module; and to process the obtained parameters into the spectrum distribution parameter of the sound signal and the background noise spectrum distribution parameter and transmit these to the background noise parameter update module. The signal classification decision module determines the type of the useful signal, speech or music, according to the signal classification feature parameters and the signal type determined by the PSC module.

The PSC module is further operable to transmit the signal-to-noise ratio of the sound signal, calculated while determining the signal type, to the signal classification decision module. The signal classification decision module then also uses the signal-to-noise ratio when determining whether the useful signal is a speech signal or a music signal.

The sound signal classification device may further include: an encoder mode and rate selection module. The signal classification decision module transmits the determined signal type to the encoder mode and rate selection module, which determines the encoding mode and rate of the sound signal according to the received signal type.

The sound signal classification device may further include: an encoder parameter extraction module, configured to extract the ISF parameter and the sub-band energy parameter, or additionally the open-loop pitch parameter, to transmit the extracted parameters to the classification parameter extraction module, and to transmit the extracted sub-band energy parameter to the PSC module.

The sound signal classification method and the sound signal classification device provided in the embodiments of the present invention are described below by way of a specific embodiment.

FIG. 4 is a schematic diagram of a system composition based on a specific embodiment of the present invention. The system includes a sound activity detector (SAD), which divides the input digital audio signal into different classes according to the needs of the encoder: non-useful signal, speech and music. This provides the encoder with a basis for coding mode selection and rate selection.

As can be seen in FIG. 4, the SAD module internally includes a background noise estimation control module, a signal initial classification module, a classification parameter extraction module and a signal classification decision module. As a signal classifier used inside the encoder, the SAD makes full use of the encoder's own parameters to reduce resource consumption and computational complexity. The sub-band energy parameter and the encoder parameters are therefore calculated by the encoder parameter extraction module in the encoder, which provides the calculated parameters to the SAD module. The final output of the SAD module is a signal decision type (non-useful signal, speech or music), which is provided to the encoder mode and rate selection module for selecting the encoder mode and rate.

The following describes the modules in the encoder related to the SAD, the sub-modules in the SAD, and the interaction between these modules. The encoder parameter extraction module in the encoder calculates the sub-band energy parameters and the encoder parameters, and provides the calculated parameters to the SAD module. The sub-band energy parameter can be calculated by filter-bank filtering, and the number of sub-bands is determined by the requirements on computational complexity and classification accuracy. In this embodiment, 12 sub-bands are used.

In this embodiment, the process of the encoder parameter extraction module calculating the parameters required by the various SAD modules may be as shown in FIG. 5 or FIG. 6.

The process shown in Figure 5 includes the following steps:

Step 501: The encoder parameter extraction module first calculates a subband energy parameter.

Step 502: The encoder parameter extraction module determines, according to the signal initial decision result (Vad_flag) from the PSC module, whether an immittance spectral frequency (ISF) operation is required. If so, step 503 is performed; otherwise, step 504 is performed.

Whether to perform an ISF operation in this step is determined as follows. If the current frame is a non-useful signal, the decision follows the mechanism of the encoder: if the encoder requires the ISF parameter to encode the non-useful signal, the ISF operation is performed; if not, the encoder parameter extraction module ends. If the current frame is a useful signal, the ISF operation is performed. Most coding modes require the ISF parameter for useful signals, so calculating it does not add redundant complexity to the encoder. The ISF parameter calculation is described in the documentation of the various encoders and is not repeated here.

Step 503: The encoder parameter extraction module calculates the ISF parameter, and then performs step 504.

Step 504: The encoder parameter extraction module calculates the open-loop pitch parameter.

The sub-band energy parameters calculated by the process of FIG. 5 are provided to the PSC module and the classification parameter extraction module in the SAD, and the remaining parameters are provided to the classification parameter extraction module in the SAD.

The flow shown in FIG. 6 adds to the flow of FIG. 5 a step that decides whether to calculate the open-loop pitch parameter according to whether the initial noise estimate has converged. Steps 601 to 603 are substantially the same as steps 501 to 503 in FIG. 5. In step 604, it is determined whether the noise parameter is still being initialized, that is, whether the noise estimate has yet to converge. If so, the open-loop pitch parameter is calculated in step 605; otherwise the open-loop pitch parameter is not calculated.

The open-loop pitch parameter is redundant for some coding algorithms, such as the TCX coding mode. To reduce computational complexity, once the noise estimate has converged it is assumed that the coding mode corresponding to the signal does not need the open-loop pitch parameter, so the parameter is no longer calculated.

Before the noise estimate converges, the open-loop pitch parameter must be calculated to ensure that the noise estimate can converge and to maintain its convergence speed. This calculation occurs only in the start-up phase, so its complexity can be ignored. The calculation of the open-loop pitch parameter is described for ACELP-based coding and is not repeated here. The noise estimate may be judged to have converged when the number of consecutively determined noise frames exceeds a noise convergence threshold (THR1). In one example of this embodiment, THR1 is 20.
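The decision logic of FIG. 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the encoder's implementation: the names `ExtractionPlan`, `plan_extraction` and `encoder_needs_isf_for_noise` are hypothetical, and the sketch assumes that the open-loop pitch decision depends only on whether the noise estimate has converged (THR1 = 20, as in the example above).

```python
from dataclasses import dataclass

THR1 = 20  # noise convergence threshold from the example in the text


@dataclass
class ExtractionPlan:
    """Which parameters the encoder parameter extraction module computes (hypothetical type)."""
    subband_energy: bool
    isf: bool
    open_loop_pitch: bool


def noise_converged(consecutive_noise_frames: int, thr: int = THR1) -> bool:
    """The estimate counts as converged once the consecutive noise frame count exceeds thr."""
    return consecutive_noise_frames > thr


def plan_extraction(vad_flag: int,
                    consecutive_noise_frames: int,
                    encoder_needs_isf_for_noise: bool = False) -> ExtractionPlan:
    # Step 601: sub-band energies are always computed.
    # Steps 602-603: ISF for useful frames, or for noise frames if the encoder needs it.
    need_isf = vad_flag == 1 or encoder_needs_isf_for_noise
    # Steps 604-605: open-loop pitch only while the noise estimate has not converged
    # (assumption: this decision is independent of the ISF decision).
    need_pitch = not noise_converged(consecutive_noise_frames)
    return ExtractionPlan(True, need_isf, need_pitch)
```

For example, `plan_extraction(vad_flag=0, consecutive_noise_frames=25)` requests only the sub-band energies, which is the case where the redundant pitch and ISF computations are skipped.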

The extracted sub-band energy parameter is level[i], where i is the member index of the vector. In this embodiment i takes the values 1 to 12, corresponding to the sub-bands 0-200 Hz, 200-400 Hz, 400-600 Hz, 600-800 Hz, 800-1200 Hz, 1200-1600 Hz, 1600-2000 Hz, 2000-2400 Hz, 2400-3200 Hz, 3200-4000 Hz, 4000-4800 Hz and 4800-6400 Hz.

The extracted ISF parameters are Isf_n(i), where n is the frame index and i = 1, ..., 16 is the member index in the vector.
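The sub-band layout listed above can be written as a small lookup table. The sketch below is illustrative only: `BAND_EDGES_HZ` and `subband_index` are hypothetical names, the bands are assumed to be contiguous (so the tenth band is taken to end at 4000 Hz), and the returned index is 0-based, matching the level[0] to level[11] indexing used in the energy ratio example later in this description.

```python
import bisect

# Edges in Hz of the 12 sub-bands; band k spans BAND_EDGES_HZ[k] to BAND_EDGES_HZ[k + 1].
BAND_EDGES_HZ = [0, 200, 400, 600, 800, 1200, 1600, 2000, 2400, 3200, 4000, 4800, 6400]


def subband_index(freq_hz: float) -> int:
    """Return the 0-based index of the sub-band that contains freq_hz."""
    if not 0 <= freq_hz < BAND_EDGES_HZ[-1]:
        raise ValueError("frequency outside the 0-6400 Hz analysis range")
    # bisect_right counts the edges that are <= freq_hz; subtract 1 for the band index.
    return bisect.bisect_right(BAND_EDGES_HZ, freq_hz) - 1
```

The bands are narrower at low frequencies (200 Hz wide) and wider at high frequencies (up to 1600 Hz wide), so a plain division by a fixed bandwidth would not work; a binary search over the edges handles the non-uniform layout.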

The extracted open loop pitch parameters include:

The open-loop pitch gain (ol_gain), the open-loop pitch lag (ol_lag) and the tone flag (tone_flag). If the value of ol_gain is greater than the tone threshold (TONE_THR), tone_flag is set to 1.

The signal initial classification (PSC) module can be implemented with various existing VAD algorithm schemes. It includes a background noise estimation sub-module, a signal-to-noise ratio calculation sub-module, a useful signal estimation sub-module, a decision threshold adjustment sub-module, a comparison sub-module, and a useful-signal hangover (trailing) protection sub-module. In this embodiment, as shown in FIG. 7, the PSC module may differ from existing VAD algorithm modules in the following three respects:

I. The signal-to-noise ratio calculation sub-module calculates the signal-to-noise ratio from the background noise sub-band energy estimate and the sub-band energy parameter. Besides being used inside the PSC module, the calculated signal-to-noise ratio (snr) is transmitted to the signal classification decision module, so that the signal classification decision module distinguishes speech from music more accurately under low-SNR conditions.

II. Since existing VADs do not distinguish well between noise and certain kinds of music, this embodiment improves the VAD as follows. The calculation of the background noise parameter is controlled by the update rate acc provided by the background noise parameter update module. The background noise estimation sub-module receives the update rate from the background noise parameter update module, updates the noise parameter, and transmits the background noise sub-band energy estimate calculated from the updated noise parameter to the signal-to-noise ratio calculation sub-module. For the calculation of the update rate, see the description of the background noise parameter update module. In one example of this embodiment, the update rate takes one of four levels: acc1, acc2, acc3 and acc4. Each update rate determines its own update_up and update_down parameters, which are the rates at which the background noise estimate is updated upward and downward, respectively.

The noise parameter update can then follow the scheme in AMR-WB+:

if ( bckr_est_m[n] < level_{m-1}[n] )
    update = update_up
else
    update = update_down

The noise estimate is then updated as:

$$bckr\_est_{m+1}[n] = (1 - update) \cdot bckr\_est_m[n] + update \cdot level_{m-1}[n]$$

The noise spectrum distribution parameter vector is updated as:

$$\bar{p}_{m+1}[i] = (1 - update) \cdot \bar{p}_m[i] + update \cdot p_m[i]$$

where:

m: frame index

n: sub-band index

i: element index of the spectral distribution parameter vector, i = 1, 2, 3, 4

bckr_est: estimated background noise sub-band energy

$\bar{p}$: estimate of the background noise spectral distribution parameter vector

$p$: spectral distribution parameter vector of the current signal
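The update described above amounts to first-order recursive smoothing with a direction-dependent rate. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the comparison direction (a noise estimate below the sub-band level selects update_up) follows the AMR-WB style of VAD and is assumed here, and the caller supplies the update_up and update_down values that correspond to the chosen acc level, since the text does not give numeric values for them.

```python
def update_noise_estimate(bckr_est, level_prev, update_up, update_down):
    """Return the next per-sub-band background noise estimate.

    bckr_est    : current noise estimate per sub-band
    level_prev  : sub-band energies the estimate is moved toward
    update_up   : rate used when the estimate has to rise
    update_down : rate used when the estimate has to fall
    """
    new_est = []
    for est, lev in zip(bckr_est, level_prev):
        update = update_up if est < lev else update_down
        new_est.append((1.0 - update) * est + update * lev)
    return new_est


def update_spectral_distribution(p_noise, p_signal, update):
    """Smooth the noise spectral distribution vector toward the current signal's vector."""
    return [(1.0 - update) * pn + update * ps for pn, ps in zip(p_noise, p_signal)]
```

A small update rate makes the noise estimate track slowly, which protects it from being pulled up by music or speech; a large rate lets it follow genuine changes in the background noise more quickly. This is the trade-off that the four acc levels control.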

III. In existing VADs, useful signals are generally protected from being cut off by a hangover (trailing protection), and the hangover length is a compromise between protecting the signal and improving transmission efficiency. For a traditional speech coder, the hangover length can be a constant obtained by training. A multi-rate encoder, however, handles audio signals including music, which often have long low-energy tails that a conventional VAD has difficulty detecting, so a longer hangover is needed to protect them. In this embodiment, the hangover length in the useful-signal hangover protection sub-module adapts to the SAD signal decision: if the signal is judged to be music (SAD_flag=MUSIC), a longer hangover is set (hang_len=HANG_LONG); if it is judged to be speech (SAD_flag=SPEECH), a shorter hangover is set (hang_len=HANG_SHORT). The specific setting is as follows:

if ( SAD_flag == MUSIC )
    hang_len = HANG_LONG
else if ( SAD_flag == SPEECH )
    hang_len = HANG_SHORT
else
    hang_len = 0
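The adaptive trailing (hangover) protection can be sketched in a few lines. The constants use the example values given for this embodiment (HANG_LONG=100, HANG_SHORT=20, in frames). The function `apply_hangover` is a hypothetical illustration of how such a length might be consumed: it assumes that every active frame reloads a countdown and that the decision stays "useful" until the countdown runs out, which the text does not spell out.

```python
HANG_LONG = 100   # frames, example value for music
HANG_SHORT = 20   # frames, example value for speech


def hangover_length(sad_flag: str) -> int:
    """Trailing protection length chosen from the SAD decision."""
    if sad_flag == "MUSIC":
        return HANG_LONG
    if sad_flag == "SPEECH":
        return HANG_SHORT
    return 0


def apply_hangover(raw_vad, sad_flags):
    """Extend raw per-frame VAD decisions by the adaptive hangover (assumed behaviour)."""
    remaining = 0
    smoothed = []
    for active, sad_flag in zip(raw_vad, sad_flags):
        if active:
            remaining = hangover_length(sad_flag)  # reload on every active frame
            smoothed.append(1)
        elif remaining > 0:
            remaining -= 1                         # still inside the protected tail
            smoothed.append(1)
        else:
            smoothed.append(0)
    return smoothed
```

With these values a single active music frame keeps the following 100 frames marked as useful, against 20 for speech, which is how a long, low-energy musical tail avoids being truncated.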

In this setting, SAD_flag is the SAD decision flag and hang_len is the hangover (trailing) protection length.

In an example of this embodiment, HANG_LONG=100 and HANG_SHORT=20, and the unit may be frames.

The classification parameter extraction module calculates the parameters required by the signal classification decision module and the background noise parameter update module, based on the Vad_flag parameter determined by the signal initial classification module and on the sub-band energy parameter, the ISF parameter and the open-loop pitch parameter provided by the encoder parameter extraction module. It provides the sub-band energy parameter, the ISF parameter, the open-loop pitch parameter and the calculated parameters to the signal classification decision module and the background noise parameter update module. The parameters calculated by the classification parameter extraction module include:

1. Pitch parameter (pitch)

The differences between consecutive open-loop pitch lags are compared. If the change in the open-loop pitch lag is less than a set threshold, the lag count is incremented. If the sum of the lag counts of two consecutive frames is sufficiently large, pitch=1; otherwise pitch=0. The calculation of the open-loop pitch lag is given in the AMR-WB+/AMR-WB standard documents.

2. Long-term signal correlation value parameter (meangain)

meangain is the moving average of tone over three adjacent frames, where tone=1000*tone_flag and tone_flag is defined as in AMR-WB+.

3. Zero-crossing rate (zcr)

The indicator term in the zcr expression takes the value 1 when its condition A is true and 0 when it is false.

4. Sub-band energy time-domain fluctuation (t_flux)

t_flux is normalized by short_mean_level_energy, where short_mean_level_energy represents the short-term average energy.

5. High-to-low sub-band energy ratio (ra)

$$ra = \frac{sublevel\_high\_energy}{sublevel\_low\_energy}$$

where, in an example of this embodiment:

sublevel_high_energy = level[10] + level[11];

sublevel_low_energy = level[0] + level[1] + level[2] + level[3] + level[4] + level[5] + level[6] + level[7] + level[8] + level[9];

6. Sub-band energy frequency-domain fluctuation (f_flux)

$$f\_flux = \frac{\sum_i \left| level_m(i) - level_m(i-1) \right|}{short\_mean\_level\_energy}$$

7. ISF distance short-term average (isf_meanSD)

isf_meanSD is the average of the ISF distance Isf_SD over five adjacent frames.

8. Sub-band energy standard deviation average (level_meanSD)

level_meanSD is the average of the energy standard deviation (level_SD) over two adjacent frames. The level_SD parameter is calculated in the same way as Isf_SD.
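Several of the parameters listed above reduce to a few lines of arithmetic. The sketch below is illustrative and rests on assumptions: all function and class names are hypothetical, `level` is a 0-indexed list of 12 sub-band energies as in the example for ra, and the zero-crossing rate uses a common textbook definition (the fraction of adjacent sample pairs whose product is negative), which may differ from the exact expression intended in this embodiment.

```python
from collections import deque


def energy_ratio(level):
    """ra: energy of the two highest sub-bands over the energy of the ten lowest."""
    high = level[10] + level[11]
    low = sum(level[0:10])
    return high / low


def frequency_flux(level, short_mean_level_energy):
    """f_flux: summed absolute difference between neighbouring sub-bands, normalized."""
    diff = sum(abs(level[i] - level[i - 1]) for i in range(1, len(level)))
    return diff / short_mean_level_energy


def zero_crossing_rate(samples):
    """Assumed definition: fraction of adjacent sample pairs with a sign change."""
    crossings = sum(1 for i in range(1, len(samples)) if samples[i] * samples[i - 1] < 0)
    return crossings / (len(samples) - 1)


class ShortTermMean:
    """Mean of the last n values, e.g. n=5 for isf_meanSD or n=2 for level_meanSD.

    Until n values have been seen, the mean is taken over the values available so far.
    """

    def __init__(self, n):
        self.window = deque(maxlen=n)

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)
```

A practical implementation would also guard against a zero denominator in `energy_ratio` and `frequency_flux`, for instance on digitally silent input; that guard is omitted here to keep the correspondence with the formulas visible.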

Among the above eight parameters, those provided to the background noise parameter update module are zcr, ra, f_flux and t_flux. Those provided to the signal classification decision module are pitch, meangain, isf_meanSD and level_meanSD.

The signal classification decision module uses snr and Vad_flag from the signal initial classification (PSC) module, together with the sub-band energy parameter, pitch, meangain, isf_meanSD and level_meanSD from the classification parameter extraction module, to finally classify the signal as a non-useful signal (NOISE), a speech signal (SPEECH) or a music signal (MUSIC).

The signal classification decision module may include a parameter update sub-module and a decision sub-module. The parameter update sub-module updates the thresholds used in the signal classification decision according to the signal-to-noise ratio and provides the updated thresholds to the decision sub-module. The decision sub-module receives the sound signal type from the PSC module and, for a useful signal, determines its type based on the open-loop pitch parameter, the ISF parameter, the sub-band energy parameter and the updated thresholds, or based on the ISF parameter, the sub-band energy parameter and the updated thresholds. It then transmits the determined type of the useful signal to the encoder mode and rate selection module.

Determining whether the useful signal is a speech signal or a music signal comprises the following. The speech flag and the music flag are first set to 0. The signal is then preliminarily judged to be of speech type, music type or uncertain type according to the pitch parameter flag, the long-term signal correlation value, the ISF distance short-term average parameter and the sub-band energy standard deviation average parameter, and the speech flag or the music flag is modified according to the preliminarily determined speech or music type. The preliminary speech, music or uncertain type is then corrected according to the sub-band energy, the long-term signal correlation value, the sub-band energy standard deviation average parameter, speech_flag, music_flag, whether the number of consecutive frames with a pitch value of 1 exceeds the preset hangover frame-count threshold, the number of consecutive music frames, the number of consecutive speech frames and the type of the previous frame. This determines the type of the useful signal, which is either a speech signal or a music signal.

The following describes the specific process of determining a useful signal as a speech signal or a music signal:

In order to ensure the stability of the signal decision and avoid frequent conversion of the decision result, the embodiment provides a flag tailing mechanism for the parameter, including pitch_flag, level_meanSD_high_flag, ISF-meanSD_high_flag, ISF-meanSD-low — flag, level—meanSD—low—flag, meangain—flag These feature parameter values are determined according to the trailing mechanism. The specific determination of these feature parameter values is shown in Figure 8.

The length of the hangover period in Fig. 8 is determined by the hangover parameter flag value. This embodiment provides two hangover schemes, that is, two ways of determining the hangover parameter flag value:

In the first hangover scheme, each time the parameter value is higher (or lower) than a given threshold, the corresponding parameter hangover counter is incremented by one; otherwise, the counter is reset to 0. Different parameter hangover flag values are then set according to the counter value: the larger the counter value, the longer the hangover indicated by the flag. The specific mapping from counter value to hangover flag value is chosen according to the actual situation and is not described further here.

In the second hangover scheme, the hangover length is controlled according to the error rate (ER) of each internal node of the decision tree obtained when training on the corresponding parameter: a parameter with a small error rate is given a short hangover, and a parameter with a large error rate is given a long hangover.
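The first hangover scheme can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the class name `HangoverTracker`, the `above` switch, and in particular the counter-to-length mapping in `hangover_length` (one extra hangover frame per `step` consecutive crossings, capped at `max_len`) are assumptions, since the text leaves that mapping to the actual situation.

```python
from dataclasses import dataclass


@dataclass
class HangoverTracker:
    """Counts consecutive frames on which a parameter crosses its threshold."""

    threshold: float
    above: bool = True  # True: count frames with value > threshold; False: value < threshold
    counter: int = 0

    def update(self, value: float) -> int:
        """Increment the counter on a crossing, otherwise reset it to 0."""
        crossed = value > self.threshold if self.above else value < self.threshold
        self.counter = self.counter + 1 if crossed else 0
        return self.counter

    def hangover_length(self, step: int = 5, max_len: int = 4) -> int:
        """Hypothetical mapping: a larger counter yields a longer hangover."""
        return min(self.counter // step, max_len)
```

A tracker built with `HangoverTracker(threshold=1500.0)` and fed one value per frame keeps its counter at 0 until the value exceeds 1500, then grows by one per frame until the first frame that falls back below the threshold.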

Thereafter, if the current signal is classified as a useful signal, an initial classification of speech and music is performed:

First, the initial speech decision is made, as shown in FIG. 9. In step 901, the speech flag (speech_flag) is set to 0. Then, in step 902, it is determined whether ISF_meanSD is greater than a preset first ISF speech threshold (for example, 1500); if so, the speech flag is set to 1; otherwise,

In step 903, it is determined whether the pitch value is 1 and the pitch delay value t_top_mean obtained by the open-loop pitch search is smaller than the pitch speech threshold (for example, 40); if so, the speech flag is set to 1; otherwise,

In step 904, it is determined whether the number of consecutive frames whose pitch value is 1 exceeds a preset hangover frame threshold (for example, 2 frames); if so, the speech flag is set to 1; otherwise,

In step 905, it is determined whether meangain is greater than a preset long-term correlation speech threshold (for example, 8000); if so, the speech flag is set to 1; otherwise, in step 906, it is determined whether either or both of level_meanSD_high_flag and ISF_meanSD_high_flag have the value 1; if so, the speech flag is set to 1; otherwise, the speech flag is left unchanged.

Then, the initial music decision is made, as shown in Figure 10. In step 1001, the music flag (music_flag) is first set to 0. Then, in step 1002, it is determined whether ISF_meanSD_low_flag = 1 and level_meanSD_low_flag = 1 hold at the same time; if so, the music flag music_flag is set to 1; otherwise, the music flag is left unchanged.
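The two initial decisions (FIG. 9 and FIG. 10) amount to a short chain of threshold tests. The sketch below is illustrative only: the `Features` container and its field names are assumptions made for this example, and the thresholds are the example values quoted in the text (1500, 40, 2 frames, 8000).

```python
from dataclasses import dataclass


@dataclass
class Features:
    isf_mean_sd: float
    pitch: int          # 1 when the frame is flagged as pitched
    t_top_mean: float   # pitch delay from the open-loop pitch search
    pitch_run: int      # consecutive frames with pitch == 1
    meangain: float
    level_mean_sd_high_flag: int = 0
    isf_mean_sd_high_flag: int = 0
    level_mean_sd_low_flag: int = 0
    isf_mean_sd_low_flag: int = 0


def initial_speech_flag(f: Features) -> int:
    """Steps 901-906: the flag starts at 0 and any satisfied test sets it to 1."""
    if f.isf_mean_sd > 1500:                        # step 902
        return 1
    if f.pitch == 1 and f.t_top_mean < 40:          # step 903
        return 1
    if f.pitch_run > 2:                             # step 904
        return 1
    if f.meangain > 8000:                           # step 905
        return 1
    if f.level_mean_sd_high_flag == 1 or f.isf_mean_sd_high_flag == 1:  # step 906
        return 1
    return 0


def initial_music_flag(f: Features) -> int:
    """Steps 1001-1002: both low flags must be 1 at the same time."""
    return int(f.isf_mean_sd_low_flag == 1 and f.level_mean_sd_low_flag == 1)
```

Both functions are independent, so a frame can end up with both flags at 1 or both at 0; the correction stage that follows resolves those cases.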

Thereafter, as shown in Figure 11, the initial decision result is corrected:

First, in step 1101, it is determined whether the instantaneous sub-band energy is less than the sub-band energy threshold (for example, 5000); if so, step 1102 is performed; otherwise, the signal is determined to be of the uncertain class (UNCERTAIN);

In step 1102, it is determined whether meangain_flag = 1 and the consecutive music frame counter is smaller than the music-continuity speech decision threshold (for example, 3); if so, the signal is determined to be a speech signal; otherwise,

In step 1103, it is determined whether ISF_meanSD is greater than a preset second ISF speech threshold (for example, 2000); if so, the signal is determined to be a speech signal; otherwise, in step 1104, it is determined whether level_energy is less than 10000 and the number of preceding frames judged to be noise exceeds five; if so, the current signal class is set to the uncertain class, which reduces the misclassification of noise as music; otherwise,

In step 1105, it is determined whether the music flag and the speech flag are both 1; if so, the current signal class is determined to be the uncertain class; otherwise,

In step 1106, it is determined whether the music flag and the speech flag are both 0; if so, the current signal class is determined to be the uncertain class; otherwise,

In step 1107, it is determined whether the music flag is 0 and the speech flag is 1; if so, the current signal class is determined to be the speech class; otherwise,

In step 1108, since the music flag is 1 and the speech flag is 0, the current signal class is determined to be the music class.

After the signal is determined to be of the uncertain class in step 1104, 1105, or 1106, step 1109 is performed: it is determined whether pitch_flag = 1, ISF_meanSD is smaller than the ISF music threshold (for example, 900), and the number of consecutive speech frames is less than 3; if so, the signal is determined to be of the music class; otherwise, the signal remains in the uncertain class;

After the signal is determined to be of the speech class in the above steps 1103 and 1107, step 1110 is performed: it is determined whether the number of consecutive music frames is greater than 3 and ISF_meanSD is smaller than the ISF music threshold; if so, the signal is determined to be a music signal; otherwise, the signal is determined to be a speech signal.
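Steps 1105 to 1108 reduce to a truth table over the two flags. The following sketch covers only that part of FIG. 11 (the energy, noise-count, and continuity tests of the other steps are left out); the enum and function names are assumptions made for illustration.

```python
from enum import Enum


class SignalClass(Enum):
    SPEECH = "speech"
    MUSIC = "music"
    UNCERTAIN = "uncertain"


def combine_flags(speech_flag: int, music_flag: int) -> SignalClass:
    """Map the two initial flags to a class, following steps 1105-1108."""
    if speech_flag == 1 and music_flag == 1:   # step 1105: conflicting evidence
        return SignalClass.UNCERTAIN
    if speech_flag == 0 and music_flag == 0:   # step 1106: no evidence
        return SignalClass.UNCERTAIN
    if speech_flag == 1:                       # step 1107: music flag is 0 here
        return SignalClass.SPEECH
    return SignalClass.MUSIC                   # step 1108: music 1, speech 0
```

Only the two unambiguous flag combinations yield a definite class here; both ambiguous combinations are passed on as uncertain to the later correction steps.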

After speech signals and music signals have been determined through the above process, the flow shown in FIG. 12 is executed for signals that are still in the uncertain class, performing a preliminary correction of the classification. First, in step 1201, it is determined whether level_energy is smaller than the sub-band energy uncertain-class threshold (for example, 5000); if so, the signal is still determined to be of the uncertain class; otherwise, in step 1202, it is determined whether the number of consecutive music frames is greater than 1 and ISF_meanSD is smaller than the ISF music threshold; if so, the signal is determined to be music; otherwise:

The speech and music hangover flags are cleared. If the preceding frames were continuously of the speech class with strong continuity, the frame is tested against the characteristic parameters of speech, and if the speech condition is met, the speech hangover flag is set, speech_hangover_flag = 1 (steps 1203 to 1206 in FIG. 12). If the preceding frames were continuously of the music class with strong continuity, the frame is tested against the characteristic parameters of music, and if the music condition is met, the music hangover flag is set, music_hangover_flag = 1 (steps 1207 to 1210 in FIG. 12).

Thereafter, as shown in steps 1211 to 1216 in FIG. 12: if the speech hangover flag is 1 and the music hangover flag is 0, the current signal class is set to the speech class; if the music hangover flag is 1 and the speech hangover flag is 0, the current signal class is set to the music class; if the speech hangover flag and the music hangover flag are both 1 or both 0, the signal class is set to the uncertain class. In that case, if the preceding music continuity exceeds 20 frames, the signal is determined to be of the music class, and if the preceding speech continuity exceeds 20 frames, the signal is determined to be of the speech class.
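The resolution in steps 1211 to 1216 can be sketched as below. The function name, the argument names, and the order in which the two continuity tests are tried (music first, as in the text) are assumptions of this illustration; the 20-frame limit is the value quoted above.

```python
def resolve_hangover(speech_hangover_flag: int, music_hangover_flag: int,
                     music_run: int, speech_run: int, run_limit: int = 20) -> str:
    """Return 'SPEECH', 'MUSIC', or 'UNCERTAIN' from the two hangover flags.

    music_run / speech_run: number of preceding consecutive music / speech frames.
    """
    if speech_hangover_flag == 1 and music_hangover_flag == 0:
        return "SPEECH"
    if music_hangover_flag == 1 and speech_hangover_flag == 0:
        return "MUSIC"
    # Both flags equal (both 1 or both 0): uncertain unless the context is strong.
    if music_run > run_limit:
        return "MUSIC"
    if speech_run > run_limit:
        return "SPEECH"
    return "UNCERTAIN"
```

For example, two flags that are both 0 after 25 consecutive music frames resolve to music, while the same flags after only 5 frames of either class stay uncertain.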

After the above preliminary correction, the final correction of the useful signal type is performed as shown in FIG. 13, where the class continues to be modified according to the current context. In step 1301, if the current context is music with strong persistence, namely longer than 3 seconds (the current number of consecutive music frames exceeds 150), a forced correction may be applied according to the value of ISF_meanSD to determine the music signal type. In step 1302, if the current context is speech with strong persistence, namely longer than 3 seconds (the current number of consecutive speech frames exceeds 150), a forced correction may be applied according to the value of ISF_meanSD to determine the speech signal type. Thereafter, if the signal class is still uncertain, then in step 1303 the signal class is modified according to the previous context, that is, the currently uncertain signal is assigned to the class of the previous signal.

After the class of the useful signal has been determined through the above process, the three class counters and the thresholds in the signal classification decision module need to be updated. For the three class counters, if the current classification is music (signal_sort = music), the music counter music_countinue_counter is incremented by 1; otherwise it is cleared. The other class counters are handled similarly, as shown in Fig. 14, and are not described in detail here. The thresholds are updated according to the signal-to-noise ratio output by the signal initial classification module. The example thresholds listed in this embodiment are values learned under a 20 dB signal-to-noise ratio.

The background noise parameter update module uses some of the spectral distribution parameters calculated among the classification parameters in the SAD to control the update rate of the background noise. In an actual application environment, the energy level of the background noise may rise suddenly, so that the signal is continuously judged to be a useful signal and the background noise estimate can no longer be updated; the background noise parameter update module is provided to solve this problem.

The background noise parameter update module calculates the relevant spectral distribution parameter vector from the parameters supplied by the classification parameter extraction module. The vector includes the following elements:

Short-term average of the zero-crossing rate, zcr

Short-term average of the high and low sub-band energy ratio, ra

Short-term average of the sub-band energy frequency-domain fluctuation, f-flux

Short-term average of the sub-band energy time-domain fluctuation, t-flux

The short-term average of zcr, zcr_mean, is calculated as follows; the other elements are calculated similarly:

$$\mathrm{zcr\_mean}_m = \mathrm{ALPHA} \cdot \mathrm{zcr\_mean}_{m-1} + (1 - \mathrm{ALPHA}) \cdot \mathrm{zcr}_m$$

Where ALPHA=0.96, m represents the frame index.
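The short-term average is a first-order recursive (exponential) smoother, which can be sketched as follows. The function names are illustrative, and the starting value `initial` is an assumption, since the text does not say how the average is initialized.

```python
ALPHA = 0.96  # smoothing factor given in the text


def short_term_mean(prev_mean: float, current: float, alpha: float = ALPHA) -> float:
    """One step of mean_m = alpha * mean_{m-1} + (1 - alpha) * value_m."""
    return alpha * prev_mean + (1.0 - alpha) * current


def smooth(values, initial: float = 0.0, alpha: float = ALPHA):
    """Apply the recursion frame by frame; `initial` is a hypothetical start value."""
    mean = initial
    out = []
    for v in values:
        mean = short_term_mean(mean, v, alpha)
        out.append(mean)
    return out
```

With ALPHA = 0.96, each new frame contributes only 4% of the updated average, so the average follows slow changes in a parameter and largely ignores single-frame outliers.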

This embodiment exploits the fact that the spectral characteristics of background noise are relatively stable; the members of the spectral distribution parameter vector need not be limited to the four listed above. The update rate of the current background noise is controlled by the difference between the current spectral distribution parameters and the estimated background noise spectral distribution parameters. This difference can be computed with measures such as the Euclidean distance or the Manhattan distance. One example of this embodiment uses the Manhattan distance (the sum of the absolute differences of the vector elements), namely:

$$d = \sum_{i} \left| p_i - \hat{p}_i \right|$$

where $p$ is the spectral distribution parameter vector of the current signal and $\hat{p}$ is the estimated background noise spectral distribution parameter vector.

In an example of this embodiment, when the difference d between the two spectral distribution parameter vectors is less than TH1, the module outputs update rate acc1, which represents the fastest update rate; otherwise, when d is less than TH2, update rate acc2 is output; otherwise, when d is less than TH3, update rate acc3 is output; otherwise, update rate acc4 is output. Here, TH1, TH2, TH3 and TH4 are update thresholds, which are determined according to the actual environmental conditions.
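The distance computation and the threshold ladder can be sketched together as below. This is an illustration under assumptions: the function names are invented here, the threshold and rate values are supplied by the caller because the text leaves them to the environment, and only the three thresholds that appear in the comparisons are used.

```python
from typing import Sequence


def manhattan_distance(current: Sequence[float], noise: Sequence[float]) -> float:
    """Sum of absolute element-wise differences between two parameter vectors."""
    if len(current) != len(noise):
        raise ValueError("vectors must have the same length")
    return sum(abs(c - n) for c, n in zip(current, noise))


def select_update_rate(distance: float, thresholds: Sequence[float], rates: Sequence):
    """Return the first rate whose threshold exceeds the distance.

    thresholds: (TH1, TH2, TH3) in ascending order; rates: (acc1, acc2, acc3, acc4).
    """
    if len(rates) != len(thresholds) + 1:
        raise ValueError("need exactly one more rate than thresholds")
    for th, rate in zip(thresholds, rates):
        if distance < th:
            return rate
    return rates[-1]  # distance is at or above every threshold: slowest update
```

A small distance means the current frame looks spectrally like the noise estimate, so the fastest rate (acc1) is chosen; a large distance suggests a useful signal, and the noise estimate is updated slowly.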

In the embodiments of the present invention, the update rate of the background noise is determined, the noise parameter is updated according to that rate, and the signal is then initially classified according to the sub-band energy parameter and the updated noise parameter, so that the received sound signal is divided into non-useful signals and useful signals. This reduces the misjudgment of useful signals as noise and improves the accuracy of sound signal classification.

From the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

The above is a description of specific embodiments of the present invention, and the method of the present invention may be appropriately modified in a specific implementation process to suit the specific needs of a specific situation. Therefore, it is to be understood that the specific embodiments of the present invention are not intended to limit the scope of the invention.

Claims

1. A method for classifying a sound signal, the method comprising:

A. receiving a sound signal, and determining an update rate of the background noise according to a spectral distribution parameter of the background noise and a spectral distribution parameter of the sound signal;

B. updating the noise parameter according to the update rate, and classifying the sound signal according to the sub-band energy parameter and the updated noise parameter.

2. The method according to claim 1, wherein the step B further comprises:

C. for the useful signal obtained by the classification, determining the type of the useful signal based on the open-loop pitch parameter, the immittance spectral frequency (ISF) parameter, and the sub-band energy parameter, the type including a speech signal and a music signal.

3. The method according to claim 2, wherein before the step C, the method further comprises:

C0. detecting whether the noise estimate converges; if yes, performing step C1; otherwise, performing the step C;

C1. for the useful signal obtained by the classification, determining the type of the useful signal based on the ISF parameter and the sub-band energy parameter, the type including a speech signal and a music signal.

4. The method according to claim 3, wherein in the step C0, detecting whether the noise estimate converges comprises: determining whether the number of consecutive noise frames preceding the received sound signal exceeds a preset noise convergence threshold; if so, determining that the noise estimate converges; otherwise, determining that the noise estimate does not converge.

5. The method according to claim 2, wherein the step B further comprises: obtaining the determined useful signal type, determining a signal hangover length based on the useful signal type, and further classifying the sound signal based on the signal hangover length.

6. The method according to claim 2, wherein the step C comprises: initializing a speech flag and a music flag; then preliminarily determining the type of the useful signal, which is a speech type, a music type, or an uncertain type, according to the pitch parameter flag, the long-term signal correlation parameter, the ISF distance short-term average parameter, the sub-band energy standard deviation average parameter, and the corresponding thresholds, and modifying the speech flag and the music flag according to the preliminarily determined speech type or music type;

and correcting the preliminarily determined speech type, music type, or uncertain type according to the sub-band energy, the long-term signal correlation parameter, the sub-band energy standard deviation average parameter, the speech flag, the music flag, whether the number of consecutive frames whose pitch parameter flag value is 1 exceeds the preset hangover frame threshold, the number of consecutive music frames, the number of consecutive speech frames, the type of the previous frame, and the corresponding thresholds, so as to finally determine the type of the useful signal, the type including a speech signal and a music signal.

7. The method according to claim 6, wherein the threshold is adjusted according to a signal to noise ratio of the sound signal.

8. The method according to claim 1, wherein after the step B, the method further comprises:

D. determining the coding mode of the non-useful signal obtained by the classification, and determining, according to the determined coding mode, whether the ISF parameter needs to be calculated.

9. The method according to claim 1, wherein the noise parameter in step B comprises: a noise estimation parameter and a noise spectrum distribution parameter.

10. The method according to claim 1 or 9, wherein the step A comprises: calculating a difference parameter between the spectral distribution parameter of the sound signal and the spectral distribution parameter of the background noise, and then determining the update rate according to the difference parameter.

11. The method according to claim 10, wherein the spectral distribution parameters used in calculating the difference parameter comprise: a zero-crossing rate short-term average parameter, a high and low sub-band energy ratio short-term average parameter, a sub-band energy frequency-domain fluctuation short-term average parameter, and a sub-band energy time-domain fluctuation short-term average parameter.

12. A sound signal classification device, the device comprising: a background noise parameter update module and a signal initial classification (PSC) module;

the PSC module is configured to receive an update rate from the background noise parameter update module, update a noise parameter, classify the current sound signal according to the sub-band energy parameter and the updated noise parameter, and send the sound signal type determined by the classification.

13. The apparatus according to claim 12, wherein the apparatus further comprises:

a signal classification decision module, configured to receive a sound signal type from the PSC module, determine the type of the useful signal based on the open-loop pitch parameter, the ISF parameter, and the sub-band energy parameter, or based on the ISF parameter and the sub-band energy parameter, the type including a speech signal and a music signal, and transmit the determined type of the useful signal.

14. The apparatus according to claim 13, wherein the apparatus further comprises:

a classification parameter extraction module, configured to: receive a sound signal type from the PSC module and transmit the sound signal type to the signal classification decision module; obtain the ISF parameter and the sub-band energy parameter, or additionally obtain the open-loop pitch parameter, process the obtained parameters into signal classification feature parameters, and transmit them to the signal classification decision module; and process the obtained parameters into a spectral distribution parameter of the sound signal and a spectral distribution parameter of the background noise, and transmit these spectral distribution parameters to the background noise parameter update module;

wherein the signal classification decision module determines the type of the useful signal according to the signal classification feature parameters and the sound signal type determined by the PSC module, the type including a speech signal and a music signal.

15. The apparatus according to claim 13 or 14, wherein the PSC module comprises: a background noise estimation sub-module, a signal-to-noise ratio calculation sub-module, a useful signal estimation sub-module, a decision threshold adjustment sub-module, a comparison sub-module, and a useful signal hangover protection sub-module;

the background noise estimation sub-module is configured to receive an update rate from the background noise parameter update module, update a noise parameter, and transmit a background noise sub-band energy estimation parameter calculated from the updated noise parameter to the signal-to-noise ratio calculation sub-module;

the signal-to-noise ratio calculation sub-module is configured to receive the background noise sub-band energy estimation parameter, calculate a signal-to-noise ratio from that parameter and the sub-band energy parameter, and transmit the signal-to-noise ratio to the signal classification decision module;

the signal classification decision module comprises a parameter update sub-module and a decision sub-module; the parameter update sub-module is configured to update a threshold in the signal classification decision process according to the signal-to-noise ratio and provide the updated threshold to the decision sub-module;

the decision sub-module is configured to receive a sound signal type from the PSC module and, for the useful signal therein, determine the type of the useful signal based on the open-loop pitch parameter, the ISF parameter, the sub-band energy parameter, and the updated threshold, or based on the ISF parameter, the sub-band energy parameter, and the updated threshold, and transmit the determined type of the useful signal.

16. The apparatus according to claim 13, wherein the apparatus further comprises:

an encoder mode and rate selection module, configured to receive the type of the useful signal from the signal classification decision module and determine the encoding mode and rate of the sound signal according to the received type of the useful signal.

17. The apparatus according to claim 14, wherein the apparatus further comprises:

an encoder parameter extraction module, configured to extract the ISF parameter and the sub-band energy parameter, or additionally extract the open-loop pitch parameter, transmit the extracted parameters to the classification parameter extraction module, and transmit the extracted sub-band energy parameter to the PSC module.