WO2008067735A1 - A classing method and device for sound signal - Google Patents


Info

Publication number
WO2008067735A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
signal
module
type
noise
Prior art date
Application number
PCT/CN2007/003798
Other languages
French (fr)
Chinese (zh)
Inventor
Wei Li
Lijing Xu
Qing Zhang
Jianfeng Xu
Shenghu Sang
Zhengzhong Du
Qin Yan
Haojiang Deng
Jun Wang
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to EP07855800A (patent EP2096629B1)
Publication of WO2008067735A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16: Vocoder architecture
    • G10L19/18: Vocoders using multiple modes
    • G10L19/20: Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding

Definitions

  • The present invention relates to the field of speech coding technologies, and in particular to a sound signal classification method and a sound signal classification device.

Background

  • Voice activity detection (VAD) is used in speech coding. It allows the encoder to encode background noise at a lower rate and active speech at a higher rate, thereby reducing the average bit rate, and it has greatly promoted the development of variable-rate speech coding technology.
  • After VAD detection, the AMR-WB+ encoder encodes in different modes depending on whether the input audio signal is speech or music, in order to minimize the bit rate while ensuring the encoding quality.
  • The two coding modes in AMR-WB+ correspond to two core coding algorithms: Algebraic Code Excited Linear Prediction (ACELP) and Transform Coded Excitation (TCX).
  • ACELP is based on a speech production model and makes full use of the characteristics of speech. It has high coding efficiency for speech signals, and the technology is mature, so extending it to a general audio encoder greatly improves the speech coding quality.
  • Likewise, the encoding quality of wideband music is improved by extending a low-bit-rate speech coder with TCX encoding.
  • Closed-loop selection corresponds to high complexity and is the default option. It is an exhaustive search based on the perceptually weighted SNR. This selection method is very accurate, but its computational complexity is very high and the code size is also large.
  • The open-loop selection includes the following steps:
  • The VAD module determines whether the signal is a non-useful signal or a useful signal based on the tone flag (Tone_flag) and the sub-band energy parameter (Level[n]).
  • In step 102, preliminary mode selection (EC) is performed.
  • The mode preliminarily determined in step 102 is then corrected, and refined mode selection (ESC) is performed based on the open-loop pitch parameters and the ISF parameters to determine the selected coding mode.
  • In step 104, TCXS processing is performed: when the speech signal encoding mode has been selected fewer than three times consecutively, a small-scale closed-loop traversal search is performed to finally determine the encoding mode. The speech signal encoding mode is ACELP, and the music signal encoding mode is TCX.
  • Although the VAD detection algorithms in current encoders perform well in speech detection and noise immunity, in the tail of some special music signals they may mistake the music signal for noise. The ending of the music is then truncated, which sounds unnatural.
  • The AMR-WB+ mode selection algorithm does not consider the signal-to-noise ratio environment of the signal, and its performance in distinguishing speech from music deteriorates further under low SNR conditions.

Summary of the invention

  • The embodiments of the present invention provide a sound signal classification method and a sound signal classification device that can improve the accuracy of classification and detection of sound signals.
  • A sound signal classification detection method provided by an embodiment of the present invention includes: receiving a sound signal; determining an update rate of the background noise according to a background noise spectrum distribution parameter and a spectrum distribution parameter of the sound signal; updating a noise parameter according to the update rate; and classifying the sound signal based on the sub-band energy parameter and the updated noise parameter.
  • A sound signal classification device provided by an embodiment of the present invention includes a background noise parameter update module and a signal initial classification (PSC) module.
  • The background noise parameter update module is configured to determine an update rate of the background noise according to the background noise spectrum distribution parameter and the spectrum distribution parameter of the current sound signal, and to send the determined update rate.
  • The PSC module is configured to receive the update rate from the background noise parameter update module, update the noise parameter, classify the current sound signal according to the sub-band energy parameter and the updated noise parameter, and send the resulting sound signal type.
  • In the embodiments, the update rate of the background noise is determined, the noise parameter is updated according to the update rate, and the signal is then initially classified according to the sub-band energy parameter and the updated noise parameter to determine the non-useful signal and the useful signal in the received sound signal. This reduces the misjudgment of a useful signal as a noise signal and improves the accuracy of sound signal classification.
  • FIG. 1 is a schematic diagram of the open-loop selection of the AMR-WB+ encoding algorithm in the prior art.
  • FIG. 2 is a general flowchart of a sound signal classification detection method according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of the composition of a sound signal classification device according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a system composition based on a specific embodiment of the present invention.
  • FIG. 5 is a flowchart of an encoder parameter extraction module calculating various parameters according to an embodiment of the present invention.
  • FIG. 6 is a flowchart of another encoder parameter extraction module calculating various parameters according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a PSC module according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of the signal classification decision module determining feature parameters according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of the signal classification decision module performing the voice decision according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of the signal classification decision module performing the music decision according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of the signal classification decision module correcting the initial decision result according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of the signal classification decision module performing a preliminary classification of an uncertain signal according to an embodiment of the present invention.
  • FIG. 13 is a schematic diagram of the signal classification decision module performing the final classification and correction of a signal according to an embodiment of the present invention.
  • FIG. 14 is a diagram showing the parameter update of the signal classification decision module in an embodiment of the present invention.

Detailed description

  • In the embodiments of the present invention, the update rate of the background noise is determined according to the spectrum distribution parameter of the current sound signal and the background noise spectrum distribution parameter, and the noise parameter is updated according to the update rate. The useful signal and the non-useful signal in the received sound signal are then determined according to the updated noise parameter, so that the noise parameter is more accurate at the time of the decision and the accuracy of sound signal classification is improved.
  • A sound signal classification detection method is provided first. The method includes:
  • Step 201 Receive a sound signal, and determine an update rate of the background noise according to the background noise spectrum distribution parameter and a spectrum distribution parameter of the sound signal.
  • Step 202 Update a noise parameter according to the update rate, and classify the sound signal according to the subband energy parameter and the updated noise parameter.
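Steps 201 and 202 can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the first-order smoothing rule, the mean energy-ratio test, the `ratio_threshold` value, and the function name are all assumptions, and the update rate is simply passed in rather than derived from the spectrum distribution parameters.

```python
from typing import List, Tuple


def update_noise_and_classify(
    level: List[float],
    noise: List[float],
    update_rate: float,
    ratio_threshold: float = 2.0,  # hypothetical decision threshold
) -> Tuple[List[float], bool]:
    """Process one frame; return (updated noise estimate, useful-signal flag)."""
    # Step 202, first half: move each per-band noise estimate toward the
    # current sub-band energy by the fraction update_rate (assumed rule).
    new_noise = [n + update_rate * (e - n) for e, n in zip(level, noise)]
    # Step 202, second half: compare sub-band energy with the updated noise.
    # The mean per-band energy ratio and its threshold are assumptions.
    ratios = [e / max(n, 1e-9) for e, n in zip(level, new_noise)]
    is_useful = sum(ratios) / len(ratios) > ratio_threshold
    return new_noise, is_useful
```

A small update rate keeps the noise estimate stable when the current frame looks unlike the background noise, while a larger rate lets the estimate follow a changing noise floor.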
  • The sound signals are mainly classified into a useful signal type and a non-useful signal type. The type of the useful signal, which includes the voice signal and the music signal, may then be further determined. Depending on whether the noise estimate has converged, this determination is based either on the open-loop pitch parameter, the immittance spectral frequency (ISF) parameter, and the sub-band energy parameter, or on the ISF parameter and the sub-band energy parameter only.
  • When the signal is classified, the determined useful signal type may also be obtained, and the signal trailing (smearing) length is determined according to that type. The useful signal and the non-useful signal in the received sound signal are then further determined based on the trailing length. For example, the trailing length of a music signal can be set larger, thereby improving the sound effect of the music signal.
  • When the useful signal is determined to be a speech signal or a music signal, a signal that cannot be determined accurately can first be set to an indeterminate type. The indeterminate type is then corrected according to other parameters, and the type of the useful signal is finally determined.
  • The encoding of a non-useful signal may not require the ISF parameter. To reduce the amount of calculation in the classification process and improve classification efficiency, the ISF parameter is therefore not calculated for a signal determined to be non-useful when the corresponding coding mode does not need it.
  • An embodiment of the present invention further provides a sound signal classification device, including a background noise parameter update module and a signal initial classification (PSC) module.
  • The background noise parameter update module is configured to determine an update rate of the background noise according to the spectrum distribution parameter of the current sound signal and the background noise spectrum distribution parameter, and to transmit the determined update rate to the PSC module.
  • The PSC module is configured to update the noise parameter according to the update rate from the background noise parameter update module, to initially classify the signal according to the sub-band energy parameter and the updated noise parameter, and to determine whether the received sound signal is of a useful signal type or a non-useful signal type.
  • The sound signal classification device may further include a signal classification decision module. The PSC module also transmits the determined signal type to the signal classification decision module, which determines the type of the useful signal based on the open-loop pitch parameter, the immittance spectral frequency (ISF) parameter, and the sub-band energy parameter, or based on the ISF parameter and the sub-band energy parameter. The type includes a voice signal and a music signal.
  • The sound signal classification device may further include a classification parameter extraction module. The PSC module transmits the determined signal type to the signal classification decision module through the classification parameter extraction module. The classification parameter extraction module is further configured to acquire the ISF parameter and the sub-band energy parameter, and optionally the open-loop pitch parameter; to process the acquired parameters into signal classification feature parameters and transmit them to the signal classification decision module; and to process the acquired parameters into the spectrum distribution parameter of the sound signal and the background noise spectrum distribution parameter and transmit these to the background noise parameter update module. The signal classification decision module then determines the type of the useful signal, which includes the voice signal and the music signal, according to the signal classification feature parameters and the signal type determined by the PSC module.
  • The PSC module may further transmit the signal-to-noise ratio of the sound signal, calculated in the process of determining the signal type, to the signal classification decision module. The signal classification decision module then also uses the signal-to-noise ratio when determining whether the useful signal is a voice signal or a music signal.
  • The sound signal classification device may further include an encoder mode and rate selection module. The signal classification decision module transmits the determined signal type to the encoder mode and rate selection module, which determines the encoding mode and rate of the sound signal according to the received signal type.
  • The sound signal classification device may further include an encoder parameter extraction module, configured to extract the ISF parameter and the sub-band energy parameter, and optionally the open-loop pitch parameter, to transmit the extracted parameters to the classification parameter extraction module, and to transmit the extracted sub-band energy parameter to the PSC module.
  • FIG. 4 is a schematic diagram of a system composition based on a specific embodiment of the present invention.
  • The system includes a sound activity detector (SAD), which divides the input digital audio signal into different classes according to the needs of the encoder. The signal can be classified as a non-useful signal, voice, or music, which gives the encoder a basis for coding mode selection and rate selection.
  • The SAD module internally includes a background noise estimation control module, a signal initial classification module, a classification parameter extraction module, and a signal classification decision module.
  • As a signal classifier used inside the encoder, the SAD makes full use of the encoder's own parameters in order to reduce resource consumption and computational complexity. The sub-band energy parameter and the encoder parameters are therefore calculated by the encoder parameter extraction module in the encoder and provided to the SAD module.
  • The final output of the SAD module is a signal decision type (non-useful signal, speech, or music), which is provided to the encoder mode and rate selection module for selecting the encoder mode and rate.
  • The sub-band energy parameter can be calculated by filter-bank filtering. The number of sub-bands is determined by the computational complexity and classification accuracy requirements; in the present embodiment, 12 sub-bands are used in the following description.
  • The process by which the encoder parameter extraction module calculates the parameters required by the SAD module may be as shown in FIG. 5 or FIG. 6.
  • Step 501: The encoder parameter extraction module first calculates the sub-band energy parameter.
  • Step 502: The encoder parameter extraction module determines, according to the signal initial judgment result (Vad_flag) from the PSC module, whether an immittance spectral frequency (ISF) calculation is required. If so, step 503 is performed; otherwise, step 504 is performed.
  • Determining whether to perform the ISF calculation in this step includes the following. If the current frame is a non-useful signal, the decision follows the mechanism of the encoder: if the encoder requires the ISF parameter to encode the non-useful signal, the ISF calculation is performed; if not, the encoder parameter extraction module ends. If the current frame is a useful signal, the ISF calculation is performed. The ISF parameter of a useful signal is required by most coding modes, so calculating it does not introduce redundant complexity into the encoder.
  • For the ISF parameter calculation, refer to the documentation of the respective encoders; it is not described here.
  • Step 503: The encoder parameter extraction module calculates the ISF parameter, and then performs step 504.
  • Step 504: The encoder parameter extraction module calculates the open-loop pitch parameter.
  • The sub-band energy parameter is provided to the PSC module and the classification parameter extraction module, and the remaining parameters are provided to the classification parameter extraction module in the SAD.
  • In FIG. 6, steps 601 to 603 are substantially the same as steps 501 to 503 in FIG. 5. In step 604, it is determined whether the noise parameter has been initialized, that is, whether the noise estimate has converged. If it has not, the open-loop pitch parameter is calculated in step 605; otherwise the open-loop pitch parameter is not calculated.
  • The open-loop pitch parameter is redundant for some coding algorithms, such as the TCX coding mode. To reduce the computational complexity, once the noise estimate has converged it is essentially determined that the coding mode corresponding to the signal does not need the open-loop pitch parameter, so the open-loop pitch parameter is no longer calculated.
  • Before the noise estimate converges, the open-loop pitch parameter needs to be calculated, but this calculation belongs to the startup phase and its complexity can be ignored.
  • For the calculation of the open-loop pitch parameter, refer to ACELP-based coding; it is not described here.
  • The noise estimate may be considered to have converged when the number of consecutively determined noise frames exceeds a noise convergence threshold (THR1). In one example of this embodiment, the value of THR1 is 20.
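The convergence test can be sketched with a simple counter. The class and function names are hypothetical, and two points are assumptions rather than statements of the embodiment: that the converged state latches once reached, and that the open-loop pitch parameter is needed only before convergence (the startup phase described above).

```python
THR1 = 20  # noise convergence threshold from the example in the text


class NoiseConvergenceTracker:
    """Counts consecutive noise frames and reports convergence."""

    def __init__(self, threshold: int = THR1) -> None:
        self.threshold = threshold
        self.consecutive_noise_frames = 0
        self.converged = False

    def update(self, vad_flag: int) -> bool:
        """Feed one frame's Vad_flag (0 = non-useful, 1 = useful)."""
        if vad_flag == 0:
            self.consecutive_noise_frames += 1
        else:
            self.consecutive_noise_frames = 0
        # Convergence requires the run length to exceed the threshold.
        # Keeping the flag set afterwards is an assumption.
        if self.consecutive_noise_frames > self.threshold:
            self.converged = True
        return self.converged


def needs_open_loop_pitch(tracker: NoiseConvergenceTracker) -> bool:
    # Assumed reading: pitch is computed only during the startup phase.
    return not tracker.converged
```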
  • the extracted sub-band energy parameter is: level[i].
  • i denotes the member index of the vector. In this embodiment it takes the values 1 to 12, corresponding to the bands 0-200 Hz, 200-400 Hz, 400-600 Hz, 600-800 Hz, 800-1200 Hz, 1200-1600 Hz, 1600-2000 Hz, 2000-2400 Hz, 2400-3200 Hz, 3200-4000 Hz, 4000-4800 Hz, and 4800-6400 Hz.
  • The extracted ISF parameters are Isf_n(i), where n represents the frame index and i = 1 ... 16 represents the member index in the vector.
  • the extracted open loop pitch parameters include:
  • The signal initial classification module can be implemented using various existing VAD algorithm schemes. It includes a background noise estimation sub-module, a signal-to-noise ratio calculation sub-module, a useful signal estimation sub-module, a decision threshold adjustment sub-module, a comparison sub-module, and a useful signal trailing protection sub-module.
  • The signal-to-noise ratio calculation sub-module calculates the signal-to-noise ratio according to the background noise sub-band energy estimation parameter and the sub-band energy parameter. The calculated signal-to-noise ratio (snr) is used inside the PSC module and is also transmitted to the signal classification decision module, so that the signal classification decision module distinguishes voice from music more accurately under low SNR conditions.
  • The present embodiment improves on the VAD as follows. First, the calculation of the background noise parameter is controlled by the update rate acc provided by the background noise parameter update module.
  • The background noise estimation sub-module receives the update rate from the background noise parameter update module, updates the noise parameter, and transmits the background noise sub-band energy estimation parameter calculated from the updated noise parameter to the signal-to-noise ratio calculation sub-module.
  • The update rate can take four levels: acc1, acc2, acc3, and acc4.
  • The update_up and update_down parameters are determined accordingly; they correspond to the upward and downward update rates of the background noise, respectively.
  • The useful signal is generally protected from being cut off as noise by a trailing (smearing) period. The trailing length is a compromise between protecting the signal and improving the transmission efficiency.
  • The trailing length can be a constant obtained through learning.
  • A multi-rate encoder, however, is oriented to audio signals that include music. Such signals often have long low-energy tails that a conventional VAD has difficulty detecting, so a longer trailing period is required to protect them.
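The type-dependent trailing protection can be sketched as a countdown that keeps a frame marked as useful for some frames after the raw detector falls silent. The frame counts in `TRAILING_FRAMES` and the class name are hypothetical; the text only states that music should get a longer trailing period than a conventional constant.

```python
# Hypothetical trailing lengths in frames; the source gives no values.
TRAILING_FRAMES = {"speech": 5, "music": 20}


class TrailingProtector:
    """Extends the useful-signal decision after the raw detector drops."""

    def __init__(self) -> None:
        self.remaining = 0

    def apply(self, raw_useful: bool, signal_type: str) -> bool:
        if raw_useful:
            # Reload the counter with the length for the detected type.
            self.remaining = TRAILING_FRAMES[signal_type]
            return True
        if self.remaining > 0:
            # Still inside the trailing period: keep the frame as useful.
            self.remaining -= 1
            return True
        return False
```

With these example values, a low-energy music tail stays protected four times longer than a speech tail before frames are handed to the noise coding path.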
  • The classification parameter extraction module calculates the parameters required by the signal classification decision module and the background noise parameter update module, according to the Vad_flag parameter determined by the signal initial classification module and the sub-band energy parameter, the ISF parameter, and the open-loop pitch parameter provided by the encoder parameter extraction module. It provides the sub-band energy parameter, the ISF parameter, the open-loop pitch parameter, and the calculated parameters to the signal classification decision module and the background noise parameter update module.
  • The parameters calculated by the classification parameter extraction module include:
  • In these definitions, the indicator of a condition A is 1 when A is true and 0 when A is false.
  • Sublevel_high_energy = level[10] + level[11];
  • Sublevel_low_energy = level[0] + level[1] + level[2] + level[3] + level[4] + level[5] + level[6] + level[7] + level[8] + level[9];
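The two sums above translate directly into code. The sketch below uses the zero-based indexing of the formulas and the 12 band edges listed for this embodiment; the function name is hypothetical.

```python
from typing import List, Tuple

# Band edges in Hz for the 12 sub-bands of this embodiment.
BAND_EDGES_HZ = [0, 200, 400, 600, 800, 1200, 1600,
                 2000, 2400, 3200, 4000, 4800, 6400]


def split_band_energy(level: List[float]) -> Tuple[float, float]:
    """Return (Sublevel_low_energy, Sublevel_high_energy) for one frame."""
    if len(level) != len(BAND_EDGES_HZ) - 1:
        raise ValueError("expected 12 sub-band energies")
    low = sum(level[0:10])         # level[0] .. level[9], 0-4000 Hz
    high = level[10] + level[11]   # 4000-6400 Hz
    return low, high
```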
  • The short-term average of the ISF distance (Isf_meanSD) is the average of the distance Isf_SD over five adjacent frames.
  • level_meanSD represents the average of the sub-band energy standard deviation (level_SD) over two adjacent frames.
  • The level_SD parameter is calculated in the same way as Isf_SD.
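The short-term averaging can be sketched with a sliding window. The definition of Isf_SD did not survive in this text, so the sketch assumes it is the sum of absolute differences between the ISF vectors of consecutive frames; that definition and the class name are assumptions, and only the five-frame window comes from the text.

```python
from collections import deque
from typing import Deque, List, Optional


class IsfMeanSD:
    """Running average of an ISF distance over the last `window` frames."""

    def __init__(self, window: int = 5) -> None:
        self.prev: Optional[List[float]] = None
        self.history: Deque[float] = deque(maxlen=window)

    def update(self, isf: List[float]) -> float:
        if self.prev is not None:
            # Assumed Isf_SD: sum of absolute differences between the
            # current and the previous frame's ISF vectors.
            isf_sd = sum(abs(a - b) for a, b in zip(isf, self.prev))
            self.history.append(isf_sd)
        self.prev = list(isf)
        if not self.history:
            return 0.0
        return sum(self.history) / len(self.history)
```

The same structure with a window of two frames would give level_meanSD from level_SD.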
  • the parameters provided to the background noise update module include: zcr, ra, f-flux, and 1_3 «.
  • The parameters provided to the signal classification decision module include pitch, meangain, Isf_meanSD, and level_meanSD.
  • The signal classification decision module uses the snr and Vad_flag from the signal initial classification (PSC) module, together with the sub-band energy parameter, pitch, meangain, Isf_meanSD, and level_meanSD from the classification parameter extraction module, to finally classify the signal as a non-useful signal (NOISE), a speech signal (SPEECH), or a music signal (MUSIC).
  • The signal classification decision module can include a parameter update sub-module and a decision sub-module. The parameter update sub-module is configured to update the thresholds used in the signal classification decision process according to the signal-to-noise ratio, and to provide the updated thresholds to the decision sub-module. The decision sub-module receives the sound signal type from the PSC module and, for a useful signal, determines its type based on the open-loop pitch parameter, the ISF parameter, the sub-band energy parameter, and the updated thresholds, or based on the ISF parameter, the sub-band energy parameter, and the updated thresholds. It then transmits the determined useful signal type to the encoder mode and rate selection module.
  • Determining whether the useful signal is a speech signal or a music signal comprises the following. The speech flag and the music flag are first set to 0. The signal is then preliminarily determined to be of the voice type, the music type, or the indeterminate type according to the pitch flag, the long-term signal correlation value, the short-term average of the ISF distance, and the average of the sub-band energy standard deviation, and the speech flag or the music flag is modified according to the preliminarily determined type. Finally, the preliminarily determined type is corrected according to the sub-band energy, the long-term signal correlation value, the average of the sub-band energy standard deviation, speech_flag, music_flag, whether the number of consecutive frames with a pitch value of 1 exceeds the preset trailing frame number threshold, the number of consecutive music frames, the number of consecutive speech frames, and the type of the previous frame. This determines the type of the useful signal, which is a speech signal or a music signal.
  • The embodiment provides a flag trailing mechanism for the parameters, covering pitch_flag, level_meanSD_high_flag, ISF_meanSD_high_flag, ISF_meanSD_low_flag, level_meanSD_low_flag, and meangain_flag.
  • The length of the trailing period in FIG. 8 is determined according to the trailing parameter flag value.
  • Two schemes for determining the trailing parameter flag value are provided:
  • In the first scheme, when the condition for a parameter holds, the corresponding parameter trailing counter is incremented by one; otherwise, the counter is set to 0. Different parameter trailing flags are set according to the value of the parameter trailing counter.
  • The specific flag values set from the parameter counter are determined according to the actual situation and are not described here.
  • In the second scheme, the trailing length is controlled according to the error rate (ER) of each internal node of the decision tree corresponding to the training parameter: a parameter with a small error rate gets a short trailing length, and a parameter with a large error rate gets a long one.
  • The initial voice decision is shown in FIG. 9. In step 901, the voice flag is set to 0. In step 902, it is determined whether Isf_meanSD is greater than a preset first ISF voice threshold (for example, 1500). If so, the voice flag is set to 1; otherwise,
  • in step 903, it is determined whether the pitch value is 1 and the pitch delay value t_top_mean obtained by the open-loop pitch search is smaller than the pitch voice threshold (for example, 40). If so, the voice flag is set to 1; otherwise,
  • in step 904, it is determined whether the number of consecutive frames whose pitch value is 1 exceeds a preset trailing frame number threshold (for example, 2 frames). If so, the voice flag is set to 1; otherwise,
  • in step 905, it is determined whether meangain is greater than a preset long-term correlation speech threshold (for example, 8000). If so, the voice flag is set to 1; otherwise, in step 906, it is determined whether either level_meanSD_high_flag or ISF_meanSD_high_flag has a value of 1. If so, the voice flag is set to 1; otherwise, the voice flag is not changed.
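Steps 901 to 906 form a cascade in which the first satisfied test sets the voice flag. A minimal sketch, using the example thresholds quoted in the text; the function and argument names are hypothetical, and step 906 is read here as "either flag equals 1".

```python
ISF_VOICE_THR = 1500       # first ISF voice threshold (example value)
PITCH_VOICE_THR = 40       # pitch voice threshold (example value)
TRAILING_FRAMES_THR = 2    # trailing frame number threshold (example value)
MEANGAIN_VOICE_THR = 8000  # long-term correlation threshold (example value)


def voice_flag(
    isf_mean_sd: float,
    pitch: int,
    t_top_mean: float,
    pitch_run: int,
    meangain: float,
    level_mean_sd_high_flag: int,
    isf_mean_sd_high_flag: int,
) -> int:
    """Return the voice flag (0 or 1) for one frame."""
    if isf_mean_sd > ISF_VOICE_THR:                  # step 902
        return 1
    if pitch == 1 and t_top_mean < PITCH_VOICE_THR:  # step 903
        return 1
    if pitch_run > TRAILING_FRAMES_THR:              # step 904
        return 1
    if meangain > MEANGAIN_VOICE_THR:                # step 905
        return 1
    if level_mean_sd_high_flag == 1 or isf_mean_sd_high_flag == 1:  # step 906
        return 1
    return 0  # flag keeps its initial value from step 901
```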
  • In step 1101, it is determined whether the instantaneous sub-band energy is less than the sub-band energy threshold (for example, 5000). If so, step 1102 is performed; otherwise, the signal is determined to be of the indeterminate class (UNCERTAIN);
  • in step 1103, it is determined whether the value of ISF_meanSD is greater than a preset second ISF voice threshold (for example, 2000). If so, the signal is determined to be a voice signal; otherwise, in step 1104, it is determined whether level_energy is less than 10000 and the number of preceding frames judged to be noise exceeds five. If so, the current signal class is set to the indeterminate class, which reduces the misjudgment of noise as music; otherwise,
  • in step 1105, it is determined whether the music flag and the voice flag are both 1. If so, the current signal class is determined to be the indeterminate class; otherwise,
  • in step 1106, it is determined whether the music flag and the voice flag are both 0. If so, the current signal class is determined to be the indeterminate class; otherwise,
  • in step 1107, it is determined whether the music flag is 0 and the voice flag is 1. If so, the current signal class is determined to be the voice class; otherwise,
  • in step 1108, since the music flag is 1 and the voice flag is 0, the current signal class is determined to be the music class.
  • step 1110 is performed: whether the number of consecutive music frames is greater than 3, and ISF_meanSD is smaller than the music threshold of the guided spectrum, and if yes, the signal is determined as a music signal; Otherwise, the signal is determined to be a speech signal.
  • In step 1201, it is determined whether level_energy is smaller than the sub-band energy indeterminate-class threshold (for example, 5000).
  • In step 1202, it is determined whether the number of consecutive music frames is greater than 1 and ISF_meanSD is smaller than the ISF music threshold; if so, the signal is determined to be music; otherwise:
  • in steps 1211 to 1216 in FIG. 12, if the voice trailing flag is 1 and the music trailing flag is 0, the current signal class is set to the voice class; if the music trailing flag is 1 and the voice trailing flag is 0, the current signal class is set to the music class; if the voice trailing flag and the music trailing flag are both 1 or both 0, the signal class is set to the indeterminate class. Then, if the preceding run of music exceeds 20 frames, the signal is determined to be of the music class, and if the preceding run of speech exceeds 20 frames, the signal is determined to be of the speech class.
  • The final correction of the useful signal type is performed in FIG. 13, where the class is corrected further according to the current context.
  • the current context is music, and the persistence is strong.
  • the music signal type can be determined by forced correction according to the value of ISF_meanSD.
  • In step 1302, if the current context is speech and the persistence is strong (more than 3 seconds, that is, the current number of consecutive speech frames exceeds 150), a forced correction may be performed according to the value of ISF_meanSD to determine the speech signal type. After that, if the signal class is still indeterminate, then in step 1303 the signal class is corrected according to the previous context, that is, the currently indeterminate signal is assigned to the previous signal class.
  • The threshold values are updated according to the signal-to-noise ratio output by the initial signal classification module.
  • The example thresholds listed in the embodiment are the values learned at a signal-to-noise ratio of 20 dB.
  • The background noise parameter update module uses some of the spectral distribution parameters calculated during classification parameter extraction in the SAD to control the update rate of the background noise. In a real environment the energy level of the background noise can rise suddenly; the signal is then continuously judged to be a useful signal, and the background noise estimate can no longer be updated. The background noise parameter update module solves this problem.
  • The background noise parameter update module calculates the relevant spectral distribution parameter vector from the parameters supplied by the classification parameter extraction module. The vector includes the following elements: the short-term average of the zero-crossing rate zcr
  • $\mathit{zcr\_mean}_m = \mathit{ALPHA} \cdot \mathit{zcr\_mean}_{m-1} + (1 - \mathit{ALPHA}) \cdot \mathit{zcr}_m$
  • m represents the frame index
  • This embodiment exploits the fact that the spectral characteristics of background noise are relatively stable; the members of the spectral distribution parameter vector are not limited to the four listed above.
  • The update rate of the current background noise is controlled by the difference between the current spectral distribution parameter and the estimate of the background noise spectral distribution parameter. This difference can be measured with, for example, the Euclidean distance or the Manhattan distance.
  • One example of the invention uses the Manhattan distance (a distance measure similar to the Euclidean distance), namely:
  • where the two vectors are, respectively, the spectral distribution parameter vector of the current signal and the estimate of the background noise spectral distribution parameter vector.
  • When the distance is less than TH1, the module outputs the update rate acc1, which is the fastest update rate; otherwise, when the distance is less than TH2, it outputs the update rate acc2; otherwise, when the distance is less than TH3, it outputs the update rate acc3; otherwise, it outputs the update rate acc4.
  • TH1, TH2, TH3 and TH4 are update thresholds, which are determined according to the actual environmental conditions.
  • The update rate of the background noise is determined, the noise parameter is updated according to that rate, and the signal is then initially classified according to the sub-band energy parameter and the updated noise parameter to determine the non-useful signals and useful signals in the received voice signal. This reduces the misjudgment of useful signals as noise and improves the accuracy of sound signal classification.
  • The present invention can be implemented in software together with a necessary general-purpose hardware platform, or in hardware alone, but in many cases the former is the better implementation.
  • The part of the technical solution of the present invention that is essential, or that contributes over the prior art, may be embodied as a software product stored in a storage medium and including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
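The update-rate control described in the items above (short-term averaging, a distance between the current and the background-noise spectral distribution vectors, and a threshold ladder that picks the rate) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the `ALPHA` value, the direction of the comparisons (a smaller distance giving a faster update) and every numeric value in the usage note are assumptions made for the example. The source states only that the thresholds are chosen according to the actual environment.

```python
from typing import Sequence

ALPHA = 0.9  # assumed smoothing factor; the source does not give a value


def short_term_mean(prev_mean: float, current: float, alpha: float = ALPHA) -> float:
    """Recursive average: mean_m = alpha * mean_{m-1} + (1 - alpha) * x_m."""
    return alpha * prev_mean + (1.0 - alpha) * current


def manhattan_distance(signal_vec: Sequence[float], noise_vec: Sequence[float]) -> float:
    """Sum of absolute element-wise differences between two parameter vectors."""
    if len(signal_vec) != len(noise_vec):
        raise ValueError("vectors must have the same length")
    return sum(abs(s - n) for s, n in zip(signal_vec, noise_vec))


def select_update_rate(distance: float,
                       thresholds: Sequence[float],
                       rates: Sequence[float]) -> float:
    """Return the rate of the first threshold the distance falls below.

    `thresholds` is assumed to be ascending and `rates` has one more entry
    than `thresholds`; the last rate is the fallback (slowest) rate.
    """
    if len(rates) != len(thresholds) + 1:
        raise ValueError("need exactly one more rate than thresholds")
    for threshold, rate in zip(thresholds, rates):
        if distance < threshold:
            return rate
    return rates[-1]
```

For example, with hypothetical thresholds `[1.0, 2.0, 3.0]` and rates `[0.5, 0.2, 0.1, 0.01]`, a frame whose parameter vector lies at distance 0.5 from the noise estimate gets the fastest rate, and a frame at distance 10 gets the slowest.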

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method for classifying a sound signal includes: receiving the sound signal; determining an update rate of the background noise according to a spectral distribution parameter of the background noise and a spectral distribution parameter of the sound signal; updating a noise parameter according to the update rate; and classifying the sound signal according to a sub-band energy parameter and the updated noise parameter. A device for classifying a sound signal applies the above method.

Description

Sound Signal Classification Method and Device
Technical Field
The present invention relates to the field of speech coding technologies, and in particular to a sound signal classification method and a sound signal classification device.
Background
In voice communication, only about 40% of the signal contains speech; the rest is silence or background noise. To save transmission bandwidth, speech coding uses voice activity detection (VAD) so that the encoder can code background noise and active speech at different rates: background noise at a lower rate and active speech at a higher rate. This lowers the average bit rate and has greatly promoted the development of variable-rate speech coding.
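The effect on the average bit rate can be illustrated with a small calculation. Only the 40% speech share comes from the text above; the two rates used in the usage note are hypothetical values chosen for the example and are not taken from any codec.

```python
def average_bit_rate(speech_fraction: float,
                     speech_rate_kbps: float,
                     noise_rate_kbps: float) -> float:
    """Time-weighted average rate when speech and noise use different rates."""
    if not 0.0 <= speech_fraction <= 1.0:
        raise ValueError("speech_fraction must lie in [0, 1]")
    return (speech_fraction * speech_rate_kbps
            + (1.0 - speech_fraction) * noise_rate_kbps)
```

With an assumed 12 kbit/s for active speech and 2 kbit/s for background noise, `average_bit_rate(0.4, 12.0, 2.0)` gives about 6 kbit/s, half of what coding everything at the speech rate would cost.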
Existing signal detectors (VADs) were developed for speech signals and divide the input audio signal into only two classes: noise and non-noise. Newer encoders such as AMR-WB+ and SMV add music detection as a correction and supplement to the VAD decision. An important feature of the AMR-WB+ encoder is that, after VAD detection, it encodes in different modes depending on whether the input audio signal is speech or music, so as to minimize the bit rate while preserving coding quality.
The two coding modes in AMR-WB+ correspond to two core coding algorithms: Algebraic Code Excited Linear Prediction (ACELP) and Transform Coded Excitation (TCX). ACELP is built on a speech production model and exploits the characteristics of speech, so it codes speech signals very efficiently, and the technique is mature; extending a general audio encoder with ACELP therefore greatly improves its speech coding quality. Similarly, extending a low-bit-rate speech encoder with TCX improves its coding quality for wideband music.
The ACELP and TCX mode selection algorithms of the AMR-WB+ coding algorithm come in two kinds, according to complexity: an open-loop selection algorithm and a closed-loop selection algorithm. Closed-loop selection corresponds to high complexity and is the default option. It is an exhaustive search based on the perceptually weighted signal-to-noise ratio. This selection method is accurate, but its computational complexity is very high and its code size is large.
Open-loop selection includes the following steps:
First, in step 101, the VAD module determines whether the signal is a non-useful signal or a useful signal according to the tone flag (Tone_flag) and the sub-band energy parameter (Level[n]).
Then, in step 102, preliminary mode selection (EC) is performed.
In step 103, the mode preliminarily determined in step 102 is corrected and refined mode selection (ESC) is performed to determine the coding mode, based on the open-loop pitch parameter and the ISF parameters.
In step 104, TCXS processing is performed: when the speech coding mode has been selected fewer than three consecutive times, a small-scale closed-loop exhaustive search is performed to finally determine the coding mode, where the speech coding mode is ACELP and the music coding mode is TCX.
In the course of implementing the present invention, the inventors found that the above AMR-WB+ speech signal selection algorithm has the following disadvantages:
1. When classifying signals, the existing VAD module does not distinguish well between noise and some kinds of music signals, which lowers the accuracy of sound signal classification.
2. Calculating the open-loop pitch parameter is necessary for the ACELP coding mode but unnecessary for the TCX coding mode. In the AMR-WB+ structure, both the VAD and the open-loop mode selection algorithm use the open-loop pitch parameter, so the open-loop pitch must be calculated for every frame. For non-ACELP coding modes such as TCX this is redundant complexity that increases the computation needed for coding mode selection and lowers efficiency.
3. Although the VAD detection algorithm performs better in speech detection and noise immunity than those of other current encoders, it may misjudge the trailing part of some music signals as noise, which truncates the tail of the music and sounds unnatural.
4. The AMR-WB+ mode selection algorithm does not take into account the signal-to-noise ratio of the environment, so its ability to distinguish speech from music deteriorates further at low signal-to-noise ratios.
Summary of the Invention
In view of this, embodiments of the present invention provide a sound signal classification method and a sound signal classification device that improve the accuracy of sound signal classification and detection.
A sound signal classification detection method provided by an embodiment of the present invention includes: receiving a sound signal, and determining an update rate of the background noise according to a background noise spectral distribution parameter and a spectral distribution parameter of the sound signal; and updating a noise parameter according to the update rate, and classifying the sound signal according to a sub-band energy parameter and the updated noise parameter.
A sound signal classification device provided by an embodiment of the present invention includes a background noise parameter update module and an initial signal classification (PSC) module.
The background noise parameter update module is configured to determine the update rate of the background noise according to the background noise spectral distribution parameter and the spectral distribution parameter of the current sound signal, and to send the determined update rate.
The PSC module is configured to receive the update rate from the background noise parameter update module, update the noise parameter, classify the current sound signal according to the sub-band energy parameter and the updated noise parameter, and send the sound signal type determined by the classification.
In the embodiments of the present invention, the update rate of the background noise is determined, the noise parameter is updated according to that rate, and the signal is then initially classified according to the sub-band energy parameter and the updated noise parameter to determine the non-useful signals and useful signals in the received speech signal. This reduces the misjudgment of useful signals as noise and improves the accuracy of sound signal classification.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of open-loop selection in the AMR-WB+ coding algorithm in the prior art;
FIG. 2 is an overall flowchart of the sound signal classification detection method in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the sound signal classification device in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the system on which a specific embodiment of the present invention is based;
FIG. 5 is a flowchart of one way in which the encoder parameter extraction module calculates the parameters in a specific embodiment of the present invention;
FIG. 6 is a flowchart of another way in which the encoder parameter extraction module calculates the parameters in a specific embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the PSC module in a specific embodiment of the present invention;
FIG. 8 is a schematic diagram of the signal classification decision module determining feature parameters in a specific embodiment of the present invention;
FIG. 9 is a schematic diagram of the signal classification decision module performing the speech decision in a specific embodiment of the present invention;
FIG. 10 is a schematic diagram of the signal classification decision module performing the music decision in a specific embodiment of the present invention;
FIG. 11 is a schematic diagram of the signal classification decision module correcting the initial decision result in a specific embodiment of the present invention;
FIG. 12 is a schematic diagram of the signal classification decision module performing preliminary corrective classification of indeterminate signals in a specific embodiment of the present invention;
FIG. 13 is a schematic diagram of the signal classification decision module performing the final classification correction in a specific embodiment of the present invention;
FIG. 14 is a schematic diagram of the signal classification decision module performing the parameter update in a specific embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the accompanying drawings.
In the embodiments of the present invention, the update rate of the background noise is determined according to the spectral distribution parameter of the current sound signal and the background noise spectral distribution parameter, and the noise parameter is updated according to that rate. The useful and non-useful signals in the received speech signal are then determined using the updated noise parameter, so the noise parameter is more accurate when the determination is made, which improves the accuracy of sound signal classification.
As shown in FIG. 2, an embodiment of the present invention first provides a sound signal classification detection method, which includes:
Step 201: Receive a sound signal, and determine the update rate of the background noise according to the background noise spectral distribution parameter and the spectral distribution parameter of the sound signal.
Step 202: Update the noise parameter according to the update rate, and classify the sound signal according to the sub-band energy parameter and the updated noise parameter.
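The two steps above can be sketched as a per-frame routine. The sketch below is illustrative only: the text does not specify how the noise parameter is updated from the rate or how the sub-band energies are compared with it, so the linear interpolation, the summed energy-ratio test, the function names and the threshold are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def update_noise(noise: Sequence[float], level: Sequence[float], rate: float) -> List[float]:
    """Move each sub-band noise estimate toward the current level by `rate` (0..1).

    Assumed update rule: linear interpolation per sub-band.
    """
    return [(1.0 - rate) * n + rate * l for n, l in zip(noise, level)]


def is_useful(level: Sequence[float], noise: Sequence[float], snr_threshold: float) -> bool:
    """Assumed decision rule: sum of per-band energy ratios against a threshold."""
    snr_sum = sum(l / max(n, 1e-9) for l, n in zip(level, noise))
    return snr_sum > snr_threshold


def classify_frame(level: Sequence[float],
                   noise: Sequence[float],
                   rate: float,
                   snr_threshold: float) -> Tuple[str, List[float]]:
    """Step 202: update the noise estimate first, then classify against it."""
    updated = update_noise(noise, level, rate)
    label = "useful" if is_useful(level, updated, snr_threshold) else "non-useful"
    return label, updated
```

The point the sketch makes is the ordering: the classification uses the noise estimate after it has been updated at the rate chosen in step 201, so a sudden rise in the noise floor can be absorbed before the frame is labelled.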
In step 202, the sound signal is classified mainly into a useful signal type and a non-useful signal type. The type of the useful signal, speech or music, can then be determined further. Depending on whether the noise estimate has converged, this determination is based either on the open-loop pitch parameter, the ISF parameters and the sub-band energy parameter, or on the ISF parameters and the sub-band energy parameter only.
In addition, to prevent the tail of a music signal from being misjudged as a non-useful signal, which would degrade the sound quality, the embodiment also obtains the determined useful signal type, determines the signal trailing (hangover) length according to that type, and then uses the trailing length when determining the useful and non-useful signals in the received speech signal. The trailing length for music signals can be set longer, which improves the sound quality of music signals.
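A type-dependent trailing length can be sketched as a small state machine. This is a hypothetical illustration: the class name, the frame counts in `HANGOVER_FRAMES` and the rule that the counter is reloaded on every useful frame are assumptions for the example; the text says only that the trailing length depends on the useful signal type and can be longer for music.

```python
# Assumed trailing lengths in frames; the source gives no concrete values.
HANGOVER_FRAMES = {"speech": 5, "music": 20}


class HangoverTracker:
    """Keeps frames marked useful for a while after the raw decision drops."""

    def __init__(self) -> None:
        self.remaining = 0

    def update(self, raw_useful: bool, last_type: str) -> bool:
        """Return the final useful/non-useful decision for the current frame."""
        if raw_useful:
            # Reload the counter according to the most recent useful signal type.
            self.remaining = HANGOVER_FRAMES.get(last_type, 0)
            return True
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False
```

With these example values, a fading music tail stays classified as useful for 20 further frames, while speech falls back to non-useful after 5.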
When determining whether a useful signal is speech or music, signals that cannot be determined with high confidence can first be set to an indeterminate type; the indeterminate type is then corrected according to other parameters to finally determine the type of the useful signal.
Not every coding mode for non-useful signals requires the ISF parameters. Therefore, to reduce computation during classification and improve classification efficiency, the ISF parameters are not calculated for a signal determined to be non-useful if its coding mode does not require them.
As shown in FIG. 3, an embodiment of the present invention further provides a sound signal classification device that includes a background noise parameter update module and an initial signal classification (PSC) module. The background noise parameter update module is configured to determine the update rate of the background noise according to the spectral distribution parameter of the current sound signal and the background noise spectral distribution parameter, and to send the determined update rate to the PSC module. The PSC module is configured to update the noise parameter according to the update rate from the background noise parameter update module, initially classify the signal according to the sub-band energy parameter and the updated noise parameter, and determine the received speech signal to be of a useful signal type or a non-useful signal type.
The sound signal classification device may further include a signal classification decision module. In that case the PSC module also sends the determined signal type to the signal classification decision module, and the signal classification decision module determines the type of the useful signal, speech or music, based on the open-loop pitch parameter, the ISF parameters and the sub-band energy parameter, or based on the ISF parameters and the sub-band energy parameter.
The sound signal classification device may further include a classification parameter extraction module. In that case the PSC module sends the determined signal type to the signal classification decision module through the classification parameter extraction module. The classification parameter extraction module is further configured to obtain the ISF parameters and the sub-band energy parameter, and optionally the open-loop pitch parameter; to process the obtained parameters into signal classification feature parameters and send them to the signal classification decision module; and to process the obtained parameters into the spectral distribution parameter of the sound signal and the background noise spectral distribution parameter and send these to the background noise parameter update module. The signal classification decision module then determines the type of the useful signal, speech or music, according to the signal classification feature parameters and the signal type determined by the PSC module.
The PSC module may further be configured to send the signal-to-noise ratio of the sound signal, calculated while determining the signal type, to the signal classification decision module; the signal classification decision module then also uses this signal-to-noise ratio when determining whether the useful signal is speech or music.
The sound signal classification device may further include an encoder mode and rate selection module. The signal classification decision module sends the determined signal type to the encoder mode and rate selection module, which determines the coding mode and rate of the sound signal according to the received signal type.
The sound signal classification device may further include an encoder parameter extraction module configured to extract the ISF parameters and the sub-band energy parameter, and optionally the open-loop pitch parameter, to send the extracted parameters to the classification parameter extraction module, and to send the extracted sub-band energy parameter to the PSC module.
The sound signal classification detection method and the sound signal classification device provided in the embodiments of the present invention are described below through a specific embodiment.
FIG. 4 is a schematic diagram of the system on which the specific embodiment is based. It includes a sound signal classification detector (sound activity detector, SAD), which divides the input digital audio signal into classes as required by the encoder: non-useful signal, speech and music. This gives the encoder a basis for coding mode selection and rate selection.
As FIG. 4 shows, the SAD module contains four submodules: a background noise estimation control module, an initial signal classification module, a classification parameter extraction module and a signal classification decision module. As a signal classifier used inside the encoder, the SAD makes full use of the encoder's own parameters to reduce resource usage and computational complexity; the encoder parameter extraction module in the encoder therefore calculates the sub-band energy parameter and the encoder parameters and provides them to the SAD module. The final output of the SAD module is the signal decision type (non-useful signal, speech or music), which is provided to the encoder mode and rate selection module for selecting the encoder mode and rate.
The encoder modules related to the SAD, the submodules of the SAD, and the interaction between the modules are described in detail below.
The encoder parameter extraction module in the encoder calculates the sub-band energy parameter and the encoder parameters and provides them to the SAD module. The sub-band energy parameter can be calculated by filter bank filtering; the number of sub-bands is chosen according to the computational complexity and classification accuracy requirements. This embodiment is described using 12 sub-bands.
In this embodiment, the encoder parameter extraction module can calculate the parameters required by the SAD module as shown in FIG. 5 or FIG. 6.
The flow shown in FIG. 5 includes the following steps:
Step 501: The encoder parameter extraction module first calculates the sub-band energy parameter.
Step 502: The encoder parameter extraction module decides, according to the initial signal decision result (Vad_flag) from the PSC module, whether the ISF calculation is needed. If so, step 503 is performed; otherwise, step 504 is performed.
In this step, the decision is made as follows. If the current frame is a non-useful signal, the decision depends on the encoder: if the encoder needs the ISF parameters to code non-useful signals, the ISF calculation is performed; if not, the encoder parameter extraction module finishes. If the current frame is a useful signal, the ISF calculation is performed. Most coding modes need the ISF parameters for useful signals, so calculating them adds no redundant complexity to the encoder. For how to calculate the ISF parameters, refer to the documentation of the relevant encoders; details are not given here.
Step 503: The encoder parameter extraction module calculates the ISF parameters, and then step 504 is performed.
Step 504: The encoder parameter extraction module calculates the open-loop pitch parameter.
The sub-band energy parameter calculated by the flow in FIG. 5 is provided to the PSC module and the classification parameter extraction module in the SAD; the other parameters are provided to the classification parameter extraction module in the SAD.
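The FIG. 5 flow can be sketched as follows. The sketch follows the statement of step 502 (skip step 503 and go to step 504 when the ISF calculation is not needed). The function names, the boolean `is_useful` standing in for Vad_flag, the flag `noise_mode_uses_isf` and the three callables that do the actual signal processing are all hypothetical placeholders, not part of the source.

```python
from typing import Callable, Dict, Sequence

Frame = Sequence[float]


def needs_isf(is_useful: bool, noise_mode_uses_isf: bool) -> bool:
    """Step 502: useful frames always need ISF; noise frames only if the
    encoder's noise coding mode uses it."""
    return is_useful or noise_mode_uses_isf


def extract_parameters(frame: Frame,
                       is_useful: bool,
                       noise_mode_uses_isf: bool,
                       compute_level: Callable[[Frame], object],
                       compute_isf: Callable[[Frame], object],
                       compute_pitch: Callable[[Frame], object]) -> Dict[str, object]:
    """Run steps 501 to 504 for one frame and collect the results."""
    params: Dict[str, object] = {"level": compute_level(frame)}   # step 501
    if needs_isf(is_useful, noise_mode_uses_isf):                 # step 502
        params["isf"] = compute_isf(frame)                        # step 503
    params["pitch"] = compute_pitch(frame)                        # step 504
    return params
```

Passing the three calculations in as callables keeps the sketch independent of any particular codec; in a real encoder they would be the encoder's own sub-band energy, ISF and open-loop pitch routines.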
The flow shown in FIG. 6 adds to the flow of FIG. 5 a step that decides, according to whether the initial noise has converged, whether to calculate the open-loop pitch parameter. Steps 601 to 603 are essentially the same as steps 501 to 503 in FIG. 5. In step 604, it is determined whether the initial noise parameter, that is, the noise estimate, has converged; if it has not yet converged, the open-loop pitch parameter is calculated in step 605; otherwise, the open-loop pitch parameter is not calculated.
由于开环基音参数对于有的编码模式, 如 TCX编码模式, 属于 冗余的计算, 为降低计算复杂度, 在噪声估计收敛之后, 基本可以确 定信号对应的编码模式不需要计算开环基音参数,因此就不再计算开 环基音参数。  Since the open-loop pitch parameter is a redundant coding algorithm, such as the TCX coding mode, in order to reduce the computational complexity, after the noise estimation converges, it is basically determined that the coding mode corresponding to the signal does not need to calculate the open-loop pitch parameter. Therefore, the open loop pitch parameters are no longer calculated.
在噪声估计收敛之前, 为确保噪声估计能够收敛及其收敛速度, 需要计算开环基音参数,但这属于启动阶段的计算, 可以忽略其复杂 度。 开环基音参数计算的技术方案可以参考基于 ACELP的编码, 在 此不赘述。判断噪声估计是否收敛的依据可以是连续判决为噪声帧的 次数超过门限噪声收敛门限(THR1 ), 本实施例的一个示例中 THR1 值取 20。  Before the noise estimation converges, in order to ensure that the noise estimate can converge and its convergence speed, the open-loop pitch parameters need to be calculated, but this is the calculation of the startup phase, and its complexity can be ignored. The technical solution for calculating the open-loop pitch parameters can refer to the ACELP-based coding, and will not be described here. The basis for determining whether the noise estimate converges may be that the number of consecutively determined noise frames exceeds a threshold noise convergence threshold (THR1). In one example of this embodiment, the value of THR1 is 20.
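The convergence test described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name, the method names and the choice to latch the converged state once it is reached are hypothetical and do not come from the patent; only the example threshold THR1 = 20 and the "consecutive noise frames" criterion are taken from the text.

```python
from dataclasses import dataclass

THR1 = 20  # example noise convergence threshold given in the text


@dataclass
class NoiseConvergenceTracker:
    """Hypothetical helper: tracks whether the noise estimate has converged."""
    thr1: int = THR1
    consecutive_noise_frames: int = 0
    converged: bool = False

    def update(self, frame_is_noise: bool) -> bool:
        # Count frames consecutively judged as noise; any other frame resets the run.
        if frame_is_noise:
            self.consecutive_noise_frames += 1
        else:
            self.consecutive_noise_frames = 0
        # Assumption: once convergence is reached, the state stays latched.
        if self.consecutive_noise_frames > self.thr1:
            self.converged = True
        return self.converged

    def need_open_loop_pitch(self) -> bool:
        # Open-loop pitch is computed only during the start-up phase,
        # that is, while the noise estimate has not yet converged.
        return not self.converged
```

With this sketch, a caller would invoke `update` once per frame and skip the open-loop pitch search as soon as `need_open_loop_pitch()` returns `False`.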
The extracted sub-band energy parameter is level[i], where i is the member index of the vector. In this embodiment i runs from 1 to 12, corresponding to the sub-bands 0-200 Hz, 200-400 Hz, 400-600 Hz, 600-800 Hz, 800-1200 Hz, 1200-1600 Hz, 1600-2000 Hz, 2000-2400 Hz, 2400-3200 Hz, 3200-4000 Hz, 4000-4800 Hz and 4800-6400 Hz.
The extracted ISF parameters form a vector indexed by n and i, where n is the frame index and i = 1 ... 16 is the member index within the vector.
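The twelve sub-bands listed above can be captured as a table of band edges. The sketch below is illustrative only: the constant and function names are hypothetical, and the tenth band is taken as 3200-4000 Hz, which is what the neighbouring bands imply.

```python
import bisect

# Band edges in Hz; sub-band i (1-based) spans EDGES[i - 1] .. EDGES[i].
SUBBAND_EDGES_HZ = [0, 200, 400, 600, 800, 1200, 1600,
                    2000, 2400, 3200, 4000, 4800, 6400]


def subband_range(i):
    """Return (low_hz, high_hz) for the 1-based sub-band index i."""
    if not 1 <= i <= len(SUBBAND_EDGES_HZ) - 1:
        raise ValueError("sub-band index must be in 1..12")
    return SUBBAND_EDGES_HZ[i - 1], SUBBAND_EDGES_HZ[i]


def subband_index(freq_hz):
    """Return the 1-based sub-band containing freq_hz (lower edge inclusive)."""
    if not 0 <= freq_hz < SUBBAND_EDGES_HZ[-1]:
        raise ValueError("frequency outside 0..6400 Hz")
    return bisect.bisect_right(SUBBAND_EDGES_HZ, freq_hz)
```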
The extracted open-loop pitch parameters include the open-loop pitch gain (ol_gain), the open-loop pitch lag (ol_lag) and the tone flag (tone_flag). If the value of ol_gain is greater than the tone threshold (TONE_THR), tone_flag is set to 1.
The signal initial classification module (PSC) can be implemented with any of the existing VAD algorithm schemes. It comprises a background noise estimation sub-module, a signal-to-noise ratio calculation sub-module, a useful signal estimation sub-module, a decision threshold adjustment sub-module, a comparison sub-module and a hangover sub-module that protects the useful signal. In this embodiment, as shown in Figure 7, the implementation of the PSC module may differ from existing VAD algorithm modules in the following three respects:
I. The signal-to-noise ratio calculation sub-module computes the signal-to-noise ratio from the background noise sub-band energy estimate and the sub-band energy parameters. Besides being used inside the PSC module, the computed signal-to-noise ratio parameter (snr) is also passed to the signal classification decision module, so that this module distinguishes speech from music more accurately under low signal-to-noise conditions.
II. Because existing VADs do not distinguish well between noise and certain kinds of music, this embodiment improves the VAD as follows. First, the computation of the background noise parameters is controlled by the update rate acc provided by the background noise parameter update module. The background noise estimation sub-module receives the update rate from the background noise parameter update module, updates the noise parameters, and passes the background noise sub-band energy estimate computed from the updated noise parameters to the signal-to-noise ratio calculation sub-module. The computation of the update rate is described later with the background noise parameter update module. In one example of this embodiment the update rate takes one of four levels: acc1, acc2, acc3 and acc4. For each update rate, a different upward update parameter (update_up) and downward update parameter (update_down) are determined; update_up and update_down are the rates at which the background noise estimate is updated upward and downward, respectively.
The noise parameters may then be updated with the scheme used in AMR-WB+:
if ( bckr_est_m[n] < level_m[n] )
    update = update_up
else
    update = update_down
The noise estimate is then updated as:
$bckr\_est_{m+1}[n] = (1 - update) \cdot bckr\_est_m[n] + update \cdot level_m[n]$
and the noise spectral distribution parameter vector is updated as:
$\hat{p}_{m+1}[i] = (1 - update) \cdot \hat{p}_m[i] + update \cdot p_m[i]$
where:
m: frame index
n: sub-band index
i: element index of the spectral distribution parameter vector, i = 1, 2, 3, 4
bckr_est: background noise sub-band energy estimate
$\hat{p}$: estimate of the background noise spectral distribution parameter vector
$p$: spectral distribution parameter vector of the current signal
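A minimal sketch of the per-sub-band noise update follows. It rests on two assumptions that the damaged source text does not state outright: the upward rate is chosen when the current estimate lies below the current sub-band level, and the estimate is updated with the same first-order recursion as the spectral distribution vector. The function name is hypothetical.

```python
def update_noise_estimate(bckr_est, level, update_up, update_down):
    """Return the updated background noise sub-band energy estimates.

    bckr_est: current noise estimate per sub-band
    level:    current sub-band energy per sub-band
    update_up / update_down: rates selected from the update rate acc
    """
    new_est = []
    for est, lev in zip(bckr_est, level):
        # Assumption: rising noise (estimate below level) uses the upward rate.
        update = update_up if est < lev else update_down
        # Assumed first-order recursive update of the estimate.
        new_est.append((1.0 - update) * est + update * lev)
    return new_est
```

A larger update value pulls the estimate toward the current level faster, which is how the four levels acc1 to acc4 would translate into faster or slower noise tracking.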
III. Existing VADs generally use a hangover to protect useful signal from being misjudged as noise, and the hangover length is a trade-off between protecting the signal and improving transmission efficiency. For a conventional speech encoder, the hangover length can be a constant obtained by training. A multi-rate encoder, however, handles audio signals that include music. Such signals often have long low-energy tails that a conventional VAD has difficulty detecting, so a longer hangover is needed to protect them. In this embodiment, the hangover length in the useful-signal hangover protection sub-module adapts to the SAD decision result: if the signal is judged to be music (SAD_flag = MUSIC), a longer hangover is set (hang_len = HANG_LONG); if it is judged to be speech (SAD_flag = SPEECH), a shorter hangover is set (hang_len = HANG_SHORT). The setting is as follows:
if ( SAD_flag == MUSIC )
    hang_len = HANG_LONG
else if ( SAD_flag == SPEECH )
    hang_len = HANG_SHORT
else
    hang_len = 0
where:
SAD_flag: SAD decision flag
hang_len: hangover protection length
In one example of this embodiment, HANG_LONG = 100 and HANG_SHORT = 20, where the unit may be a number of frames.
The classification parameter extraction module takes the Vad_flag parameter determined by the signal initial classification module and the sub-band energy, ISF and open-loop pitch parameters supplied by the encoder parameter extraction module, and computes the parameters needed by the signal classification decision module and the background noise parameter update module. It then provides the sub-band energy parameters, ISF parameters, open-loop pitch parameters and the computed parameters to the signal classification decision module and the background noise parameter update module as appropriate. The parameters computed by the classification parameter extraction module include:
1. Pitch parameter (pitch)
The differences between consecutive open-loop pitch lags are compared. If the change in the open-loop pitch lag is smaller than a set threshold, the lag count is incremented. If the sum of the lag counts of two consecutive frames is large enough, pitch is set to 1; otherwise pitch is set to 0. For the computation of the open-loop pitch lag, see the AMR-WB+/AMR-WB standard documents.
2. Long-term signal correlation parameter (meangain)
meangain is the moving average of the tone value over three adjacent frames, where tone = 1000 * tone_flg and tone_flg is defined as in AMR-WB+.
3. Zero-crossing rate (zcr)
zcr is computed by the formula in Figure imgf000015_0001, in which the indicator term equals 1 when its condition A is true and 0 when A is false.
4. Sub-band energy time-domain fluctuation (t_flux)
t_flux is computed by the formula in Figure imgf000015_0002, in which short_mean_level_energy denotes the short-time mean energy.
5. High-to-low sub-band energy ratio (ra)
$ra = \frac{sublevel\_high\_energy}{sublevel\_low\_energy}$
where, in one example of the invention:
sublevel_high_energy = level[10] + level[11];
sublevel_low_energy = level[0] + level[1] + level[2] + level[3] + level[4] + level[5] + level[6] + level[7] + level[8] + level[9];
6. Sub-band energy frequency-domain fluctuation (f_flux)
$f\_flux = \frac{\sum_i |level_m(i) - level_m(i-1)|}{short\_mean\_level\_energy}$
7. Short-time mean of the ISF distance (isf_meanSD): the mean of the ISF distance Isf_SD over five adjacent frames, where Isf_SD is given by the formula in Figure imgf000016_0001.
8. Mean sub-band energy standard deviation (level_meanSD): the mean of the sub-band energy standard deviation (level_SD) over two adjacent frames. level_SD is computed in the same way as Isf_SD above.
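Two of the parameters in the list above, ra and f_flux, can be illustrated directly from their definitions. The sketch below is a hedged example rather than the patented implementation: the function names are hypothetical, the sub-band vector is indexed from 0 as in the sublevel_high_energy and sublevel_low_energy expressions, f_flux is read as the sum of absolute differences between adjacent sub-bands of one frame, and short_mean_level_energy is passed in because its computation is not spelled out in the text.

```python
def high_low_energy_ratio(level):
    """ra: energy of sub-bands 10..11 over energy of sub-bands 0..9 (0-based)."""
    if len(level) != 12:
        raise ValueError("expected 12 sub-band energies")
    sublevel_high_energy = level[10] + level[11]
    sublevel_low_energy = sum(level[0:10])
    if sublevel_low_energy == 0:
        raise ValueError("low sub-band energy is zero")
    return sublevel_high_energy / sublevel_low_energy


def frequency_flux(level, short_mean_level_energy):
    """f_flux: summed |level[i] - level[i-1]| over adjacent sub-bands,
    normalised by the short-time mean energy (supplied by the caller)."""
    if short_mean_level_energy == 0:
        raise ValueError("short-time mean energy is zero")
    total = sum(abs(level[i] - level[i - 1]) for i in range(1, len(level)))
    return total / short_mean_level_energy
```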
Of the eight parameters above, those provided to the background noise parameter update module are zcr, ra, f_flux and t_flux. Those provided to the signal classification decision module are pitch, meangain, isf_meanSD and level_meanSD.
The signal classification decision module uses snr and Vad_flag from the signal initial classification module (PSC), together with the sub-band energy parameters, pitch, meangain, Isf_meanSD and level_meanSD from the classification parameter extraction module, to make the final classification of the signal as non-useful signal (NOISE), speech signal (SPEECH) or music signal (MUSIC). The signal classification decision module may include a parameter update sub-module and a decision sub-module. The parameter update sub-module updates the thresholds used in the signal classification decision according to the signal-to-noise ratio and provides the updated thresholds to the decision sub-module. The decision sub-module receives the sound signal type from the PSC module and, for useful signals, determines the type of the useful signal based either on the open-loop pitch parameters, the ISF parameters, the sub-band energy parameters and the updated thresholds, or on the ISF parameters, the sub-band energy parameters and the updated thresholds; it then sends the determined type to the encoder mode and rate selection module.
Determining whether a useful signal is a speech signal or a music signal proceeds as follows. The speech flag and the music flag are first both set to 0. The signal is then preliminarily classified as speech, music or uncertain according to the pitch flag, the long-term signal correlation value, the short-time mean of the ISF distance and the mean sub-band energy standard deviation, and the speech flag or the music flag is modified to match the preliminary speech or music classification. Next, the preliminary classification is corrected according to the sub-band energy, the long-term signal correlation value, the mean sub-band energy standard deviation, speech_flag, music_flag, whether the number of consecutive frames with pitch equal to 1 exceeds a preset hangover frame threshold, the number of consecutive music frames, the number of consecutive speech frames and the type of the previous frame. This yields the type of the useful signal, which is either speech or music.
The specific procedure for determining whether a useful signal is speech or music is described below.
To keep the signal decision stable and avoid frequent switching of the decision result, this embodiment provides a flag hangover mechanism for the parameters. The values of the feature flags pitch_flag, level_meanSD_high_flag, ISF_meanSD_high_flag, ISF_meanSD_low_flag, level_meanSD_low_flag and meangain_flag are determined according to this hangover mechanism, as shown in Figure 8.
The length of the hangover period in Figure 8 is determined by the parameter hangover flag value. This embodiment provides two hangover settings, that is, two schemes for determining the parameter hangover flag value:
In the first scheme, when a parameter value is above or below a certain threshold, the corresponding parameter hangover counter is incremented by one; otherwise the counter is set to 0. Different parameter hangover flags are set according to the counter value: the larger the counter value, the longer the hangover indicated by the flag. How the flag value is set from the counter is determined according to the actual situation and is not detailed here.
In the second scheme, the hangover length is controlled by the error rate (ER) of each internal node of the decision tree obtained when training the parameters: a parameter with a low error rate gets a short hangover, and a parameter with a high error rate gets a long hangover.
After that, if the current signal is classified as a useful signal, the initial classification into speech and music is performed.
The initial speech decision is made first, as shown in Figure 9. Step 901 sets the speech flag to 0. Step 902 then determines whether Isf_meanSD is greater than a preset first ISF speech threshold (for example 1500); if so, the speech flag is set to 1; otherwise,
step 903 determines whether pitch is 1 and the pitch lag value t_top_mean obtained from the open-loop pitch search is less than the pitch speech threshold (for example 40); if so, the speech flag is set to 1; otherwise,
step 904 determines whether the number of consecutive frames with pitch equal to 1 exceeds a preset hangover frame threshold (for example 2 frames); if so, the speech flag is set to 1; otherwise,
step 905 determines whether meangain is greater than a preset long-term correlation speech threshold (for example 8000); if so, the speech flag is set to 1; otherwise,
step 906 determines whether one or both of level_meanSD_high_flag and ISF_meanSD_high_flag equal 1; if so, the speech flag is set to 1; otherwise the speech flag is left unchanged.
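The cascade of steps 901 to 906 can be sketched as a single function. This is an illustrative reading of Figure 9 as described in the text, not a definitive implementation: the function and argument names are hypothetical, and the default thresholds are simply the example values quoted above.

```python
def initial_speech_flag(isf_mean_sd, pitch, t_top_mean, pitch_run_frames,
                        meangain, level_mean_sd_high_flag, isf_mean_sd_high_flag,
                        isf_speech_thr=1500, pitch_speech_thr=40,
                        hangover_frames_thr=2, meangain_speech_thr=8000):
    """Return the speech flag (0 or 1) from the initial speech decision."""
    # Step 901: the flag starts at 0; each test below may set it to 1.
    if isf_mean_sd > isf_speech_thr:                      # step 902
        return 1
    if pitch == 1 and t_top_mean < pitch_speech_thr:      # step 903
        return 1
    if pitch_run_frames > hangover_frames_thr:            # step 904
        return 1
    if meangain > meangain_speech_thr:                    # step 905
        return 1
    if level_mean_sd_high_flag == 1 or isf_mean_sd_high_flag == 1:  # step 906
        return 1
    return 0
```

Because every branch sets the same flag, the order of the tests only affects how early the function returns, not its result.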
The initial music decision is then made, as shown in Figure 10. Step 1001 first sets the music flag to 0. Step 1002 then determines whether the signal satisfies both ISF_meanSD_low_flag = 1 and level_meanSD_low_flag = 1; if so, the music signal flag music_flag is set; otherwise the music flag is left unchanged.
After that, the initial decision result is corrected as shown in Figure 11:
Step 1101 first determines whether the instantaneous sub-band energy is less than the sub-band energy threshold (for example 5000); if so, step 1102 is executed; otherwise the signal is determined to be of the uncertain class (UNCERTAIN).
Step 1102 determines whether meangain_flag = 1 and the music duration counter is less than the music-duration speech decision threshold (for example 3); if so, the signal is determined to be speech; otherwise,
step 1103 determines whether ISF_meanSD is greater than a preset second ISF speech threshold (for example 2000); if so, the signal is determined to be speech; otherwise,
step 1104 determines whether level_energy is less than 10000 and the number of frames previously judged to be noise exceeds five; if so, the current signal class is set to uncertain, which reduces the misclassification of noise as music; otherwise,
step 1105 determines whether the music flag and the speech flag are both 1; if so, the current signal class is determined to be uncertain; otherwise,
step 1106 determines whether the music flag and the speech flag are both 0; if so, the current signal class is determined to be uncertain; otherwise,
step 1107 determines whether the music flag is 0 and the speech flag is 1; if so, the current signal type is determined to be speech; otherwise,
in step 1108, since the music flag is 1 and the speech flag is 0, the current signal type is determined to be music.
After the signal has been determined to be uncertain in step 1104, 1105 or 1106 above, step 1109 is executed: it determines whether pitch_flag = 1, ISF_meanSD is less than the ISF music threshold (for example 900) and the number of consecutive speech frames is less than 3; if so, the signal is determined to be music; otherwise the signal remains uncertain.
After the signal has been determined to be speech in step 1103 or 1107 above, step 1110 is executed: it determines whether the number of consecutive music frames is greater than 3 and ISF_meanSD is less than the ISF music threshold; if so, the signal is determined to be music; otherwise the signal is determined to be speech.
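The flag-combination part of Figure 11 (steps 1105 to 1108) and the two follow-up checks (steps 1109 and 1110) can be sketched as below. The sketch deliberately leaves out the energy and meangain tests of steps 1101 to 1104; the function names and string labels are hypothetical, and the thresholds are the example values from the text.

```python
SPEECH, MUSIC, UNCERTAIN = "SPEECH", "MUSIC", "UNCERTAIN"


def combine_initial_flags(speech_flag, music_flag):
    """Steps 1105-1108: map the two initial flags to a class."""
    if speech_flag == music_flag:          # both 1 (step 1105) or both 0 (step 1106)
        return UNCERTAIN
    return SPEECH if speech_flag == 1 else MUSIC   # steps 1107 and 1108


def refine_class(label, pitch_flag, isf_mean_sd, speech_run, music_run,
                 isf_music_thr=900, run_thr=3):
    """Steps 1109 and 1110: possible re-labelling to music."""
    if label == UNCERTAIN:                 # step 1109
        if pitch_flag == 1 and isf_mean_sd < isf_music_thr and speech_run < run_thr:
            return MUSIC
        return UNCERTAIN
    if label == SPEECH:                    # step 1110
        if music_run > run_thr and isf_mean_sd < isf_music_thr:
            return MUSIC
        return SPEECH
    return label                           # music is left unchanged here
```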
After speech and music signals have been determined by the above flow, signals that are still uncertain go through the flow of Figure 12 for a preliminary correction of the classification. Step 1201 first determines whether level_energy is less than the sub-band energy uncertain-class threshold (for example 5000); if so, the signal type remains uncertain; otherwise, step 1202 determines whether the number of consecutive music frames is greater than 1 and ISF_meanSD is less than the ISF music threshold; if so, the signal is determined to be music; otherwise:
The speech and music hangover flags are cleared. If the frames before the current frame are a continuous run of speech with strong continuity, the signal is judged against the speech feature parameters, and if the speech conditions are met the speech hangover flag is set, speech_hangover_flag = 1; this corresponds to steps 1203 to 1206 in Figure 12. If the frames before the current frame are a continuous run of music with strong continuity, the signal is judged against the music feature parameters, and if the music conditions are met the music hangover flag is set, music_hangover_flag = 1; this corresponds to steps 1207 to 1210 in Figure 12.
After that, as shown in steps 1211 to 1216 of Figure 12, if the speech hangover flag is 1 and the music hangover flag is 0, the current signal class is set to speech; if the music hangover flag is 1 and the speech hangover flag is 0, the current signal class is set to music; if the speech hangover flag and the music hangover flag are both 1 or both 0, the signal class is set to uncertain. In that case, if the preceding run of music exceeds 20 frames, the signal is determined to be music, and if the preceding run of speech exceeds 20 frames, the signal is determined to be speech.
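The resolution of the two hangover flags in steps 1211 to 1216 can be sketched as follows. The function name, argument names and string labels are hypothetical; the 20-frame limit is the value quoted in the text, and the sketch assumes the two run lengths cannot both exceed it at once, since a run of one class ends the run of the other.

```python
def resolve_hangover(speech_hangover_flag, music_hangover_flag,
                     speech_run, music_run, run_thr=20):
    """Return 'SPEECH', 'MUSIC' or 'UNCERTAIN' from the hangover flags."""
    if speech_hangover_flag == 1 and music_hangover_flag == 0:
        return "SPEECH"
    if music_hangover_flag == 1 and speech_hangover_flag == 0:
        return "MUSIC"
    # Flags agree (both 1 or both 0): uncertain unless a long run decides.
    if music_run > run_thr:
        return "MUSIC"
    if speech_run > run_thr:
        return "SPEECH"
    return "UNCERTAIN"
```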
After the preliminary correction above, the useful signal type undergoes a final correction in Figure 13, in which the class is further corrected according to the current context. In step 1301, if the current context is music and strongly persistent, lasting more than 3 seconds, that is, the current number of consecutive music frames exceeds 150, a forced correction may be made according to the value of ISF_meanSD to determine a music signal. In step 1302, if the current context is speech and strongly persistent, lasting more than 3 seconds, that is, the current number of consecutive speech frames exceeds 150, a forced correction may be made according to the value of ISF_meanSD to determine a speech signal. If the signal class is still uncertain after that, step 1303 corrects it according to the preceding context, that is, the currently uncertain signal is assigned the class of the preceding signal.
Once the class of the useful signal has been determined by the above flow, the three class counters and the thresholds in the signal classification decision module need to be updated. For the three class counters: if the current classification is music (signal_sort = music), the music counter music_countinue_counter is incremented by 1, and otherwise it is cleared; the other class counters are handled similarly, as shown in Figure 14, and are not detailed here. The thresholds are updated according to the signal-to-noise ratio output by the signal initial classification module; the example threshold values given in this embodiment were obtained by training under a 20 dB signal-to-noise ratio.
The background noise parameter update module uses some of the spectral distribution parameters computed by the classification parameter extraction module in the SAD to control the update rate of the background noise. In practical environments the energy level of the background noise may rise suddenly; the background noise estimate is then liable to remain stuck without updates because the signal keeps being judged as useful signal. The background noise parameter update module is provided to solve this problem.
From the parameters supplied by the classification parameter extraction module, the background noise parameter update module computes a spectral distribution parameter vector with the following elements:
short-time mean of the zero-crossing rate zcr
short-time mean of the high-to-low sub-band energy ratio ra
short-time mean of the sub-band energy frequency-domain fluctuation f_flux
short-time mean of the sub-band energy time-domain fluctuation t_flux
The short-time mean zcr_mean is computed as follows, and the others are computed similarly:
$zcr\_mean_m = ALPHA \cdot zcr\_mean_{m-1} + (1 - ALPHA) \cdot zcr_m$
where ALPHA = 0.96 and m is the frame index.
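The short-time mean is a first-order recursive average. A minimal sketch, assuming the recursion uses the previous frame's mean and the current frame's value, is shown below; the function name is hypothetical and ALPHA = 0.96 is the value given in the text.

```python
ALPHA = 0.96  # smoothing constant given in the text


def short_time_mean(prev_mean, value, alpha=ALPHA):
    """One recursion step: new mean from the previous mean and the current value."""
    return alpha * prev_mean + (1.0 - alpha) * value


def smooth(values, initial=0.0, alpha=ALPHA):
    """Apply the recursion frame by frame and return every intermediate mean."""
    means = []
    mean = initial
    for v in values:
        mean = short_time_mean(mean, v, alpha)
        means.append(mean)
    return means
```

With ALPHA = 0.96 each new frame contributes only 4 percent of the mean, so the smoothed value reacts slowly, which suits a parameter meant to describe stable background noise.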
This embodiment exploits the fact that the spectral characteristics of background noise are relatively stable; the members of the spectral distribution parameter vector are not limited to the four listed above. The update rate of the background noise is controlled by the difference between the current spectral distribution parameters and the estimate of the background noise spectral distribution parameters. This difference can be measured with the Euclidean distance, the Manhattan distance or a similar metric. One example of the invention uses the Manhattan distance (a distance measure comparable to the Euclidean distance), that is:
$d = \sum_i |p[i] - \hat{p}[i]|$
where $p$ is the spectral distribution parameter vector of the current signal and $\hat{p}$ is the estimate of the background noise spectral distribution parameter vector.
In one example of this embodiment, when d < TH1 the module outputs the update rate acc1, which is the fastest update rate; otherwise, when d < TH2, it outputs acc2; otherwise, when d < TH3, it outputs acc3; otherwise it outputs acc4. TH1, TH2, TH3 and TH4 are update thresholds whose values are determined by the actual environment.
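The distance computation and the threshold cascade that selects the update rate can be sketched together. The function names and the returned labels are hypothetical, and the thresholds must be supplied by the caller because the text leaves their values to the actual environment.

```python
def manhattan_distance(p, p_hat):
    """Sum of absolute element-wise differences between two equal-length vectors."""
    if len(p) != len(p_hat):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(p, p_hat))


def select_update_rate(distance, th1, th2, th3):
    """Map the distance to one of four update rate levels (acc1 is the fastest)."""
    if distance < th1:
        return "acc1"
    if distance < th2:
        return "acc2"
    if distance < th3:
        return "acc3"
    return "acc4"
```

A small distance means the current frame looks like the stored background noise, so the estimate can follow it quickly; a large distance suggests useful signal and calls for the slowest update.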
本发明实施例中, 通过确定背景噪声的更新速率, 并根据该更新 速率对噪声参数进行更新,再根据子带能量参数和更新后的噪声参数 对信号进行初始分类,确定接收的语音信号中的非有用信号和有用信 号, 降低了将有用信号判决为噪音信号的误判,提高了声音信号分类 的准确性。 In the embodiment of the present invention, the update rate of the background noise is determined, and the noise parameter is updated according to the update rate, and then the signal is initially classified according to the sub-band energy parameter and the updated noise parameter, and the received voice signal is determined. Non-useful signals and useful letters No., which reduces the misjudgment of determining the useful signal as a noise signal, and improves the accuracy of the classification of the sound signal.
From the description of the above embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software together with the necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The above describes specific embodiments of the present invention. In a specific implementation, the method of the present invention may be suitably modified to meet the particular needs of the situation. It should therefore be understood that the specific embodiments described are illustrative only and are not intended to limit the scope of protection of the present invention.

Claims

1. A method for classifying a sound signal, characterized in that the method comprises:
A. receiving a sound signal, and determining an update rate of the background noise according to a background noise spectral distribution parameter and a spectral distribution parameter of the sound signal;
B. updating a noise parameter according to the update rate, and classifying the sound signal according to a sub-band energy parameter and the updated noise parameter.
2. The method according to claim 1, characterized in that, after step B, the method further comprises:
C. for the useful signal obtained by the classification, determining the type of the useful signal based on an open-loop pitch parameter, an immittance spectral frequency (ISF) parameter, and a sub-band energy parameter, the type comprising a speech signal and a music signal.
3. The method according to claim 2, characterized in that, before step C, the method further comprises:
C0. detecting whether the noise estimate has converged; if so, performing step C1; otherwise, performing step C;
C1. for the useful signal obtained by the classification, determining the type of the useful signal based on the immittance spectral frequency (ISF) parameter and the sub-band energy parameter, the type comprising a speech signal and a music signal.
4. The method according to claim 3, characterized in that, in step C0, detecting whether the initial noise has converged comprises: determining whether the number of consecutive noise frames preceding the received sound signal exceeds a preset noise convergence threshold; if so, determining that the noise estimate has converged; otherwise, determining that the noise estimate has not converged.
5. The method according to claim 2, characterized in that step B further comprises: obtaining the determined type of the useful signal, determining a signal hangover length according to that type, and further classifying the sound signal according to the signal hangover length.
6. The method according to claim 2, characterized in that step C comprises:
initializing a speech flag and a music flag, then making a preliminary determination of the type of the useful signal, namely a speech type, a music type, or an uncertain type, according to a pitch parameter flag, a long-term signal correlation parameter, a short-time average of the immittance spectral frequency (ISF) distance, an average of the sub-band energy standard deviation, and the corresponding thresholds, and modifying the speech flag and the music flag according to the preliminarily determined speech type or music type;
correcting the preliminarily determined speech type, music type, or uncertain type according to the sub-band energy, the long-term signal correlation parameter, the average of the sub-band energy standard deviation, the speech flag, the music flag, whether the number of consecutive frames in which the pitch parameter flag equals 1 exceeds a preset hangover frame count threshold, the number of consecutive music frames, the number of consecutive speech frames, the type of the previous frame, and the corresponding thresholds, thereby finally determining the type of the useful signal, the type comprising a speech signal and a music signal.
7. The method according to claim 6, characterized in that the thresholds are adjusted according to the signal-to-noise ratio of the sound signal.
8. The method according to claim 1, characterized in that, after step B, the method further comprises:
D. for the non-useful signal obtained by the classification, determining its corresponding coding mode, and determining according to the determined coding mode whether the immittance spectral frequency (ISF) parameter needs to be calculated.
9. The method according to claim 1, characterized in that the noise parameter in step B comprises: a noise estimation parameter and a noise spectral distribution parameter.
10. The method according to claim 1 or 9, characterized in that step A comprises: calculating a difference parameter between the spectral distribution parameter of the sound signal and the background noise spectral distribution parameter, and then determining the update rate according to the difference parameter.
11. The method according to claim 10, characterized in that the spectral distribution parameters used in calculating the difference parameter comprise: a short-time average of the zero-crossing rate, a short-time average of the high-to-low sub-band energy ratio, a short-time average of the frequency-domain fluctuation of the sub-band energy, and a short-time average of the time-domain fluctuation of the sub-band energy.
12. An apparatus for classifying a sound signal, characterized in that the apparatus comprises: a background noise parameter update module and a primary signal classification (PSC) module;
the background noise parameter update module is configured to determine an update rate of the background noise according to a background noise spectral distribution parameter and a spectral distribution parameter of the current sound signal, and to send the determined update rate;
the PSC module is configured to receive the update rate from the background noise parameter update module, update a noise parameter, classify the current sound signal according to a sub-band energy parameter and the updated noise parameter, and send the sound signal type determined by the classification.
13. The apparatus according to claim 12, characterized in that the apparatus further comprises:
a signal classification decision module, configured to receive the sound signal type from the PSC module and, for the useful signal therein, to determine the type of the useful signal based on an open-loop pitch parameter, an immittance spectral frequency (ISF) parameter, and a sub-band energy parameter, or based on the ISF parameter and the sub-band energy parameter, the type comprising a speech signal and a music signal, and to send the determined type of the useful signal.
14. The apparatus according to claim 13, characterized in that the apparatus further comprises:
a classification parameter extraction module, configured to: receive the sound signal type from the PSC module and pass it to the signal classification decision module; obtain the immittance spectral frequency (ISF) parameter and the sub-band energy parameter, or additionally the open-loop pitch parameter, process the obtained parameters into signal classification feature parameters, and pass these to the signal classification decision module; and process the obtained parameters into the spectral distribution parameter of the sound signal and the background noise spectral distribution parameter, and pass these spectral distribution parameters to the background noise parameter update module;
the signal classification decision module then determines the type of the useful signal according to the signal classification feature parameters and the sound signal type determined by the PSC module, the type comprising a speech signal and a music signal.
15. The apparatus according to claim 13 or 14, wherein the PSC module comprises: a background noise estimation submodule, a signal-to-noise ratio calculation submodule, a useful signal estimation submodule, a decision threshold adjustment submodule, a comparison submodule, and a useful-signal hangover protection submodule; characterized in that:
the background noise estimation submodule receives the update rate from the background noise parameter update module, updates the noise parameter, and passes a background noise sub-band energy estimation parameter, calculated from the updated noise parameter, to the signal-to-noise ratio calculation submodule;
the signal-to-noise ratio calculation submodule is configured to receive the background noise sub-band energy estimation parameter, calculate the signal-to-noise ratio from that parameter and the sub-band energy parameter, and pass the signal-to-noise ratio to the signal classification decision module;
the signal classification decision module comprises a parameter update submodule and a decision submodule; the parameter update submodule is configured to update the thresholds used in the signal classification decision process according to the signal-to-noise ratio, and to provide the updated thresholds to the decision submodule;
the decision submodule is configured to receive the sound signal type from the PSC module and, for the useful signal therein, to determine the type of the useful signal based on the open-loop pitch parameter, the immittance spectral frequency (ISF) parameter, the sub-band energy parameter, and the updated thresholds, or based on the ISF parameter, the sub-band energy parameter, and the updated thresholds, and to send the determined type of the useful signal.
16. The apparatus according to claim 13, characterized in that the apparatus further comprises:
an encoder mode and rate selection module, configured to receive the type of the useful signal from the signal classification decision module, and to determine the coding mode and rate of the sound signal according to the received type of the useful signal.
17. The apparatus according to claim 14, characterized in that the apparatus further comprises:
an encoder parameter extraction module, configured to extract the immittance spectral frequency (ISF) parameter and the sub-band energy parameter, or additionally the open-loop pitch parameter, to pass the extracted parameters to the classification parameter extraction module, and to pass the extracted sub-band energy parameter to the PSC module.
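The noise convergence check of claim 4 can be illustrated with a small counter. This is a sketch under stated assumptions, not the claimed implementation: the class name, method names, and the choice to reset the count whenever a non-noise frame arrives are hypothetical, and the threshold value is left to the caller.

```python
class NoiseConvergenceTracker:
    """Counts consecutive noise frames and compares the count to a threshold."""

    def __init__(self, convergence_threshold: int) -> None:
        if convergence_threshold < 0:
            raise ValueError("convergence_threshold must be non-negative")
        self.convergence_threshold = convergence_threshold
        self.consecutive_noise_frames = 0

    def observe(self, frame_is_noise: bool) -> None:
        """Record the initial classification of one frame.

        Assumption: a non-noise frame breaks the run and resets the count.
        """
        if frame_is_noise:
            self.consecutive_noise_frames += 1
        else:
            self.consecutive_noise_frames = 0

    def has_converged(self) -> bool:
        """True when the run of noise frames exceeds the preset threshold."""
        return self.consecutive_noise_frames > self.convergence_threshold
```

In use, `has_converged()` would be queried for an incoming frame before that frame is passed to `observe()`, so that the count reflects the noise frames preceding the received signal.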
PCT/CN2007/003798 2006-12-05 2007-12-26 A classing method and device for sound signal WO2008067735A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP07855800A EP2096629B1 (en) 2006-12-05 2007-12-26 Method and apparatus for classifying sound signals

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 200610164456 CN100483509C (en) 2006-12-05 2006-12-05 Aural signal classification method and device
CN200610164456.7 2006-12-05

Publications (1)

Publication Number Publication Date
WO2008067735A1 true WO2008067735A1 (en) 2008-06-12

Family

ID=39491665

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/003798 WO2008067735A1 (en) 2006-12-05 2007-12-26 A classing method and device for sound signal

Country Status (3)

Country Link
EP (1) EP2096629B1 (en)
CN (1) CN100483509C (en)
WO (1) WO2008067735A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2684194C1 (en) * 2015-06-26 2019-04-04 ЗетТиИ Корпорейшн Method of producing speech activity modification frames, speed activity detection device and method

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5168162B2 (en) * 2009-01-16 2013-03-21 沖電気工業株式会社 SOUND SIGNAL ADJUSTMENT DEVICE, PROGRAM AND METHOD, AND TELEPHONE DEVICE
CN102714034B (en) * 2009-10-15 2014-06-04 华为技术有限公司 Signal processing method, device and system
CN102299693B (en) * 2010-06-28 2017-05-03 瀚宇彩晶股份有限公司 Message adjustment system and method
WO2012000882A1 (en) 2010-07-02 2012-01-05 Dolby International Ab Selective bass post filter
CN102446506B (en) * 2010-10-11 2013-06-05 华为技术有限公司 Classification identifying method and equipment of audio signals
BR112013026333B1 (en) 2011-04-28 2021-05-18 Telefonaktiebolaget L M Ericsson (Publ) frame-based audio signal classification method, audio classifier, audio communication device, and audio codec layout
US8990074B2 (en) * 2011-05-24 2015-03-24 Qualcomm Incorporated Noise-robust speech coding mode classification
US9099098B2 (en) * 2012-01-20 2015-08-04 Qualcomm Incorporated Voice activity detection in presence of background noise
US9173025B2 (en) 2012-02-08 2015-10-27 Dolby Laboratories Licensing Corporation Combined suppression of noise, echo, and out-of-location signals
US8712076B2 (en) 2012-02-08 2014-04-29 Dolby Laboratories Licensing Corporation Post-processing including median filtering of noise suppression gains
CN107195313B (en) * 2012-08-31 2021-02-09 瑞典爱立信有限公司 Method and apparatus for voice activity detection
CN102928713B (en) * 2012-11-02 2017-09-19 北京美尔斯通科技发展股份有限公司 A kind of background noise measuring method of magnetic field antenna
EP2922052B1 (en) * 2012-11-13 2021-10-13 Samsung Electronics Co., Ltd. Method for determining an encoding mode
CN104347067B (en) 2013-08-06 2017-04-12 华为技术有限公司 Audio signal classification method and device
CN106328152B (en) * 2015-06-30 2020-01-31 芋头科技(杭州)有限公司 automatic indoor noise pollution identification and monitoring system
CN105654944B (en) * 2015-12-30 2019-11-01 中国科学院自动化研究所 It is a kind of merged in short-term with it is long when feature modeling ambient sound recognition methods and device
CN107123419A (en) * 2017-05-18 2017-09-01 北京大生在线科技有限公司 The optimization method of background noise reduction in the identification of Sphinx word speeds
CN108257617B (en) * 2018-01-11 2021-01-19 会听声学科技(北京)有限公司 Noise scene recognition system and method
CN110992989B (en) * 2019-12-06 2022-05-27 广州国音智能科技有限公司 Voice acquisition method and device and computer readable storage medium
CN113257276B (en) * 2021-05-07 2024-03-29 普联国际有限公司 Audio scene detection method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1296258A (en) * 1999-11-10 2001-05-23 三菱电机株式会社 Noise canceller
CN1331825A (en) * 1998-12-21 2002-01-16 高通股份有限公司 Periodic speech coding
CN1354455A (en) * 2000-11-18 2002-06-19 深圳市中兴通讯股份有限公司 Sound activation detection method for identifying speech and music from noise environment
CN1430778A (en) * 2001-03-28 2003-07-16 三菱电机株式会社 Noise suppressor
CN1624766A (en) * 2000-08-21 2005-06-08 康奈克森特系统公司 Method for noise robust classification in speech coding

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5742734A (en) * 1994-08-10 1998-04-21 Qualcomm Incorporated Encoding rate selection in a variable rate vocoder
US6691084B2 (en) * 1998-12-21 2004-02-10 Qualcomm Incorporated Multiple mode variable rate speech coding
US6694293B2 (en) * 2001-02-13 2004-02-17 Mindspeed Technologies, Inc. Speech coding system with a music classifier

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAI L. ET AL.: "Feature Analysis and Extraction for Audio Automatic Classification", MINI-MICRO SYSTEMS, vol. 26, no. 11, November 2005 (2005-11-01), pages 2029 - 2034, XP008109115 *
QI F. AND BAO C.: "A new method to voiced/unvoiced/silence of speech classification using Support Vector Machine", ACTA ELECTRONICA SINICA, vol. 34, no. 4, April 2006 (2006-04-01), pages 605 - 611, XP008109092 *

Also Published As

Publication number Publication date
CN100483509C (en) 2009-04-29
EP2096629A1 (en) 2009-09-02
EP2096629B1 (en) 2012-10-24
EP2096629A4 (en) 2011-01-26
CN101197135A (en) 2008-06-11

Similar Documents

Publication Publication Date Title
WO2008067735A1 (en) A classing method and device for sound signal
JP3197155B2 (en) Method and apparatus for estimating and classifying a speech signal pitch period in a digital speech coder
KR100883656B1 (en) Method and apparatus for discriminating audio signal, and method and apparatus for encoding/decoding audio signal using it
KR100964402B1 (en) Method and Apparatus for determining encoding mode of audio signal, and method and appartus for encoding/decoding audio signal using it
CN101197130B (en) Sound activity detecting method and detector thereof
CN107004409B (en) Neural network voice activity detection using run range normalization
CN106409313B (en) Audio signal classification method and device
US6993481B2 (en) Detection of speech activity using feature model adaptation
RU2417456C2 (en) Systems, methods and devices for detecting changes in signals
KR101116363B1 (en) Method and apparatus for classifying speech signal, and method and apparatus using the same
WO2010072115A1 (en) Signal classification processing method, classification processing device and encoding system
WO2006019556A2 (en) Low-complexity music detection algorithm and system
WO2008148321A1 (en) An encoding or decoding apparatus and method for background noise, and a communication device using the same
CN101399039B (en) Method and device for determining non-noise audio signal classification
US8380494B2 (en) Speech detection using order statistics
US20120197642A1 (en) Signal processing method, device, and system
CN101149921A (en) Mute test method and device
CN101393741A (en) Audio signal classification apparatus and method used in wideband audio encoder and decoder
CN110992965B (en) Signal classification method and apparatus, and audio encoding method and apparatus using the same
JP3331297B2 (en) Background sound / speech classification method and apparatus, and speech coding method and apparatus
Górriz et al. An effective cluster-based model for robust speech detection and speech recognition in noisy environments
CN101393744B (en) Method for regulating threshold of sound activation and device
JPH10247093A (en) Audio information classifying device
CN111128244B (en) Short wave communication voice activation detection method based on zero crossing rate detection
CN1275223C (en) A low bit-rate speech coder

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07855800

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2007855800

Country of ref document: EP