CN101197135A

CN101197135A - Sound signal classification method and device

Info

Publication number: CN101197135A
Application number: CN 200610164456
Authority: CN
Inventors: 严勤; 邓浩江; 王珺; 许剑峰; 许丽净; 李伟; 张清; 桑盛虎; 杜正中
Original assignee: Huawei Technologies Co Ltd; Institute of Acoustics CAS
Current assignee: Huawei Technologies Co Ltd; Institute of Acoustics CAS
Priority date: 2006-12-05
Filing date: 2006-12-05
Publication date: 2008-06-11
Anticipated expiration: 2026-12-05
Also published as: EP2096629A1; WO2008067735A1; EP2096629B1; EP2096629A4; CN100483509C

Abstract

The invention discloses a sound signal classification method comprising: receiving a sound signal and determining an update rate of the background noise according to a spectrum distribution parameter of the background noise and a spectrum distribution parameter of the sound signal; and updating the noise parameter according to the update rate and classifying the sound signal according to a sub-band energy parameter and the updated noise parameter. The invention further discloses a sound signal classification device comprising: a background noise parameter update module, configured to determine the update rate of the background noise according to the spectrum distribution parameter of the background noise and the spectrum distribution parameter of the current sound signal, and to send the determined update rate; and a preliminary signal classification (PSC) module, configured to receive the update rate from the background noise parameter update module, update the noise parameter, classify the current sound signal according to the sub-band energy parameter and the updated noise parameter, and send the sound signal type determined by the classification.

Description

Sound signal classification method and device

Technical field

The present invention relates to the field of speech coding, and in particular to a sound signal classification method and a sound signal classification device.

Background technology

In voice communication, only about 40% of the signal contains speech; the rest is silence or background noise. To save transmission bandwidth, speech coding uses voice activity detection (VAD) so that the encoder can code background noise and active speech at different rates: background noise at a lower rate and active speech at a higher rate. This reduces the average bit rate and has greatly advanced variable-rate speech coding.

Existing voice activity detectors (VADs) were all developed for speech signals and divide the input audio signal into only two classes: noise and non-noise. Newer encoders such as AMR-WB+ and SMV include music detection as a correction and supplement to the VAD decision. A key feature of the AMR-WB+ encoder is that, after VAD detection, it encodes in different modes depending on whether the input audio signal is speech or music, so as to minimize the bit rate while preserving coding quality.

The two coding modes in AMR-WB+ correspond to two core coding algorithms: algebraic code excited linear prediction (ACELP) and transform coded excitation (TCX). ACELP builds a speech production model and fully exploits the characteristics of speech, so its coding efficiency for speech signals is very high, and the technology is mature. Extending a general audio encoder with ACELP therefore greatly improves its speech coding quality. Similarly, extending a low-bit-rate speech encoder with TCX improves its coding quality for wideband music.

The AMR-WB+ algorithm for selecting between ACELP and TCX comes in two variants of different complexity: open-loop selection and closed-loop selection. Closed-loop selection is the high-complexity default option; it is an exhaustive search based on the perceptually weighted signal-to-noise ratio (SNR). This selection method is very accurate, but its computational complexity is very high and its code size is large.

Open-loop selection comprises the following steps:

First, in step 101, the VAD module determines whether the signal is a non-useful signal or a useful signal according to the tone flag (Tone_flag) and the sub-band energy parameter (Level[n]).

Then, in step 102, preliminary mode selection (EC) is performed.

In step 103, the mode preliminarily determined in step 102 is corrected and the mode selection is refined (ESC) to determine the coding mode to be selected; this is done based on the open-loop pitch parameters and the ISF parameters.

In step 104, TCXS processing is performed: when the number of consecutive selections of the speech coding mode is less than three, a small-scale closed-loop exhaustive search is carried out to finally determine the coding mode. The speech coding mode is ACELP and the music coding mode is TCX.

The signal selection algorithm of AMR-WB+ described above has the following shortcomings:

1. When classifying signals, the existing VAD module does not distinguish well between noise and certain kinds of music signals, which reduces the accuracy of sound signal classification.

2. Computing the open-loop pitch parameters is necessary for the ACELP coding mode but unnecessary for the TCX coding mode. In the AMR-WB+ design, both the VAD and the open-loop mode selection algorithm use the open-loop pitch parameters, so the open-loop pitch must be computed for every frame. For non-ACELP coding modes such as TCX, this is redundant complexity that increases the computation of coding mode selection and reduces efficiency.

3. Although the VAD algorithms in current encoders perform well in speech detection and noise robustness, they may misjudge the trailing part of certain music signals as noise, which truncates the tail of the music and makes it sound unnatural.

4. The mode selection algorithm of AMR-WB+ does not take into account the SNR environment of the signal, so its ability to distinguish speech from music degrades further under low-SNR conditions.

Summary of the invention

In view of this, the present invention provides a sound signal classification method and a sound signal classification device that can improve the accuracy of sound signal classification and detection.

The sound signal classification and detection method provided by the invention comprises:

receiving a sound signal, and determining an update rate of the background noise according to the background noise spectrum distribution parameter and the spectrum distribution parameter of the sound signal; updating the noise parameter according to the update rate, and classifying the sound signal according to the sub-band energy parameter and the updated noise parameter.

The sound signal classification device provided by the invention comprises a background noise parameter update module and a preliminary signal classification (PSC) module.

The background noise parameter update module is configured to determine the update rate of the background noise according to the background noise spectrum distribution parameter and the spectrum distribution parameter of the current sound signal, and to send the determined update rate.

The PSC module is configured to receive the update rate from the background noise parameter update module, update the noise parameter, classify the current sound signal according to the sub-band energy parameter and the updated noise parameter, and send the sound signal type determined by the classification.

As can be seen from the above scheme, the present invention determines the update rate of the background noise, updates the noise parameter according to this update rate, and then performs a preliminary classification of the signal according to the sub-band energy parameter and the updated noise parameter, determining the non-useful and useful signals in the received sound signal. This reduces the misjudgment of useful signals as noise and improves the accuracy of sound signal classification.

Description of drawings

Fig. 1 is a schematic diagram of open-loop selection in the AMR-WB+ coding algorithm of the prior art;

Fig. 2 is an overall flowchart of the sound signal classification and detection method of the present invention;

Fig. 3 is a schematic diagram of the composition of the sound signal classification device of the present invention;

Fig. 4 is a schematic diagram of the composition of the system on which the specific embodiment of the invention is based;

Fig. 5 is a flowchart of one way in which the encoder parameter extraction module computes the parameters in the specific embodiment of the invention;

Fig. 6 is a flowchart of another way in which the encoder parameter extraction module computes the parameters in the specific embodiment of the invention;

Fig. 7 is a schematic diagram of the composition of the PSC module in the specific embodiment of the invention;

Fig. 8 is a schematic diagram of how the signal classification decision module determines the feature parameters in the specific embodiment of the invention;

Fig. 9 is a schematic diagram of how the signal classification decision module makes the speech decision in the specific embodiment of the invention;

Fig. 10 is a schematic diagram of how the signal classification decision module makes the music decision in the specific embodiment of the invention;

Fig. 11 is a schematic diagram of how the signal classification decision module corrects the initial decision result in the specific embodiment of the invention;

Fig. 12 is a schematic diagram of how the signal classification decision module performs the preliminary correction and classification of uncertain signals in the specific embodiment of the invention;

Fig. 13 is a schematic diagram of how the signal classification decision module performs the final classification correction of the signal in the specific embodiment of the invention;

Fig. 14 is a schematic diagram of how the signal classification decision module performs the parameter update in the specific embodiment of the invention.

Embodiment

To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

The main idea of the present invention is to determine the update rate of the background noise according to the spectrum distribution parameter of the current sound signal and the background noise spectrum distribution parameter, and to update the noise parameter according to this update rate. The useful and non-useful signals in the received sound signal are then determined using the updated noise parameter. Because the noise parameter used in this decision is more accurate, the accuracy of sound signal classification improves.

As shown in Fig. 2, the present invention first provides a sound signal classification and detection method, comprising:

Step 201: receive a sound signal, and determine the update rate of the background noise according to the background noise spectrum distribution parameter and the spectrum distribution parameter of the sound signal.

Step 202: update the noise parameter according to the update rate, and classify the sound signal according to the sub-band energy parameter and the updated noise parameter.

In step 202, the sound signal is mainly classified into the useful signal type and the non-useful signal type. After this, the type of the useful signal, either speech signal or music signal, can be further determined. Depending on whether the noise estimate has converged, this determination is based either on the open-loop pitch parameters, the immittance spectral frequency (ISF) parameters and the sub-band energy parameters, or on the ISF parameters and the sub-band energy parameters alone.

In addition, to prevent the tail of a music signal from being misjudged as a non-useful signal, which would degrade the sound quality, the present invention also obtains the determined useful signal type, determines the hangover length according to this type, and further determines the useful and non-useful signals in the received sound signal according to this hangover length. A longer hangover can be set for music signals, which improves their sound quality.

When determining whether a useful signal is a speech signal or a music signal, a signal that cannot yet be determined accurately can first be set to an uncertain type; the uncertain type is then corrected according to other parameters to finally determine the type of the useful signal.

Not every coding scheme for non-useful signals requires the ISF parameters. Therefore, to reduce the computation in the classification process and improve classification efficiency, the ISF parameters are not computed for a signal determined to be non-useful if its coding scheme does not need them.

As shown in Fig. 3, the present invention also provides a sound signal classification device comprising a background noise parameter update module and a preliminary signal classification (PSC) module. The background noise parameter update module is configured to determine the update rate of the background noise according to the spectrum distribution parameter of the current sound signal and the background noise spectrum distribution parameter, and to send the determined update rate to the PSC module. The PSC module is configured to update the noise parameter according to the update rate received from the background noise parameter update module, perform a preliminary classification of the signal according to the sub-band energy parameter and the updated noise parameter, and determine the received sound signal to be of the useful signal type or the non-useful signal type.

The sound signal classification device may further comprise a signal classification decision module. In that case, the PSC module also sends the determined signal type to the signal classification decision module, which determines the type of the useful signal, either speech signal or music signal, based on the open-loop pitch parameters, the immittance spectral frequency (ISF) parameters and the sub-band energy parameters, or based on the ISF parameters and the sub-band energy parameters.

The sound signal classification device may further comprise a classification parameter extraction module. In that case, the PSC module sends the determined signal type to the signal classification decision module through the classification parameter extraction module. The classification parameter extraction module is also configured to obtain the ISF parameters and the sub-band energy parameters, and optionally the open-loop pitch parameters; to process the obtained parameters into signal classification feature parameters and send them to the signal classification decision module; and to process the obtained parameters into the spectrum distribution parameter of the sound signal and the background noise spectrum distribution parameter and send these to the background noise parameter update module. The signal classification decision module then determines the type of the useful signal, either speech signal or music signal, according to the signal classification feature parameters and the signal type determined by the PSC module.

The PSC module may further be configured to send the signal-to-noise ratio (SNR) of the sound signal, computed while determining the signal type, to the signal classification decision module, which then also uses this SNR when determining whether the useful signal is a speech signal or a music signal.

The sound signal classification device may further comprise an encoder mode and rate selection module. The signal classification decision module sends the determined signal type to the encoder mode and rate selection module, which determines the coding mode and rate of the sound signal according to the received signal type.

The sound signal classification device may further comprise an encoder parameter extraction module, configured to extract the ISF parameters and the sub-band energy parameters, and optionally the open-loop pitch parameters, to send the extracted parameters to the classification parameter extraction module, and to send the extracted sub-band energy parameters to the PSC module.

The sound signal classification and detection method and the sound signal classification device provided by the invention are described below through a specific embodiment.

Fig. 4 is a schematic diagram of the composition of the system on which the specific embodiment is based. It includes a sound signal classification detector (sound activity detector, SAD), which, according to the needs of the encoder, divides the input digital audio signal into different classes, namely non-useful signal, speech and music, and thereby provides the encoder with a basis for coding mode selection and rate selection.

As can be seen in Fig. 4, the SAD module contains four submodules: a background noise estimation control module, a preliminary signal classification module, a classification parameter extraction module and a signal classification decision module. Because the SAD is a signal classifier used inside the encoder, it should make full use of the encoder's own parameters in order to reduce resource usage and computational complexity. The encoder parameter extraction module in the encoder therefore computes the sub-band energy parameters and the encoder parameters and provides them to the SAD module. The final output of the SAD module is the signal decision type (non-useful signal, speech or music), which is provided to the encoder mode and rate selection module for selecting the encoder mode and rate.

The encoder modules related to the SAD, the submodules of the SAD, and the interactions between the modules are described in detail below.

The encoder parameter extraction module in the encoder computes the sub-band energy parameters and the encoder parameters and provides them to the SAD module. The sub-band energy parameters can be computed by filter-bank filtering; the number of sub-bands is chosen according to the computational complexity and classification accuracy requirements. This embodiment is described using 12 sub-bands.

In this embodiment, the process by which the encoder parameter extraction module computes the parameters needed by the SAD module can be as shown in Fig. 5 or Fig. 6.

The flow shown in Fig. 5 comprises the following steps:

Step 501: the encoder parameter extraction module first computes the sub-band energy parameters.

Step 502: the encoder parameter extraction module decides, according to the initial signal decision result (Vad_flag) from the PSC module, whether the immittance spectral frequency (ISF) computation is needed. If so, step 503 is performed; otherwise, step 504 is performed.

The decision in this step is made as follows. If the current frame is a non-useful signal, the decision depends on the encoder: if the encoder needs the ISF parameters to code non-useful signals, the ISF computation is performed; if not, the encoder parameter extraction module finishes. If the current frame is a useful signal, the ISF computation is performed. Most coding modes need the ISF parameters for useful signals, so computing them adds no redundant complexity to the encoder. The ISF computation itself can follow the documentation of the various encoders and is not described further here.

Step 503: the encoder parameter extraction module computes the ISF parameters, and then step 504 is performed.

Step 504: the encoder parameter extraction module computes the open-loop pitch parameters.

The sub-band energy parameters computed by the flow of Fig. 5 are provided to the PSC module and the classification parameter extraction module in the SAD; the remaining parameters are provided to the classification parameter extraction module in the SAD.
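The ISF decision of step 502 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the encoder's implementation: the function name `needs_isf` and the argument `encoder_needs_isf_for_noise` are hypothetical names introduced here.

```python
def needs_isf(vad_flag: bool, encoder_needs_isf_for_noise: bool) -> bool:
    """Decide whether ISF parameters must be computed for the current frame.

    vad_flag: True if the PSC module judged the frame to be a useful signal.
    encoder_needs_isf_for_noise: hypothetical switch stating whether the
        encoder's coding scheme for non-useful frames uses ISF parameters.
    """
    if vad_flag:
        # Useful signal: most coding modes need ISF, so compute it.
        return True
    # Non-useful signal: compute ISF only if the encoder requires it.
    return encoder_needs_isf_for_noise
```

With this rule, the ISF computation is skipped only for non-useful frames whose coding scheme does not use ISF parameters.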

The flow shown in Fig. 6 adds to the flow of Fig. 5 a step that decides, according to whether the initial noise estimate has converged, whether to compute the open-loop pitch parameters. Steps 601 to 603 are essentially the same as steps 501 to 503 in Fig. 5. In step 604, it is judged whether the initial noise parameter, that is, the noise estimate, has converged. If it has not yet converged, the open-loop pitch parameters are computed in step 605; otherwise they are not computed.

For some coding modes, such as TCX, computing the open-loop pitch parameters is redundant. After the noise estimate has converged, it can essentially be determined that the coding mode corresponding to the signal does not need the open-loop pitch parameters, so, to reduce computational complexity, they are no longer computed.

Before the noise estimate converges, the open-loop pitch parameters must be computed to ensure that the noise estimate converges and to speed up convergence; since this computation occurs only in the start-up phase, its complexity can be neglected. The open-loop pitch computation can follow ACELP-based coding and is not described further here. The noise estimate is judged to have converged when the number of consecutive frames judged to be noise exceeds the noise convergence threshold (THR1); in an example of this embodiment, THR1 is 20.
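A minimal Python sketch of this convergence test is given below. The class name `NoiseConvergenceTracker` is hypothetical, and the choice to keep the converged state latched once it has been reached is an assumption made for illustration; the text only states that convergence is declared when the number of consecutive noise frames exceeds THR1.

```python
THR1 = 20  # noise convergence threshold used in the embodiment's example


class NoiseConvergenceTracker:
    """Tracks consecutive noise frames and reports noise-estimate convergence."""

    def __init__(self, threshold: int = THR1) -> None:
        self.threshold = threshold
        self.consecutive_noise_frames = 0
        self.converged = False

    def update(self, is_noise_frame: bool) -> bool:
        # Count consecutive noise frames; any non-noise frame resets the run.
        if is_noise_frame:
            self.consecutive_noise_frames += 1
        else:
            self.consecutive_noise_frames = 0
        # Assumption: once converged, the state stays converged.
        if self.consecutive_noise_frames > self.threshold:
            self.converged = True
        return self.converged
```

Under this sketch, the open-loop pitch parameters would be computed for a frame only while `tracker.converged` is still False.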

The extracted sub-band energy parameter is level[i], where i is the element index of the vector. In this embodiment i takes the values 1...12, corresponding respectively to 0-200 Hz, 200-400 Hz, 400-600 Hz, 600-800 Hz, 800-1200 Hz, 1200-1600 Hz, 1600-2000 Hz, 2000-2400 Hz, 2400-3200 Hz, 3200-4000 Hz, 4000-4800 Hz and 4800-6400 Hz.

The extracted ISF parameter is Isf_n[i], where n is the frame index and i, which takes the values 1...16, is the element index of the vector.

The extracted open-loop pitch parameters comprise:

the open-loop pitch gain (ol_gain), the open-loop pitch lag (ol_lag), and the tone flag (tone_flag). If the value of ol_gain is greater than the tone threshold (TONE_THR), tone_flag is set to 1.

The preliminary signal classification (PSC) module can be implemented with various existing VAD algorithms. It comprises a background noise estimation submodule, an SNR calculation submodule, a useful signal estimation submodule, a decision threshold adjustment submodule, a comparison submodule and a hangover protection submodule. In this embodiment, as shown in Fig. 7, the implementation of the PSC module differs from existing VAD algorithm modules in the following three respects:

I. The SNR calculation submodule computes the SNR from the background noise sub-band energy estimate and the sub-band energy parameters. Besides being used inside the PSC module, the computed SNR parameter (snr) is also sent to the signal classification decision module, so that the latter can distinguish speech from music more accurately under low-SNR conditions.

II. Because existing VADs do not distinguish well between noise and certain kinds of music, this embodiment improves the VAD as follows. The computation of the background noise parameters is controlled by the update rate acc provided by the background noise parameter update module. The background noise estimation submodule receives the update rate from the background noise parameter update module, updates the noise parameters, and sends the background noise sub-band energy estimate computed from the updated noise parameters to the SNR calculation submodule. The computation of the update rate is described later, together with the background noise parameter update module. In an example of this embodiment, the update rate can take four levels: acc1, acc2, acc3 and acc4. Each update rate determines a different upward update parameter (update_up) and downward update parameter (update_down), which correspond to the upward and downward update rates of the background noise respectively.

The noise parameters can then be updated using the scheme in AMR-WB+:

if (bckr_est_m[n] < level_{m-1}[n])

    update = update_up

else

    update = update_down

The noise estimate is then updated as:

bckr_est_{m+1}[n] = (1 - update) * bckr_est_m[n] + update * level_{m-1}[n]

The noise spectrum distribution parameter vector is updated as:

\tilde{p}_{m+1}[i] = (1 - update) * \tilde{p}_{m}[i] + update * p_{m}[i]

where

m: frame index

n: sub-band index

i: element index of the spectrum distribution parameter vector, i = 1, 2, 3, 4

bckr_est: background noise sub-band energy estimate

\tilde{p}: estimate of the background noise spectrum distribution parameter vector

p: spectrum distribution parameter vector of the current signal
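The two update formulas above can be sketched in Python as follows. The function names are hypothetical, and the numeric values of update_up and update_down are not given in the text, so any values passed in are placeholders chosen by the caller.

```python
def update_noise_estimate(bckr_est, level, update_up, update_down):
    """Update the background noise sub-band energy estimate for one frame.

    bckr_est: current estimates bckr_est_m[n], one value per sub-band.
    level: sub-band levels used in the update (level_{m-1}[n] in the text).
    update_up / update_down: rates selected from the update rate acc.
    """
    new_est = []
    for est, lvl in zip(bckr_est, level):
        # Track upward when the estimate is below the level, else downward.
        update = update_up if est < lvl else update_down
        new_est.append((1.0 - update) * est + update * lvl)
    return new_est


def update_noise_spectrum(p_tilde, p, update):
    """Update the background noise spectrum distribution parameter vector."""
    return [(1.0 - update) * pt + update * pc for pt, pc in zip(p_tilde, p)]
```

Both functions are first-order recursive smoothers: a larger update value makes the estimate follow the current frame more quickly.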

III. Existing VADs generally use a hangover to protect useful signals from being misjudged as noise. The hangover length is a trade-off between protecting the signal and improving transmission efficiency. For a conventional speech encoder, the hangover length can be a constant obtained through training. A multi-rate encoder, however, handles audio signals that include music, which often has a long low-energy tail. A conventional VAD has difficulty detecting this tail, so a longer hangover is needed to protect it. In this embodiment, the hangover length in the hangover protection submodule adapts to the SAD decision result: if the signal is judged to be music (SAD_flag = MUSIC), a long hangover (hang_len = HANG_LONG) is set; if it is judged to be speech (SAD_flag = SPEECH), a short hangover (hang_len = HANG_SHORT) is set. The setting is as follows:

if (SAD_flag == MUSIC)

    hang_len = HANG_LONG

else if (SAD_flag == SPEECH)

    hang_len = HANG_SHORT

else

    hang_len = 0

where:

SAD_flag: SAD decision flag

hang_len: hangover protection length

In an example of this embodiment, HANG_LONG = 100 and HANG_SHORT = 20, in units of frames.
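The following Python sketch shows one way the adaptive hangover length could be applied. The selection of hang_len follows the text; the counter logic in the hypothetical `HangoverProtector` class is an assumption based on common VAD practice, since the text does not specify how the hangover counter is run.

```python
HANG_LONG = 100   # frames, from the embodiment's example
HANG_SHORT = 20   # frames, from the embodiment's example


def hangover_length(sad_flag: str) -> int:
    """Select the hangover length from the SAD decision, as in the text."""
    if sad_flag == "MUSIC":
        return HANG_LONG
    if sad_flag == "SPEECH":
        return HANG_SHORT
    return 0


class HangoverProtector:
    """Hypothetical counter that keeps frames 'useful' during the hangover."""

    def __init__(self) -> None:
        self.remaining = 0

    def apply(self, raw_useful: bool, sad_flag: str) -> bool:
        if raw_useful:
            # Reload the hangover according to the latest SAD decision.
            self.remaining = hangover_length(sad_flag)
            return True
        if self.remaining > 0:
            # Raw decision says noise, but the hangover still protects it.
            self.remaining -= 1
            return True
        return False
```

In this sketch, a music frame followed by low-energy frames stays classified as useful for up to 100 further frames, whereas speech is protected for only 20.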

The classification parameter extraction module computes the parameters needed by the signal classification decision module and the background noise parameter update module from the Vad_flag parameter determined by the preliminary signal classification module and from the sub-band energy parameters, ISF parameters and open-loop pitch parameters provided by the encoder parameter extraction module. It then provides the sub-band energy parameters, ISF parameters, open-loop pitch parameters and the computed parameters to the signal classification decision module and the background noise parameter update module as appropriate. The parameters computed by the classification parameter extraction module comprise:

1. Pitch parameter (pitch)

The differences between consecutive open-loop pitch lags are compared. If the change in open-loop pitch lag is less than a preset threshold, the lag counter is incremented. If the sum of the lag counters of two consecutive frames is large enough, pitch is set to 1; otherwise pitch is set to 0. The computation of the open-loop pitch lag is described in the AMR-WB+/AMR-WB standard documents.

2. Long-term signal correlation parameter (meangain)

meangain is the running mean of tone over three adjacent frames, where tone = 1000*tone_flg; tone_flg is defined as in AMR-WB+.
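The meangain parameter can be sketched in Python as a three-frame running mean. The class name `MeanGain` is hypothetical, and averaging over fewer frames while the history is still filling is an assumption; the text does not describe the start-up behavior.

```python
from collections import deque


class MeanGain:
    """Running mean of tone = 1000 * tone_flg over three adjacent frames."""

    def __init__(self, window: int = 3) -> None:
        # deque with maxlen drops the oldest value automatically.
        self.history = deque(maxlen=window)

    def update(self, tone_flg: int) -> float:
        self.history.append(1000 * tone_flg)
        # Assumption: average over the frames seen so far during start-up.
        return sum(self.history) / len(self.history)
```

For a steady tonal signal the value settles at 1000, and it decays to 0 three frames after the tone flag clears.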

3. Zero-crossing rate (zcr)

zcr = \frac{1}{T} \sum_{i=1}^{T-1} II\{x(i) x(i-1) < 0\}

II{A} is 1 when A is true and 0 when A is false.
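The zero-crossing rate formula translates directly into Python. In this sketch the frame is assumed to be a sequence of T samples indexed from 0, and the function name is hypothetical.

```python
def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs in frame x whose signs differ."""
    T = len(x)
    if T == 0:
        return 0.0
    # II{x(i) x(i-1) < 0} contributes 1 for every sign change.
    crossings = sum(1 for i in range(1, T) if x[i] * x[i - 1] < 0)
    return crossings / T
```

A strictly alternating frame such as [1, -1, 1, -1] has three sign changes in four samples, giving zcr = 0.75.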

4. Sub-band energy time-domain fluctuation (t_flux)

t\_flux = \frac{\sum_{i=1}^{12} | level_{m}(i) - level_{m-1}(i) |}{short\_mean\_level\_energy}

where short_mean_level_energy is the short-term average energy.

5. High-to-low sub-band energy ratio (ra)

ra = \frac{sublevel\_high\_energy}{sublevel\_low\_energy}

where, in one example of the invention:

sublevel_high_energy = level[10] + level[11];

sublevel_low_energy = level[0] + level[1] + level[2] + level[3] + level[4] + level[5] + level[6] + level[7] + level[8] + level[9];

6. Sub-band energy frequency-domain fluctuation (f_flux)

f\_flux = \frac{\sum_{i=2}^{12} | level_{m}(i) - level_{m}(i-1) |}{short\_mean\_level\_energy}
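The three sub-band energy features (t_flux, ra and f_flux) can be sketched in Python as follows. The sub-band vectors are assumed to be 12-element sequences indexed from 0, as in the ra example, and short_mean_level_energy is taken as an input because the text does not say how it is computed. The denominators are assumed to be positive.

```python
def t_flux(level_cur, level_prev, short_mean_level_energy):
    """Time-domain fluctuation: change of each sub-band between two frames."""
    diff = sum(abs(c - p) for c, p in zip(level_cur, level_prev))
    return diff / short_mean_level_energy


def f_flux(level_cur, short_mean_level_energy):
    """Frequency-domain fluctuation: change between adjacent sub-bands."""
    diff = sum(abs(level_cur[i] - level_cur[i - 1])
               for i in range(1, len(level_cur)))
    return diff / short_mean_level_energy


def ra(level_cur):
    """Ratio of the two highest sub-bands to the ten lowest sub-bands."""
    sublevel_high_energy = level_cur[10] + level_cur[11]
    sublevel_low_energy = sum(level_cur[0:10])
    return sublevel_high_energy / sublevel_low_energy
```

A spectrally flat frame gives f_flux = 0, while a frame whose levels all changed since the previous frame gives a large t_flux.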

7. Short-term average of the ISF distance (isf_meanSD): the mean of the ISF distance Isf_SD over five adjacent frames, where

Isf\_SD = \sum_{i=1}^{16} | Isf_{m}(i) - Isf_{m-1}(i) |
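A Python sketch of Isf_SD and its five-frame average is given below. The names `isf_sd` and `IsfMeanSD` are hypothetical, and averaging over fewer than five frames during start-up is an assumption not stated in the text.

```python
from collections import deque


def isf_sd(isf_cur, isf_prev):
    """Sum of absolute ISF differences between two consecutive frames."""
    return sum(abs(c - p) for c, p in zip(isf_cur, isf_prev))


class IsfMeanSD:
    """Mean of Isf_SD over the five most recent frames."""

    def __init__(self, window: int = 5) -> None:
        self.history = deque(maxlen=window)

    def update(self, isf_cur, isf_prev) -> float:
        self.history.append(isf_sd(isf_cur, isf_prev))
        # Assumption: average over the frames seen so far during start-up.
        return sum(self.history) / len(self.history)
```

Since the text computes level_SD in the same way as Isf_SD, the same pattern with a window of two frames would apply to level_meanSD.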

8. Mean of the sub-band energy standard deviation (level_meanSD): the mean of the sub-band energy standard deviation (level_SD) over two adjacent frames; level_SD is computed in the same way as Isf_SD above.

Of the above eight parameters, zcr, ra, f_flux and t_flux are provided to the background noise parameter update module, and pitch, meangain, isf_meanSD and level_meanSD are provided to the signal classification decision module.

Signal classification judging module is used for according to snr, Vad_flag from signal preliminary classification module PSC, and from sub belt energy parameter, pitch, meangain, Isf_meanSD, the level_meanSD of sorting parameter extraction module signal is finally divided into: non-useful signal (NOISE), voice signal (SPEECH) and music signal (MUSIC).Can comprise in the signal classification judging module: parameter update submodule and judgement submodule; Described parameter update submodule is used for the thresholding according to described signal to noise ratio (S/N ratio) update signal classification judging process, and the thresholding after will upgrading offers described judgement submodule; Described judgement submodule is used to receive the voice signal type from the PSC module, and to wherein useful signal based on the open-loop pitch parameter, lead the thresholding after spectral frequency parameter, sub belt energy parameter and the described renewal, perhaps based on the thresholding of leading after spectral frequency parameter and sub belt energy parameter and the described renewal, determine the type of described useful signal, and the type that sends determined useful signal is to encoder modes and rate selection module.

Useful signal is defined as voice signal or music signal comprises: the value of voice identifier position and the value of music identification position at first are set are 0, then according to fundamental tone parameter identification, signal correlation values when long, lead spectrum distance and signal tentatively be defined as sound-type, music type or uncertain type, and according to sound-type of tentatively determining or the corresponding value of revising voice identifier position or music identification position of music type from short-time average parameter and sub belt energy substandard difference mean parameter; Again according to sub belt energy, signal correlation values, sub belt energy substandard difference mean parameter, speecn_flag, music_flag, pitch value are whether 1 continuous frame number is above the hangover frame number thresholding that sets in advance, continuous music frame number, continuous number of speech frames when long, and the type of previous frame, described sound-type, music type or the uncertain type tentatively determined are revised, determine the type of useful signal, described type comprises voice signal and music signal.

The specific flow for determining a useful signal to be a speech signal or a music signal is described below:

To keep the signal decision stable and avoid frequent switching of the decision result, this embodiment provides a hangover mechanism for the parameter flags: the values of the feature flags pitch_flag, level_meanSD_high_flag, ISF_meanSD_high_flag, ISF_meanSD_low_flag, level_meanSD_low_flag and meangain_flag are determined according to the hangover mechanism, as shown in Fig. 8.

The hangover length in Fig. 8 is determined by the parameter hangover flag value. This embodiment provides two hangover settings, that is, two schemes for determining the parameter hangover flag value:

In the first hangover scheme, when a parameter value is above or below a certain threshold, the corresponding parameter hangover counter is incremented by one; otherwise, the counter is reset to 0. Different parameter hangover flags are set according to the value of the parameter hangover counter: the larger the counter value, the longer the hangover of the parameter flag. How the parameter hangover flag value is set from the counter depends on the actual situation and is not described further here.
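The first scheme can be illustrated with a minimal sketch. The text fixes only the counter behaviour (increment on a threshold hit, reset otherwise) and states that a larger counter gives a longer hangover; the class name `ParamHangover`, the cap `max_hangover`, and the rule that the hangover length equals the capped counter value are all assumptions made for illustration, not the mapping used by the embodiment.

```python
from dataclasses import dataclass


@dataclass
class ParamHangover:
    """Hypothetical counter-based hangover for one feature flag."""
    threshold: float
    above: bool = True       # True: the condition is value > threshold
    max_hangover: int = 5    # assumed cap on the hangover length, in frames
    counter: int = 0         # consecutive frames meeting the condition
    remaining: int = 0       # hangover frames left once the condition stops

    def update(self, value: float) -> int:
        hit = value > self.threshold if self.above else value < self.threshold
        if hit:
            self.counter += 1
            # Assumed mapping: a longer run earns a longer hangover, up to a cap.
            self.remaining = min(self.counter, self.max_hangover)
            return 1
        self.counter = 0
        if self.remaining > 0:
            # Condition no longer met, but the flag is held during hangover.
            self.remaining -= 1
            return 1
        return 0


# Example: two frames above the threshold keep the flag set for two more frames.
h = ParamHangover(threshold=1500.0)
flags = [h.update(v) for v in [1600, 1700, 1000, 1000, 1000]]
```

With this assumed mapping, a parameter that has met its condition for longer is held longer after the condition stops, which matches the stated intent of avoiding frequent switching.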

In the second hangover scheme, the hangover length is controlled by the error rate (ER) of the internal decision-tree node corresponding to each training parameter: a parameter with a small error rate is given a short hangover, and a parameter with a large error rate is given a long hangover.

After this, if the current signal is classified as a useful signal, the preliminary classification into speech and music is performed:

First, the initial speech decision is made, as shown in Fig. 9. In step 901, the speech flag is set to 0. Then, in step 902, it is judged whether Isf_meanSD is greater than a preset first ISF speech threshold (for example, 1500); if so, the speech flag is set to 1; otherwise,

In step 903, it is judged whether the pitch value is 1 and the pitch lag value t_top_mean obtained from the open-loop pitch search is less than the pitch speech threshold (for example, 40); if so, the speech flag is set to 1; otherwise,

In step 904, it is judged whether the number of consecutive frames whose pitch value is 1 exceeds the preset hangover frame-count threshold (for example, 2 frames); if so, the speech flag is set to 1; otherwise,

In step 905, it is judged whether meangain is greater than a preset long-term correlation speech threshold (for example, 8000); if so, the speech flag is set to 1; otherwise,

In step 906, it is judged whether one or both of level_meanSD_high_flag and ISF_meanSD_high_flag have the value 1; if so, the speech flag is set to 1; otherwise, the value of the speech flag is left unchanged.
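Steps 901 to 906 form a cascade in which the first satisfied condition sets the speech flag. The following is a minimal sketch under stated assumptions: the function name, the argument names (for instance `pitch_run` for the number of consecutive frames whose pitch value is 1) and the use of keyword defaults are hypothetical, while the default threshold values are the examples quoted in the text.

```python
def initial_speech_flag(isf_mean_sd, pitch, t_top_mean, pitch_run, meangain,
                        level_mean_sd_high_flag, isf_mean_sd_high_flag,
                        isf_speech_th1=1500, pitch_speech_th=40,
                        hangover_frames_th=2, corr_speech_th=8000):
    """Return the speech flag (0 or 1) following the order of steps 901-906."""
    # Step 901: the flag starts at 0; each check below may set it to 1.
    if isf_mean_sd > isf_speech_th1:                      # step 902
        return 1
    if pitch == 1 and t_top_mean < pitch_speech_th:       # step 903
        return 1
    if pitch_run > hangover_frames_th:                    # step 904
        return 1
    if meangain > corr_speech_th:                         # step 905
        return 1
    if level_mean_sd_high_flag == 1 or isf_mean_sd_high_flag == 1:  # step 906
        return 1
    return 0                                              # flag unchanged


# Example call with hypothetical per-frame values.
flag = initial_speech_flag(isf_mean_sd=1200, pitch=1, t_top_mean=35,
                           pitch_run=0, meangain=3000,
                           level_mean_sd_high_flag=0, isf_mean_sd_high_flag=0)
```

Because every branch only ever sets the flag to 1, the order of the checks affects which step fires but not the resulting flag value.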

Then the initial music decision is made, as shown in Fig. 10:

In step 1001, the music flag is first set to 0. Then, in step 1002, it is judged whether the signal satisfies both ISF_meanSD_low_flag = 1 and level_meanSD_low_flag = 1; if so, the music flag music_flag is set to 1; otherwise, the value of the music flag is left unchanged.

After this, the initial decision result is corrected, as shown in Fig. 11:

First, in step 1101, it is judged whether the sub-band instantaneous energy is less than the sub-band energy threshold (for example, 5000); if so, step 1102 is executed; otherwise, the signal is determined to be of the uncertain class (UNCERTAIN);

In step 1102, it is judged whether meangain_flag = 1 and the music continuity counter is less than the music-continuity speech decision threshold (for example, 3); if so, the signal is determined to be a speech signal; otherwise,

In step 1103, it is judged whether the value of ISF_meanSD is greater than a preset second ISF speech threshold (for example, 2000); if so, the signal is determined to be a speech signal; otherwise,

In step 1104, it is judged whether level_energy is less than 10000 and the number of preceding frames judged to be noise exceeds five; if so, the current signal class is set to the uncertain class, in order to reduce misclassification of noise as music; otherwise,

In step 1105, it is judged whether the music flag and the speech flag are both 1; if so, the current signal class is determined to be the uncertain class; otherwise,

In step 1106, it is judged whether the music flag and the speech flag are both 0; if so, the current signal class is determined to be the uncertain class; otherwise,

In step 1107, it is judged whether the music flag is 0 and the speech flag is 1; if so, the current signal type is determined to be the speech class; otherwise,

In step 1108, since the music flag is 1 and the speech flag is 0, the current signal type is determined to be the music class.

After the signal is determined to be of the uncertain class in step 1104, 1105 or 1106, step 1109 is executed: it is judged whether pitch_flag = 1, ISF_meanSD is less than the ISF music threshold (for example, 900), and the number of consecutive speech frames is less than 3; if so, the signal is determined to be of the music class; otherwise, the signal remains of the uncertain class;

After the signal is determined to be of the speech class in step 1103 or step 1107, step 1110 is executed: it is judged whether the number of consecutive music frames is greater than 3 and ISF_meanSD is less than the ISF music threshold; if so, the signal is determined to be a music signal; otherwise, the signal is determined to be a speech signal.
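The correction flow of Fig. 11 can be sketched as a single function that follows the step order given in the text. This is only an illustration under assumptions: the function and argument names are hypothetical, the flags are taken to be 0 or 1, the sub-band instantaneous energy of step 1101 and level_energy of step 1104 are passed as separate arguments because the text names them differently, and the thresholds are the example values quoted above.

```python
SPEECH, MUSIC, UNCERTAIN = "SPEECH", "MUSIC", "UNCERTAIN"


def correct_initial_decision(subband_energy, level_energy, meangain_flag,
                             pitch_flag, isf_mean_sd, speech_flag, music_flag,
                             music_run, speech_run, noise_run):
    """Apply steps 1101-1110 in order and return the corrected class."""
    if not subband_energy < 5000:                   # step 1101
        return UNCERTAIN
    if meangain_flag == 1 and music_run < 3:        # step 1102
        return SPEECH
    if isf_mean_sd > 2000:                          # step 1103
        label = SPEECH
    elif level_energy < 10000 and noise_run > 5:    # step 1104
        label = UNCERTAIN
    elif speech_flag == music_flag:                 # steps 1105 and 1106
        label = UNCERTAIN
    elif speech_flag == 1:                          # step 1107
        label = SPEECH
    else:                                           # step 1108
        label = MUSIC

    if label == UNCERTAIN:                          # step 1109
        if pitch_flag == 1 and isf_mean_sd < 900 and speech_run < 3:
            return MUSIC
        return UNCERTAIN
    if label == SPEECH:                             # step 1110
        if music_run > 3 and isf_mean_sd < 900:
            return MUSIC
        return SPEECH
    return label


# Example: speech flag set, music flag clear, no long music run.
result = correct_initial_decision(subband_energy=1000, level_energy=1000,
                                  meangain_flag=0, pitch_flag=0,
                                  isf_mean_sd=1000, speech_flag=1,
                                  music_flag=0, music_run=0, speech_run=0,
                                  noise_run=0)
```

Steps 1105 and 1106 are merged into one comparison here because both lead to the uncertain class when the two flags are equal.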

After speech signals and music signals have been determined by the above flow, signals still in the uncertain class go through the flow shown in Fig. 12, which preliminarily corrects the classification. First, in step 1201, it is judged whether level_energy is less than the sub-band energy uncertain-class threshold (for example, 5000); if so, the signal type is determined to be the uncertain class. Otherwise, in step 1202, it is judged whether the number of consecutive music frames is greater than 1 and ISF_meanSD is less than the ISF music threshold; if so, the signal is determined to be of the music class. Otherwise:

The speech and music hangover flags are cleared. If the frames before this frame were continuously of the speech class and the continuity is strong, speech is decided according to the feature parameters of speech; if the speech condition is satisfied, the speech hangover flag is set, speech_hangover_flag = 1. This corresponds to steps 1203 to 1206 in Fig. 12. If the frames before this frame were continuously of the music class and the continuity is strong, music is decided according to the feature parameters of music; if the music condition is satisfied, the music hangover flag is set, music_hangover_flag = 1. This corresponds to steps 1207 to 1210 in Fig. 12.

After this, as shown in steps 1211 to 1216 in Fig. 12, if the speech hangover flag is 1 and the music hangover flag is 0, the current signal class is set to the speech class; if the music hangover flag is 1 and the speech hangover flag is 0, the current signal class is set to the music class; if the speech hangover flag and the music hangover flag are both 1 or both 0, the signal class is set to the uncertain class. In that case, if the preceding music continuity has exceeded 20 frames, the signal is determined to be of the music class, and if the preceding speech continuity has exceeded 20 frames, the signal is determined to be of the speech class.
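The resolution of the two hangover flags in steps 1211 to 1216 can be sketched as follows. The function name, the argument names and the order in which the two 20-frame continuity checks are tried are assumptions; the text does not say which check takes precedence, and in practice the two runs would not normally both be long at the same time.

```python
def resolve_hangover(speech_hangover_flag, music_hangover_flag,
                     music_run, speech_run, long_run=20):
    """Map the two hangover flags (0 or 1) to a class label."""
    if speech_hangover_flag == 1 and music_hangover_flag == 0:
        return "SPEECH"
    if music_hangover_flag == 1 and speech_hangover_flag == 0:
        return "MUSIC"
    # Both flags are 1 or both are 0: uncertain unless a long run decides.
    if music_run > long_run:
        return "MUSIC"
    if speech_run > long_run:
        return "SPEECH"
    return "UNCERTAIN"


# Example: both flags clear, but music has persisted for 25 frames.
label = resolve_hangover(0, 0, music_run=25, speech_run=0)
```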

After the above preliminary correction, the useful signal type is finally corrected as shown in Fig. 13, where the class is further corrected according to the current context. In step 1301, if the current context is music and its continuity is very strong, exceeding 3 seconds (that is, the current number of consecutive music frames exceeds 150), a forced correction can be made according to the value of ISF_meanSD to determine a music signal. In step 1302, if the current context is speech and its continuity is very strong, exceeding 3 seconds (that is, the current number of consecutive speech frames exceeds 150), a forced correction can be made according to the value of ISF_meanSD to determine a speech signal. If the signal class is still the uncertain class after this, then in step 1303 the signal class is corrected according to the preceding context, that is, the current uncertain signal is assigned the preceding signal class.

After the class of the useful signal has been determined by the above flow, the three class counters and the thresholds in the signal classification decision module need to be updated. For the three class counters, if the current class is music (signal_sort = music), the music counter music_countinue_counter is incremented by 1; otherwise it is cleared. The other class counters are handled similarly, as shown in Fig. 14, and are not described in detail here. The thresholds are updated according to the signal-to-noise ratio output by the primary signal classification module; the example threshold values given in this embodiment were obtained by training under a 20 dB signal-to-noise ratio condition.
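The counter update can be sketched in a few lines. Only the music counter is named in the text; the dictionary layout and the keys "speech" and "noise" for the other two counters are assumptions made for illustration, since Fig. 14 is not reproduced here.

```python
def update_class_counters(counters, signal_sort):
    """Increment the counter of the current class and clear the others."""
    return {name: (count + 1 if name == signal_sort else 0)
            for name, count in counters.items()}


# Example: a third consecutive music frame, followed by a speech frame.
counters = {"music": 2, "speech": 0, "noise": 0}
counters = update_class_counters(counters, "music")
counters = update_class_counters(counters, "speech")
```

Each counter therefore holds the length of the current run of its class, which is the quantity used by the continuity checks in the correction flow.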

The background noise parameter update module uses spectral distribution parameters computed by the classification parameter extraction module in the SAD to control the update rate of the background noise. In a real application environment, the energy level of the background noise may rise suddenly; the background noise estimate can then remain un-updated indefinitely because the signal keeps being judged to be a useful signal. The background noise parameter update module is provided to solve this problem.

From the parameters supplied by the classification parameter extraction module, the background noise parameter update module calculates the relevant spectral distribution parameter vector, which comprises the following elements:

The short-time average of the zero-crossing rate zcr

The short-time average of the high-to-low sub-band energy ratio ra

The short-time average of the sub-band energy frequency-domain fluctuation f_flux

The short-time average of the sub-band energy time-domain fluctuation t_flux

The short-time average zcr_mean is computed as follows, and the others are computed similarly:

zcr\_mean_{m} = ALPHA \cdot zcr\_mean_{m-1} + (1 - ALPHA) \cdot zcr_{m}

where ALPHA = 0.96 and m is the frame index.
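The short-time average above is a first-order recursive (exponential) average. A minimal sketch follows; the function name and the starting value of 0.0 for the running mean are assumptions, while ALPHA = 0.96 comes from the text.

```python
ALPHA = 0.96  # smoothing factor given in the text


def short_time_average(prev_mean, value, alpha=ALPHA):
    """One update of the recursion: mean_m = alpha * mean_{m-1} + (1 - alpha) * x_m."""
    return alpha * prev_mean + (1.0 - alpha) * value


# Example: track the zero-crossing rate over a few hypothetical frames.
zcr_mean = 0.0  # assumed initial value
for zcr in [0.10, 0.12, 0.11]:
    zcr_mean = short_time_average(zcr_mean, zcr)
```

With ALPHA close to 1, each new frame contributes only 4% to the average, so the statistic changes slowly from frame to frame.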

This embodiment exploits the fact that the spectral characteristics of background noise are comparatively stable. The members of the spectral distribution parameter vector need not be limited to the four listed above. The update rate of the current background noise is controlled by the difference d_cb between the current spectral distribution parameters and the estimate of the background noise spectral distribution parameters. This difference can be computed using the Euclidean distance, the Manhattan distance, or a similar algorithm. One example of this invention uses the Manhattan distance (a distance measure similar to the Euclidean distance), that is:

d_{cb} = \sum_{i=1}^{4} \left| p(i) - \tilde{p}(i) \right|

where p is the spectral distribution parameter vector of the current signal, and \tilde{p} is the estimate of the background noise spectral distribution parameter vector.
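The Manhattan distance between the two parameter vectors can be computed as in the following sketch. The function name and the example vectors are hypothetical; the element order (zcr, ra, f_flux, t_flux short-time averages) follows the list given earlier.

```python
def manhattan_distance(p, p_noise):
    """Sum of absolute element-wise differences between two vectors."""
    if len(p) != len(p_noise):
        raise ValueError("parameter vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(p, p_noise))


# Example with hypothetical 4-element spectral distribution vectors.
d_cb = manhattan_distance([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 5.0, 4.0])
```

Because the elements have different units and ranges, an implementation would probably scale them before summing; the text does not describe any such scaling, so none is applied here.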

In one example of this embodiment, when d_cb < TH1, the module outputs update rate acc1, which represents the fastest update rate; otherwise, when d_cb < TH2, it outputs update rate acc2; otherwise, when d_cb < TH3, it outputs update rate acc3; otherwise, it outputs update rate acc4. Here TH1, TH2, TH3 and TH4 are update thresholds, determined according to the actual environment.
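The threshold cascade that maps d_cb to an update rate can be sketched as below. The function name and the way thresholds and rates are passed in are assumptions, and the numeric values in the example are placeholders only, since the text leaves the thresholds and rates to be set for the actual environment.

```python
def select_update_rate(d_cb, thresholds, rates):
    """Return acc1..acc4 by comparing d_cb with TH1 < TH2 < TH3 in turn."""
    th1, th2, th3 = thresholds
    acc1, acc2, acc3, acc4 = rates
    if d_cb < th1:
        return acc1   # closest to the noise estimate: fastest update
    if d_cb < th2:
        return acc2
    if d_cb < th3:
        return acc3
    return acc4


# Example with placeholder thresholds and symbolic rate labels.
rate = select_update_rate(1.5, thresholds=(1.0, 2.0, 3.0),
                          rates=("acc1", "acc2", "acc3", "acc4"))
```

A small distance means the current frame resembles the background noise estimate, so the estimate can be updated quickly even when the frame was classified as a useful signal.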

The above describes specific embodiments of the present invention. In a concrete implementation, the method of the present invention may be suitably modified to meet the needs of the particular situation. It should therefore be understood that the specific embodiments described here are only examples and are not intended to limit the scope of protection of the present invention.

Claims

1. A sound signal classification method, characterized in that the method comprises:

A. receiving a sound signal, and determining an update rate of background noise according to a spectral distribution parameter of the background noise and a spectral distribution parameter of the sound signal;

B. updating a noise parameter according to the update rate, and classifying the sound signal according to a sub-band energy parameter and the updated noise parameter.

2. The method according to claim 1, characterized in that, after step B, the method further comprises:

C. for a useful signal obtained by the classification, determining the type of the useful signal based on an open-loop pitch parameter, an immittance spectral frequency (ISF) parameter and the sub-band energy parameter, the type comprising speech signal and music signal.

3. The method according to claim 2, characterized in that, before step C, the method further comprises:

C0. detecting whether the noise estimate has converged; if so, executing step C1; otherwise, executing step C;

C1. for the useful signal obtained by the classification, determining the type of the useful signal based on the ISF parameter and the sub-band energy parameter, the type comprising speech signal and music signal.

4. The method according to claim 3, characterized in that, in step C0, detecting whether the initial noise estimate has converged comprises: judging whether the number of consecutive noise frames preceding the received sound signal exceeds a preset noise convergence threshold; if so, determining that the noise estimate has converged; otherwise, determining that the noise estimate has not converged.

5. The method according to claim 2, characterized in that step B further comprises: obtaining the determined useful signal type, determining a hangover length according to this useful signal type, and further classifying the sound signal according to this hangover length.

6. The method according to claim 2, characterized in that step C comprises:

initializing a speech flag and a music flag; then preliminarily determining the type of the useful signal, comprising speech type, music type or uncertain type, according to the pitch parameter flag, the long-term signal correlation parameter, the ISF distance short-time average parameter, the sub-band energy standard deviation mean parameter and the corresponding thresholds; and modifying the speech flag and the music flag accordingly, based on the preliminarily determined speech type and music type;

correcting the preliminarily determined speech type, music type or uncertain type according to the sub-band energy, the long-term signal correlation parameter, the sub-band energy standard deviation mean parameter, the speech flag, the music flag, whether the number of consecutive frames whose pitch parameter flag value is 1 exceeds a preset hangover frame-count threshold, the number of consecutive music frames, the number of consecutive speech frames, the type of the previous frame and the corresponding thresholds, so as to finally determine the type of the useful signal, comprising speech signal and music signal.

7. The method according to claim 6, characterized in that the thresholds are adjusted according to the signal-to-noise ratio of the sound signal.

8. The method according to claim 1, characterized in that, after step B, the method further comprises:

D. for a non-useful signal obtained by the classification, determining its corresponding coding mode, and determining according to the determined coding mode whether the ISF parameter needs to be calculated.

9. The method according to claim 1, characterized in that the noise parameter in step B comprises: a noise estimation parameter and a noise spectral distribution parameter.

10. The method according to claim 1 or 9, characterized in that step A comprises: calculating a difference parameter between the spectral distribution parameter of the sound signal and the background noise spectral distribution parameter, and then determining the update rate according to this difference parameter.

11. The method according to claim 10, characterized in that the spectral distribution parameters involved in calculating the difference parameter comprise: a zero-crossing rate short-time average parameter, a high-to-low sub-band energy ratio short-time average parameter, a sub-band energy frequency-domain fluctuation short-time average parameter and a sub-band energy time-domain fluctuation short-time average parameter.

12. A sound signal classification device, characterized in that the device comprises: a background noise parameter update module and a primary signal classification (PSC) module;

the background noise parameter update module is configured to determine the update rate of the background noise according to the background noise spectral distribution parameter and the spectral distribution parameter of the current sound signal, and to send the determined update rate;

the PSC module is configured to receive the update rate from the background noise parameter update module, update a noise parameter, classify the current sound signal according to a sub-band energy parameter and the updated noise parameter, and send the sound signal type determined by the classification.

13. The device according to claim 12, characterized in that the device further comprises: a signal classification decision module, configured to receive the sound signal type from the PSC module and, for a useful signal, to determine the type of the useful signal based on an open-loop pitch parameter, an ISF parameter and the sub-band energy parameter, or based on the ISF parameter and the sub-band energy parameter, the type comprising speech signal and music signal, and to send the determined type of the useful signal.

14. The device according to claim 13, characterized in that the device further comprises: a classification parameter extraction module, configured to receive the sound signal type from the PSC module and send this sound signal type to the signal classification decision module; to obtain parameters comprising the ISF parameter and the sub-band energy parameter, or additionally the open-loop pitch parameter, process the obtained parameters into signal classification feature parameters and send them to the signal classification decision module; and to process the obtained parameters into the spectral distribution parameter of the sound signal and the background noise spectral distribution parameter, and send these spectral distribution parameters to the background noise parameter update module;

the signal classification decision module then determines the type of the useful signal according to the signal classification feature parameters and the sound signal type determined by the PSC module, the type comprising speech signal and music signal.

15. The device according to claim 13 or 14, wherein the PSC module comprises: a background noise estimation submodule, a signal-to-noise ratio calculation submodule, a useful signal estimation submodule, a decision threshold adjustment submodule, a comparison submodule and a useful-signal hangover protection submodule; characterized in that,

the background noise estimation submodule receives the update rate from the background noise parameter update module, updates the noise parameter, and sends the background noise sub-band energy estimation parameter calculated from the updated noise parameter to the signal-to-noise ratio calculation submodule;

the signal-to-noise ratio calculation submodule is configured to receive the background noise sub-band energy estimation parameter, calculate the signal-to-noise ratio from this parameter and the sub-band energy parameter, and send the signal-to-noise ratio to the signal classification decision module;

the signal classification decision module comprises: a parameter update submodule and a decision submodule; the parameter update submodule is configured to update the thresholds used in the signal classification decision process according to the signal-to-noise ratio, and to provide the updated thresholds to the decision submodule;

the decision submodule is configured to receive the sound signal type from the PSC module and, for a useful signal, to determine the type of the useful signal based on the open-loop pitch parameter, the ISF parameter, the sub-band energy parameter and the updated thresholds, or based on the ISF parameter, the sub-band energy parameter and the updated thresholds, and to send the determined type of the useful signal.

16. The device according to claim 13, characterized in that the device further comprises: an encoder mode and rate selection module, configured to receive the type of the useful signal from the signal classification decision module, and to determine the coding mode and rate of the sound signal according to the received type of the useful signal.

17. The device according to claim 14, characterized in that the device further comprises: an encoder parameter extraction module, configured to extract the ISF parameter and the sub-band energy parameter, or additionally the open-loop pitch parameter, to send the extracted parameters to the classification parameter extraction module, and to send the extracted sub-band energy parameter to the PSC module.