Summary of the invention
The problem addressed by the present invention is to provide a voice activity detection (VAD) method and device that can quickly and effectively distinguish speech from noise in an input voice signal, so that background noise is reduced while speech quality is preserved.
To solve the above problem, the technical solution of the present invention provides a voice activity detection method, comprising:
dividing the input voice signal into frames;
performing time-frequency analysis on the input voice signal frame by frame;
if the result of the time-frequency analysis is less than or equal to a first reference threshold, determining that the frame is a noise signal; if the result is greater than the first reference threshold and less than a second reference threshold, treating the frame as a pending signal and determining its type based on the determination result for the next frame; if the result is greater than or equal to the second reference threshold, determining that the frame is a speech signal; wherein the second reference threshold is a multiple of the first reference threshold.
Optionally, the first reference threshold and the second reference threshold are obtained by extracting and analyzing the first N frames of the input voice signal.
Optionally, determining the pending signal based on the determination result for the next frame comprises: assigning the pending signal the same signal type as the next frame.
Optionally, after the frame is determined to be a noise signal, the method further comprises updating the first reference threshold and the second reference threshold based on that noise frame.
Optionally, the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value. Updating the first and second reference thresholds based on the noise frame comprises: multiplying the maximum preset value and the result of the time-frequency analysis by respective preset weighting coefficients and updating the first reference threshold with the sum of the two products; and updating the second reference threshold from the updated first reference threshold according to the multiple relationship between the two thresholds.
Optionally, the voice activity detection method further comprises: retaining the P consecutive pending frames immediately preceding a determined speech signal, and retaining the Q consecutive pending frames immediately following the determined speech signal.
Optionally, the second reference threshold is 1.3 times the first reference threshold.
Optionally, each frame of the voice signal is 8 ms long.
Optionally, the time-frequency analysis comprises: computing the variance of the frame in the time domain and in the frequency domain respectively, to obtain a value reflecting the frame's combined rate of change in the time and frequency domains.
To solve the above problem, the technical solution of the present invention further provides a voice activity detection method, comprising:
dividing the input voice signal into frames;
setting a first reference threshold and a second reference threshold for the noise signal, the second reference threshold being a multiple of the first reference threshold;
determining whether the first reference threshold is within a preset range; if not, performing time-frequency analysis on the input voice signal frame by frame; if so, computing the zero-crossing rate of the input voice signal frame by frame, and performing the time-frequency analysis if the computed zero-crossing rate is greater than a preset threshold, otherwise determining that the frame is a noise signal;
if the result of the time-frequency analysis is less than or equal to the first reference threshold, determining that the frame is a noise signal; if the result is greater than the first reference threshold and less than the second reference threshold, treating the frame as a pending signal and determining its type based on the determination result for the next frame; if the result is greater than or equal to the second reference threshold, determining that the frame is a speech signal.
Optionally, the voice activity detection method further comprises setting the preset threshold of the zero-crossing rate based on the first reference threshold.
Optionally, the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value. The preset range comprises a first preset range and a second preset range; the first preset range is related to the maximum preset value, and the second preset range is related to the minimum preset value and an intermediate preset value, the intermediate preset value being greater than the minimum preset value and less than the maximum preset value. Setting the preset threshold of the zero-crossing rate based on the first reference threshold comprises: if the first reference threshold is within the first preset range, setting the preset threshold of the zero-crossing rate to a first preset threshold; if the first reference threshold is within the second preset range, setting the preset threshold of the zero-crossing rate to a second preset threshold.
Optionally, the first reference threshold and the second reference threshold are obtained by extracting and analyzing the first N frames of the input voice signal.
Optionally, determining the pending signal based on the determination result for the next frame comprises: assigning the pending signal the same signal type as the next frame.
Optionally, after the frame is determined to be a noise signal, the voice activity detection method further comprises updating the first reference threshold and the second reference threshold based on that noise frame.
Optionally, the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value. Updating the first and second reference thresholds based on the noise frame comprises: multiplying the maximum preset value and the result of the time-frequency analysis by respective preset weighting coefficients and updating the first reference threshold with the sum of the two products; and updating the second reference threshold from the updated first reference threshold according to the multiple relationship between the two thresholds.
To solve the above problem, the technical solution of the present invention further provides a voice activity detection device, comprising:
a framing unit, adapted to divide the input voice signal into frames;
a time-frequency analysis unit, adapted to perform time-frequency analysis on the input voice signal frame by frame;
a determination unit, adapted to determine that the frame is a noise signal if the result of the time-frequency analysis is less than or equal to a first reference threshold; to treat the frame as a pending signal, and determine its type based on the determination result for the next frame, if the result is greater than the first reference threshold and less than a second reference threshold; and to determine that the frame is a speech signal if the result is greater than or equal to the second reference threshold; wherein the second reference threshold is a multiple of the first reference threshold.
Optionally, the voice activity detection device further comprises a noise prediction unit, adapted to extract and analyze the first N frames of the input voice signal to obtain the first reference threshold and the second reference threshold.
Optionally, the voice activity detection device further comprises an updating unit, adapted to update the first reference threshold and the second reference threshold based on the noise frame after the determination unit determines that the frame is a noise signal.
Optionally, the voice activity detection device further comprises a storage unit, adapted to retain the P consecutive pending frames immediately preceding a determined speech signal and the Q consecutive pending frames immediately following the determined speech signal.
To solve the above problem, the technical solution of the present invention further provides a voice activity detection device, comprising:
a framing unit, adapted to divide the input voice signal into frames;
a first setting unit, adapted to set a first reference threshold and a second reference threshold for the noise signal, the second reference threshold being a multiple of the first reference threshold;
a first determination unit, adapted to determine whether the first reference threshold is within a preset range;
a zero-crossing rate calculation unit, adapted to compute the zero-crossing rate of the input voice signal frame by frame when the first reference threshold is determined to be within the preset range;
a second determination unit, adapted to determine whether the computed zero-crossing rate is greater than a preset threshold and, if it is not, to determine that the frame is a noise signal;
a time-frequency analysis unit, adapted to perform time-frequency analysis on the input voice signal frame by frame when the first determination unit determines that the first reference threshold is outside the preset range, or when the second determination unit determines that the computed zero-crossing rate is greater than the preset threshold;
a third determination unit, adapted to determine that the frame is a noise signal if the result of the time-frequency analysis is less than or equal to the first reference threshold; to treat the frame as a pending signal, and determine its type based on the determination result for the next frame, if the result is greater than the first reference threshold and less than the second reference threshold; and to determine that the frame is a speech signal if the result is greater than or equal to the second reference threshold.
Compared with the prior art, the present technical solution has the following advantages:
The input voice signal is divided into frames (with smooth transitions between frames), time-frequency analysis is then performed frame by frame, and the result of the time-frequency analysis is compared with the preset first and second reference thresholds of the noise signal. This makes it possible to identify quickly and effectively whether a given frame is a speech signal or a noise signal, so that background noise is reduced while speech quality is preserved.
It is determined whether the first reference threshold is within a preset range, and a different preset threshold of the zero-crossing rate is set depending on which preset range (that is, which type of noise signal) the first reference threshold falls in. The zero-crossing rate of the input voice signal is computed frame by frame; a frame whose zero-crossing rate does not exceed the preset threshold is determined to be a noise signal, and any other frame is checked further by time-frequency analysis. Different noise signals are thus checked in a targeted manner, which largely avoids false detections and missed detections and makes the identification of noise and speech signals more effective.
The first and second reference thresholds are updated continually on the basis of the noise signals already identified, so that the method adapts to changes in the background noise of the current environment and identifies noise and speech signals more accurately and effectively.
In addition, because the first and second reference thresholds are obtained by extracting and analyzing the first N frames of the input voice signal, reference thresholds suited to the current environment can be set at the very start of a voice call. The background noise of the current environment is thereby predicted well, and noise signals are identified more accurately.
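The adaptive threshold update listed among the optional features above (a weighted sum of the maximum preset value and the time-frequency result of the noise frame) can be illustrated with a minimal sketch. The weighting coefficients 0.1 and 0.9 are assumptions chosen only for illustration, since the text does not give their values; the maximum preset value 350 and the factor 1.3 are taken from the embodiment.

```python
def update_thresholds(noise_measure, max_preset=350.0,
                      w_max=0.1, w_measure=0.9, ratio=1.3):
    """Return updated (first, second) reference thresholds.

    noise_measure is the time-frequency result of a frame that was just
    determined to be noise. w_max and w_measure are hypothetical weights;
    the source only states that preset weighting coefficients are used.
    """
    first = w_max * max_preset + w_measure * noise_measure
    second = ratio * first  # second threshold follows the first
    return first, second


first, second = update_thresholds(250.0)
print(first, second)
```

With these assumed weights the first reference threshold tracks the most recent noise frame while being pulled slightly toward the maximum preset value.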
Detailed description of the embodiments
As stated in the Background Art, many prior-art methods reduce background noise at the cost of speech quality, and they handle complex background noise poorly. The present technical solution uses various simulation tools to find the differences in characteristics between speech signals and noise signals, and then combines speech smoothing (framing), time-domain zero-crossing rate calculation, time-domain variance calculation and frequency-domain variance calculation to obtain a value reflecting the combined rate of change of the input voice signal in the time and frequency domains. VAD is performed with a method that adapts to the background noise, so that speech and noise in the input voice signal can be identified quickly and effectively, and the true quality of the speech is restored while the noise is removed.
To make the above objects, features and advantages of the present invention easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings. Specific details are set forth in the following description so that the invention can be fully understood. However, the invention can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the invention. The invention is therefore not limited to the specific embodiments disclosed below.
Embodiment 1
Fig. 1 is a schematic flowchart of the voice activity detection method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the voice activity detection method comprises the following steps:
First, step S101 is executed: the input voice signal is divided into frames.
As those skilled in the art know, the purpose of speech signal analysis is to extract and represent, conveniently and effectively, the information carried by the speech signal; it is the prerequisite and foundation of speech signal processing. Only when parameters that characterize the speech signal have been obtained by analysis can they be used for efficient voice communication, speech synthesis, speech recognition and similar processing. Speech is generally divided into silent segments, unvoiced segments and voiced segments. Voiced sound is generally modeled as a train of skewed triangular pulses whose period is the pitch period, and unvoiced sound is modeled as random white noise. Because a speech signal is a non-stationary process, signal processing techniques designed for stationary signals cannot be applied to it directly. Within a short time span (for example 10 to 30 ms, or even less), however, the characteristics of the speech signal can be regarded as quasi-stationary; that is, the speech signal is short-time stationary. Exploiting this property, techniques for stationary signals can be applied to the short-time processing of speech. For example, windowed framing can be used to divide the input voice signal (which contains both speech and noise) into multiple frames, each short-time frame also being called an analysis frame (frame for short). Framing intercepts the input voice signal with a window function of finite length to form an analysis frame; the window function sets the samples outside the region to be processed to zero, yielding the current analysis frame. Although framing may split the input voice signal into contiguous segments, overlapping segmentation is more commonly used: the previous frame and the next frame share an overlapping part, called the frame shift, which gives a smooth transition between frames and preserves continuity. The ratio of the frame shift to the frame length (the length of one frame of the voice signal) is generally between 0 and 1/2. In this embodiment, each frame is 8 ms long, and the zero-crossing rate calculation and the prediction and estimation of background noise in subsequent steps all operate on 8 ms of data. Windowed framing of the input voice signal is a conventional technique in the art and is not described further here.
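A minimal sketch of overlapping windowed framing is given below for illustration. It assumes an 8 kHz sampling rate (so that an 8 ms frame holds 64 samples), a Hamming window, and an overlap of 16 samples; none of these values is stated in the text, which only fixes the 8 ms frame length and an overlap ratio of at most 1/2.

```python
import math


def split_frames(samples, frame_len=64, overlap=16):
    """Split samples into Hamming-windowed, overlapping frames.

    frame_len=64 equals 8 ms only under the assumed 8 kHz sampling rate.
    `overlap` is the number of samples shared by consecutive frames.
    Trailing samples that do not fill a whole frame are dropped.
    """
    if not 0 <= overlap <= frame_len // 2:
        raise ValueError("overlap must be between 0 and half the frame length")
    hop = frame_len - overlap
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


frames = split_frames([1.0] * 160)
print(len(frames), len(frames[0]))
```

The window tapers each frame toward its edges, which is one common way to obtain the smooth frame-to-frame transition described above.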
Step S102 is executed: the reference thresholds of the noise signal are set, comprising a first reference threshold and a second reference threshold. To distinguish speech from noise in a voice signal, the differences in characteristics between noise signals and speech signals must be analyzed, in particular for the various types of noise signal. This requires a large number of experiments in advance, in which each type of noise signal is analyzed and its characteristic parameters are extracted. A common approach is to perform time-domain and frequency-domain analysis on the noise signal to obtain a value reflecting its combined rate of change in the time and frequency domains, and from these values to derive statistically the range of reference thresholds that separates noise from speech quickly and effectively. Once the input voice signal has been divided into frames in step S101, each frame can be analyzed in subsequent steps and the result compared with the reference thresholds, so that the frame is determined to be a noise signal, a speech signal, or a signal awaiting further determination, depending on the outcome of the comparison. The determination process is described in detail in the steps below.
It should be noted that the reference thresholds comprise a first reference threshold and a second reference threshold. The first reference threshold is mainly used to identify noise signals, and the second reference threshold is mainly used to identify speech signals. Statistics from a large number of experiments show that the second reference threshold is a certain multiple of the first, so once the first reference threshold is determined, the second is determined as well. In this embodiment, the second reference threshold is 1.3 times the first reference threshold; the factor 1.3 was obtained statistically from a large number of experiments on many kinds of background noise.
In addition, a maximum preset value and a minimum preset value are defined when the first reference threshold is set. The value range of the first reference threshold is expressed as the interval [minimum preset value, maximum preset value]; that is, the first reference threshold is greater than or equal to the minimum preset value and less than or equal to the maximum preset value. An intermediate preset value may also be set between the minimum and maximum preset values; its value range is expressed as the interval (minimum preset value, maximum preset value), that is, the intermediate preset value is greater than the minimum preset value and less than the maximum preset value. The choice of the maximum and minimum preset values also affects the final determination result, so they should be chosen according to actual conditions when the first reference threshold is set. In a specific implementation, the maximum preset value of the first reference threshold is set to 350, the minimum preset value to 240, and the intermediate preset value to 280.
In this embodiment, the reference thresholds (the first and second reference thresholds) are obtained by extracting and analyzing the first N frames of the input voice signal. In general, the larger N is (that is, the more frames are collected), the better the background noise of the current environment is predicted at the start of a voice call. However, collecting more frames lengthens the analysis, so determining the reference thresholds takes longer and the reference thresholds of the noise signal cannot be set promptly. In a specific implementation, the value of N can therefore be chosen according to actual conditions. Because the reference thresholds are obtained by extracting and analyzing the first N frames of the input voice signal, reference thresholds suited to the current environment can be set at the very start of a voice call. The background noise of the current environment is thereby predicted well, and noise signals are identified more accurately.
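The text does not specify how the first N frames are analyzed. Purely as an illustrative assumption, the sketch below takes the mean of the per-frame time-frequency results of the first N frames as the first reference threshold and derives the second from the factor 1.3; the function name and the averaging rule are hypothetical.

```python
def initial_thresholds(frame_measures, n, ratio=1.3):
    """Estimate (first, second) reference thresholds from the first n frames.

    frame_measures holds one time-frequency result per frame. Averaging
    them is an assumption of this sketch, not a rule given in the text.
    """
    head = frame_measures[:n]
    if not head:
        raise ValueError("at least one frame is required")
    first = sum(head) / len(head)
    return first, ratio * first


first, second = initial_thresholds([250.0, 270.0, 290.0, 1000.0], n=3)
print(first, second)
```

Only the first three values contribute here, so a loud later frame does not distort the initial noise estimate.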
In other embodiments, suitable reference thresholds may instead be selected in advance according to actual conditions, for example by setting them manually before the voice call; alternatively, default reference thresholds configured beforehand may be used.
Step S103 is executed: it is determined whether the first reference threshold is within a preset range. As described above, the first reference threshold is mainly used to identify noise signals. However, a few special classes of noise signal resemble speech in some characteristics, so it may be difficult to tell from the reference thresholds alone whether a given frame is noise or speech. In other words, comparison with the reference thresholds by itself may fail to identify such noise reliably, which can lead to false detections and missed detections. Because different noise signals have many different characteristics, other characteristics of these special classes of noise can be exploited, for example the different statistical behavior that different noise signals show when their rates of change and amplitudes differ. A corresponding method is then applied to make a preliminary determination on the voice signal, so that part of the background noise (the special classes of noise signal mentioned above) can be identified effectively.
If step S103 determines that the first reference threshold is outside the preset range, step S104 is executed: time-frequency analysis is performed on the input voice signal frame by frame. The time-frequency analysis comprises time-domain analysis and frequency-domain analysis. Specifically, the variance of a frame is computed in the time domain and in the frequency domain respectively, to obtain a value reflecting the frame's combined rate of change in the time and frequency domains. For the time domain, the variance formula is applied to the smoothed frame obtained by framing. For the frequency domain, a fast Fourier transform (FFT) is first applied to the smoothed frame, the variance of the resulting Fourier transform is computed, and the modulus of the result is taken as the rate of change in the complex frequency domain. The time-domain variance and the frequency-domain variance are then each multiplied by a weighting coefficient (the two coefficients sum to 1) and added; the resulting value reflects the combined rate of change of the frame in the time and frequency domains and is the result of the time-frequency analysis. The methods of time-domain and frequency-domain analysis are conventional in the art and are not described further here.
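The time-frequency measure can be sketched as follows. This is one possible reading of the description, not a definitive implementation: the frequency-domain term is taken as the modulus of the complex variance of the Fourier coefficients, the equal weights of 0.5 are an assumption, and a plain DFT stands in for the FFT to keep the example free of dependencies.

```python
import cmath


def dft(frame):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame))
            for k in range(n)]


def time_variance(frame):
    """Population variance of the time-domain samples."""
    mean = sum(frame) / len(frame)
    return sum((x - mean) ** 2 for x in frame) / len(frame)


def freq_variance(frame):
    """Modulus of the complex variance of the Fourier coefficients."""
    spectrum = dft(frame)
    mean = sum(spectrum) / len(spectrum)
    var = sum((c - mean) ** 2 for c in spectrum) / len(spectrum)
    return abs(var)


def time_frequency_measure(frame, w_time=0.5):
    """Weighted sum of both variances; the two weights sum to 1."""
    if not 0.0 <= w_time <= 1.0:
        raise ValueError("w_time must lie in [0, 1]")
    return w_time * time_variance(frame) + (1.0 - w_time) * freq_variance(frame)


print(time_frequency_measure([1.0, -1.0, 1.0, -1.0]))
```

In practice an FFT routine would replace `dft`; the naive version is adequate for a 64-sample frame and keeps the sketch self-contained.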
If step S103 determines that the first reference threshold is within the preset range, step S105 is executed: the zero-crossing rate of the input voice signal is computed frame by frame. Zero-crossing rate calculation is another commonly used time-domain analysis method for speech signals. As those skilled in the art know, the zero-crossing rate (short-time zero-crossing rate) is the number of times the signal waveform crosses the horizontal axis (zero level) within one frame, and it reflects the spectral characteristics of the signal. For a continuous signal, a zero crossing means that the time-domain waveform passes through the time axis; for a discrete signal, a zero crossing occurs when adjacent samples change sign. The zero-crossing rate is thus the number of sign changes between samples. The zero-crossing rates of unvoiced and voiced sound are each roughly Gaussian distributed and generally differ considerably, although the zero-crossing rate alone cannot separate unvoiced from voiced sound completely. Because the zero-crossing rates of the special classes of noise signal considered in this embodiment differ markedly from those of speech signals, such noise can be identified by comparing the computed zero-crossing rate with a preset threshold. Specifically, after the zero-crossing rate has been computed in step S105, step S106 is executed: it is determined whether the computed zero-crossing rate is greater than the preset threshold. If so, step S104 is executed and time-frequency analysis is performed on the input voice signal frame by frame; otherwise step S107 is executed and the frame is determined to be a noise signal.
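Counting sign changes between adjacent samples, as described above for discrete signals, can be sketched in a few lines. Treating a sample equal to zero as non-negative is an assumption of this sketch; the text does not say how exact zeros are handled.

```python
def zero_crossings(frame):
    """Count sign changes between adjacent samples of one frame.

    A sample equal to 0 is treated as non-negative (an assumption).
    """
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


print(zero_crossings([1.0, -1.0, 1.0, -1.0]))
```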
It should be noted that the choice of the preset threshold is very important in the zero-crossing rate calculation: a value that is too small produces false detections, and a value that is too large produces missed detections. In this embodiment, the preset threshold of the zero-crossing rate is therefore set based on the first reference threshold, so that a suitable value is obtained. Specifically, the preset range comprises a first preset range and a second preset range; the first preset range is related to the maximum preset value of the first reference threshold, and the second preset range is related to the minimum preset value and the intermediate preset value of the first reference threshold. Setting the preset threshold of the zero-crossing rate based on the first reference threshold comprises: if the first reference threshold is within the first preset range, setting the preset threshold of the zero-crossing rate to a first preset threshold; if the first reference threshold is within the second preset range, setting the preset threshold of the zero-crossing rate to a second preset threshold. The values of the first and second preset thresholds depend on the type of noise signal. As described above, several special classes of noise signal can be identified relatively easily from the computed zero-crossing rate, but the criterion (the preset threshold of the zero-crossing rate) differs from one class to another. For example, suppose there are two special classes of noise signal. For the first class, the computed zero-crossing rate is generally less than or equal to 19, so 19 can serve as the criterion for this class. If 19 were also used for the second class, detections might be missed, because many frames whose zero-crossing rate is greater than 19 and less than or equal to 28 actually belong to noise; 28 is therefore the more appropriate criterion for the second class. Conversely, using 28 as the criterion for the first class might produce false detections. Thus, a different preset range for the first reference threshold indicates a different type of noise in the current voice signal, and the preset threshold of the zero-crossing rate is set differently accordingly.
In a specific implementation, the first preset range is the range above the maximum preset value of the first reference threshold, that is, greater than 350. When the first reference threshold is within the first preset range, the predetermined threshold of the zero-crossing rate is set to the first predetermined threshold, which is 28. The second preset range lies between the minimum preset value and the middle preset value of the first reference threshold, that is, 240 to 280. When the first reference threshold is within the second preset range, the predetermined threshold of the zero-crossing rate is set to the second predetermined threshold, which is 19.
For example, if step S103 finds that the first reference threshold is 360, this value is greater than 350 and therefore within the first preset range. The frame may then be a special noise signal, and its zero-crossing rate needs to be calculated to determine whether it is noise. The predetermined threshold of the zero-crossing rate is set to 28 in this case, and if the calculated zero-crossing rate is less than or equal to 28, the frame is determined to be a noise signal. Similarly, if step S103 finds that the first reference threshold is 260, this value is between 240 and 280 and therefore within the second preset range. The frame may again be a special noise signal, the predetermined threshold of the zero-crossing rate is set to 19, and if the calculated zero-crossing rate is less than or equal to 19, the frame is determined to be a noise signal. If step S103 finds that the first reference threshold is 300, the first reference threshold is outside the preset ranges. The predetermined threshold of the zero-crossing rate would then generally be set to 1, which means the frame could almost never be judged to be noise on this basis. In practice, therefore, the zero-crossing rate is not calculated at all in this case, and step S104 is executed directly to perform time-frequency analysis on the input voice signal frame by frame.
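The range check and zero-crossing pre-check described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the zero-crossing rate is assumed to be a per-frame count of sign changes, and the bounds of the second preset range (240 to 280) are assumed to be inclusive.

```python
from typing import Optional, Sequence


def zcr_threshold_for(first_ref: float) -> Optional[int]:
    """Pick the zero-crossing threshold from the first reference threshold.

    Returns None when the first reference threshold is outside both preset
    ranges, in which case the zero-crossing check is skipped.
    """
    if first_ref > 350:            # first preset range: above the maximum preset value
        return 28
    if 240 <= first_ref <= 280:    # second preset range (inclusive bounds are an assumption)
        return 19
    return None


def zero_crossings(frame: Sequence[float]) -> int:
    """Count sign changes between consecutive samples (assumed definition)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def precheck_is_noise(frame: Sequence[float], first_ref: float) -> bool:
    """Return True if the frame is judged to be noise by the zero-crossing pre-check."""
    limit = zcr_threshold_for(first_ref)
    if limit is None:
        return False               # go straight to time-frequency analysis (step S104)
    return zero_crossings(frame) <= limit
```

A frame for which `precheck_is_noise` returns False is not thereby judged to be voice; it simply proceeds to time-frequency analysis.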
After step S104 obtains the result of the time-frequency analysis, step S108 is executed to compare that result with the reference thresholds. Specifically, if the result of the time-frequency analysis is less than or equal to the first reference threshold, step S109 is executed and the frame is judged to be a noise signal. If the result is greater than the first reference threshold and less than the second reference threshold, step S111 is executed: the frame is an undetermined signal and is judged based on the determination result of the next frame of the voice signal. If the result is greater than or equal to the second reference threshold, step S110 is executed and the frame is judged to be a voice signal.
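The three-way comparison of step S108 can be written as a small function. The names `FrameType` and `classify` are hypothetical and chosen only for this sketch; the comparison operators follow the text above.

```python
from enum import Enum


class FrameType(Enum):
    NOISE = "noise"
    UNDETERMINED = "undetermined"
    VOICE = "voice"


def classify(result: float, first_ref: float, second_ref: float) -> FrameType:
    """Compare a time-frequency analysis result with the two reference thresholds."""
    if result <= first_ref:
        return FrameType.NOISE          # step S109
    if result >= second_ref:
        return FrameType.VOICE          # step S110
    return FrameType.UNDETERMINED       # step S111: wait for the next frame
```

With a first reference threshold of 260 and a second reference threshold of 338 (1.3 times the first), a result of 250 is noise, 300 is undetermined, and 338 is voice.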
In step S111, judging the undetermined signal based on the determination result of the next frame means judging the undetermined signal to be of the same signal type as the next frame. Specifically, if the next frame is judged to be a voice signal, the undetermined signal is judged to be a voice signal; if the next frame is judged to be a noise signal, the undetermined signal is judged to be a noise signal; and if the next frame is itself judged to be an undetermined signal, the judgment is deferred again to the frame that follows it. For example, if the 1st frame is determined to be a noise signal, it is discarded directly. If the 2nd frame is judged to be an undetermined signal, it is stored temporarily in the buffer to wait for the determination result of the 3rd frame. If the 3rd frame is judged to be a voice signal, the 2nd frame (the undetermined signal) is judged to be a voice signal. If the 3rd frame is also judged to be an undetermined signal, the method waits for the result of the 4th frame; if the 4th frame is still undetermined, it waits for the 5th frame, and so on until a subsequent frame is determined to be either a noise signal or a voice signal. Thus, if the 1st to n-th frames are all judged to be undetermined signals and the (n+1)-th frame is judged to be a noise signal, the 1st to n-th frames are all judged to be noise signals; if the (n+1)-th frame is judged to be a voice signal, the 1st to n-th frames are all judged to be voice signals.
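The propagation rule above can be illustrated with a short batch sketch. It assumes the per-frame labels are already available as a list of strings, which is a simplification of the frame-by-frame procedure; the function name `resolve` is hypothetical, and the buffer capacity is not modelled here.

```python
def resolve(labels):
    """Give each run of undetermined frames the type of the first determined frame after it.

    `labels` is a list of "noise", "voice" or "undetermined". A trailing run with
    no determined frame after it stays undetermined.
    """
    resolved = list(labels)
    next_type = None
    # Walk backwards so each frame can see the nearest determined frame that follows it.
    for i in range(len(resolved) - 1, -1, -1):
        if resolved[i] == "undetermined":
            if next_type is not None:
                resolved[i] = next_type
        else:
            next_type = resolved[i]
    return resolved
```

For the example in the text, the labels noise, undetermined, undetermined, voice resolve to noise, voice, voice, voice.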
Of course, the capacity of the buffer is limited, so it cannot hold too many undetermined signals; moreover, voice signal processing must be performed in real time, so there is no need to keep undetermined signals from long ago. Therefore, only a predetermined quantity of undetermined signals is generally stored in the buffer to await the determination results of the following frames. When the number of undetermined frames held in the buffer exceeds the predetermined quantity, the frame that was stored first is discarded; that is, undetermined signals are stored on a first-in, first-out basis. For example, suppose the predetermined quantity is 8 and the 1st to 8th frames are all judged to be undetermined signals, so that all 8 frames are held in the buffer. If the 9th frame is judged to be a voice signal, the 1st to 8th frames are all voice signals, and the 1st frame can be taken as the beginning of this speech segment. If instead the 9th frame is judged to be an undetermined signal, the 1st frame (an undetermined signal) is discarded. Likewise, if the 10 frames following a certain voice frame are all judged to be undetermined signals, the 1st and 2nd frames after that voice frame are discarded. If the 11th frame is then a noise signal, the 8 undetermined frames held in the buffer are judged to be noise signals, and the voice frame in question can be taken as the end of this speech segment. (In an actual implementation, to preserve the naturalness of the voice and the smoothness of the transition, these 8 undetermined frames need not be discarded and can be output after speech processing.)
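A bounded first-in, first-out buffer of this kind might look like the sketch below. The class name `UndeterminedBuffer` and the string labels are hypothetical; the capacity of 8 is the example value from the text.

```python
from collections import deque


class UndeterminedBuffer:
    """Holds the indices of the most recent undetermined frames, first in, first out."""

    def __init__(self, capacity=8):
        # A deque with maxlen drops its oldest entry automatically when full.
        self.frames = deque(maxlen=capacity)

    def push(self, frame_index, label):
        """Feed one frame's label.

        Returns a list of (index, type) pairs for the buffered frames that this
        frame resolves; the list is empty while frames are still undetermined.
        """
        if label == "undetermined":
            self.frames.append(frame_index)
            return []
        resolved = [(i, label) for i in self.frames]
        self.frames.clear()
        return resolved
```

Pushing ten undetermined frames followed by a noise frame returns frames 3 to 10 as noise, because frames 1 and 2 were dropped when the buffer overflowed, which matches the second example in the text.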
In this embodiment, the voice activation detection method further comprises: saving the P undetermined frames immediately preceding a determined voice signal, and saving the Q undetermined frames immediately following the determined voice signal. When the determined voice signal is processed and output, the preceding P undetermined frames and the following Q undetermined frames are also processed and output, which preserves the naturalness of the voice and the smoothness of the transition. Note that P and Q here are the predefined maximum numbers of undetermined frames held in the buffer. In an actual implementation, the number of undetermined frames held in the buffer may be smaller than P or Q. For example, with P=8 and Q=5, if the 1st to 3rd frames are judged to be undetermined signals and the following frames are all voice signals, only 3 undetermined frames are actually held in the buffer. Likewise, if the 4 frames following a certain voice frame are all judged to be undetermined signals and the 5th frame after it is a noise signal or a voice signal, only 4 undetermined frames are actually held in the buffer. In this embodiment, P=Q=3; the values of P and Q can of course be adjusted as required.
In particular, the above voice activation detection method based on adaptive background noise can be applied on a voice conference server for echo cancellation and noise removal. In a voice conference, once each input voice channel has been processed by this method, the echo and noise introduced by the terminals can be removed effectively.
Based on the above voice activation detection method, this embodiment also provides a voice activation detection device. Fig. 2 is a structural diagram of the voice activation detection device provided by embodiment one of the present invention. As shown in Fig. 2, the voice activation detection device provided by this embodiment comprises: a framing unit 101, adapted to divide the input voice signal into frames; a first setup unit 102, adapted to set the reference thresholds of the noise signal, the reference thresholds comprising a first reference threshold and a second reference threshold, the second reference threshold being a multiple of the first reference threshold; a first identifying unit 103, connected to the first setup unit 102 and adapted to judge whether the first reference threshold set by the first setup unit 102 is within a preset range; a zero-crossing rate computing unit 104, connected to the framing unit 101 and the first identifying unit 103 and adapted to calculate the zero-crossing rate of the input voice signal frame by frame when the first identifying unit 103 judges that the first reference threshold is within the preset range; a second identifying unit 105, connected to the zero-crossing rate computing unit 104 and adapted to judge whether the calculated zero-crossing rate is greater than a predetermined threshold and, if it is not, to judge that the frame is a noise signal; a time-frequency analysis unit 106, connected to the framing unit 101, the first identifying unit 103 and the second identifying unit 105 and adapted to perform time-frequency analysis on the input voice signal frame by frame when the first identifying unit 103 judges that the first reference threshold is outside the preset range or when the second identifying unit 105 judges that the calculated zero-crossing rate is greater than the predetermined threshold; and a third identifying unit 107, connected to the time-frequency analysis unit 106 and adapted to judge that the frame is a noise signal if the result of the time-frequency analysis is less than or equal to the first reference threshold, to treat the frame as an undetermined signal and judge it based on the determination result of the next frame if the result is greater than the first reference threshold and less than the second reference threshold, and to judge that the frame is a voice signal if the result is greater than or equal to the second reference threshold. The third identifying unit 107 judges the undetermined signal based on the determination result of the next frame by judging the undetermined signal to be of the same signal type as the next frame.
In this embodiment, the voice activation detection device further comprises a second setup unit 109, which is connected to the first setup unit 102 and the second identifying unit 105 and is adapted to set the predetermined threshold of the zero-crossing rate based on the first reference threshold. Specifically, the preset range comprises a first preset range and a second preset range; the first preset range is related to the maximum preset value of the first reference threshold, and the second preset range is related to the minimum preset value and the middle preset value of the first reference threshold. The second setup unit 109 sets the predetermined threshold of the zero-crossing rate based on the first reference threshold as follows: if the first reference threshold is within the first preset range, the predetermined threshold of the zero-crossing rate is set to the first predetermined threshold; if the first reference threshold is within the second preset range, the predetermined threshold of the zero-crossing rate is set to the second predetermined threshold.
The voice activation detection device further comprises a noise prediction unit 108, which is connected to the framing unit 101 and the first setup unit 102 and is adapted to extract and analyze the first N frames of the input voice signal to obtain the reference thresholds (comprising the first reference threshold and the second reference threshold) set by the first setup unit 102.
In addition, the voice activation detection device further comprises a storage unit 110, which is connected to the third identifying unit 107 and is adapted to save the P undetermined frames immediately preceding a determined voice signal and the Q undetermined frames immediately following the determined voice signal.
For the specific implementation of the voice activation detection device, reference may be made to the voice activation detection method provided in this embodiment; details are not repeated here.
Embodiment two
Fig. 3 is a flow chart of the voice activation detection method provided by embodiment two of the present invention. As shown in Fig. 3, this embodiment differs from embodiment one in that, after the frame is judged to be a noise signal in step S107 or step S109, step S112 is also executed to update the reference thresholds based on this noise frame. Specifically, updating the reference thresholds based on this noise frame comprises: multiplying the maximum preset value of the first reference threshold and the result of the time-frequency analysis by their respective preset weighting coefficients, adding the two products, and updating the reference thresholds with the sum. For a frame that has been determined to be a noise signal, the result of its time-frequency analysis reflects the characteristics of the background noise in the current environment. The result of the time-frequency analysis of the noise frame can therefore be multiplied by a weighting coefficient a, and the maximum preset value of the first reference threshold multiplied by a corresponding weighting coefficient b, where a+b=1; the sum of the two products is taken as the new first reference threshold, and the new second reference threshold is then obtained from the updated first reference threshold using the multiple relationship between the two thresholds.
For example, suppose the currently set first reference threshold is 260 and the time-frequency analysis of a frame yields a result of 250. Step S108 then finds that the result is less than the first reference threshold, so step S109 is executed, followed by step S112, which updates the reference thresholds based on this noise frame. As described in embodiment one, the maximum preset value of the first reference threshold is 350. If the weighting coefficient for the result of the time-frequency analysis is 0.6, the weighting coefficient for the maximum preset value of the first reference threshold is 0.4, and the resulting value is 250*0.6+350*0.4=150+140=290. The updated first reference threshold is therefore 290, and because the second reference threshold is 1.3 times the first reference threshold in this embodiment, the updated second reference threshold is 377. The above is, of course, only one way of updating the reference thresholds based on the noise frame. In other embodiments, when the result of the time-frequency analysis is determined to be less than the first reference threshold, the first reference threshold may instead be replaced with the result of the time-frequency analysis.
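The weighted update of step S112 can be sketched in a few lines. The function name and parameter names are hypothetical; the default values (maximum preset value 350, weighting coefficient 0.6 for the analysis result, multiple 1.3) are the example values used in this embodiment.

```python
def update_thresholds(result, max_preset=350.0, weight_result=0.6, ratio=1.3):
    """Update both reference thresholds from the analysis result of a noise frame.

    new_first  = a * result + b * max_preset, with a + b = 1
    new_second = ratio * new_first
    """
    weight_max = 1.0 - weight_result          # b = 1 - a
    new_first = weight_result * result + weight_max * max_preset
    new_second = ratio * new_first
    return new_first, new_second


# Worked example from the text: a result of 250 gives thresholds of about 290 and 377.
first, second = update_thresholds(250.0)
```

Because the new first reference threshold is a weighted average of the analysis result and the maximum preset value, it can never exceed the maximum preset value as long as the result does not.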
By updating the first reference threshold and the second reference threshold promptly, based on the noise signals already identified, the method adapts to changes in the background noise of the current environment, which makes the identification of noise signals and voice signals more accurate and effective.
For the specific implementation of the other steps of this embodiment, reference may be made to embodiment one; details are not repeated here.
Based on the above voice activation detection method, this embodiment also provides a voice activation detection device. Fig. 4 is a structural diagram of the voice activation detection device provided by embodiment two of the present invention. As shown in Fig. 4, the voice activation detection device provided by this embodiment comprises every unit of the voice activation detection device described in embodiment one and, in addition, an updating unit 111. The updating unit 111 is connected to the second identifying unit 105, the third identifying unit 107 and the first setup unit 102, and is adapted to update the reference thresholds based on the noise frame after the second identifying unit 105 or the third identifying unit 107 judges that the frame is a noise signal. The updating unit 111 updates the reference thresholds based on the noise frame as follows: the maximum preset value of the first reference threshold and the result of the time-frequency analysis are multiplied by their respective preset weighting coefficients and the products are added, and the sum is used to update the first reference threshold; the second reference threshold is then updated based on the updated first reference threshold and the multiple relationship between the second reference threshold and the first reference threshold.
For the specific implementation of the voice activation detection device of this embodiment, reference may be made to the voice activation detection method of this embodiment; details are not repeated here.
Embodiment three
Fig. 5 is a flow chart of the voice activation detection method provided by embodiment three of the present invention. As shown in Fig. 5, this embodiment differs from the voice activation detection methods described in embodiment one and embodiment two in that it realizes the voice activation detection method provided by the invention in a comparatively simple form. Specifically, with reference to Fig. 1 or Fig. 3, the voice activation detection method provided by this embodiment does not need the step of judging whether the first reference threshold is within the preset range (step S103), and therefore does not need the zero-crossing rate calculation or the subsequent related determination steps (steps S105, S106 and S107). In addition, the step of setting the first reference threshold and the second reference threshold of the noise signal before the time-frequency analysis is not needed either; the result of the time-frequency analysis can be compared directly with a pre-stored default first reference threshold and second reference threshold.
The voice activation detection method provided by this embodiment comprises: step S201, dividing the input voice signal into frames; step S202, performing time-frequency analysis on the input voice signal frame by frame; and step S203, comparing the result of the time-frequency analysis with the first reference threshold and the second reference threshold. If the result of the time-frequency analysis is less than or equal to the first reference threshold, step S204 is executed and the frame is judged to be a noise signal. If the result is greater than the first reference threshold and less than the second reference threshold, step S205 is executed: the frame is an undetermined signal and is judged based on the determination result of the next frame. If the result is greater than or equal to the second reference threshold, step S206 is executed and the frame is judged to be a voice signal.
In this embodiment, each frame of the voice signal is 8 ms long. The first reference threshold and the second reference threshold are obtained by extracting and analyzing the first N frames of the input voice signal. The second reference threshold is 1.3 times the first reference threshold. Judging the undetermined signal based on the determination result of the next frame comprises judging the undetermined signal to be of the same signal type as the next frame. The time-frequency analysis comprises: calculating the variance of the frame in the time domain and in the frequency domain respectively, to obtain a value that reflects the overall rate of change of the frame in both domains.
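One possible reading of this time-frequency analysis is sketched below. The text does not say how the two variances are combined or scaled, so several choices here are assumptions made only for illustration: the frequency-domain variance is taken over the magnitude spectrum of a plain discrete Fourier transform, the two variances are simply added, and the function names are hypothetical. The resulting values are therefore not on the same scale as the example thresholds in the text.

```python
import cmath
from statistics import pvariance


def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins 0..n/2 of a real-valued frame (naive O(n^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def time_frequency_score(frame):
    """Variance in the time domain plus variance in the frequency domain.

    Adding the two variances is an assumption; the source only says that the
    result reflects the overall rate of change in both domains.
    """
    time_var = pvariance(frame)
    freq_var = pvariance(magnitude_spectrum(frame))
    return time_var + freq_var
```

An all-zero frame scores zero under this sketch, while a frame that alternates in sign scores higher, which is the qualitative behaviour the comparison with the reference thresholds relies on.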
In addition, the voice activation detection method further comprises: saving the P undetermined frames immediately preceding a determined voice signal, and saving the Q undetermined frames immediately following the determined voice signal.
In other embodiments, after the frame is judged to be a noise signal, the voice activation detection method may further comprise a step of updating the first reference threshold and the second reference threshold based on this noise frame. Specifically, the maximum preset value of the first reference threshold and the result of the time-frequency analysis are multiplied by their respective preset weighting coefficients and the products are added, and the sum is used to update the first reference threshold and the second reference threshold. For this updating step, reference may be made to the related description of the voice activation detection method in embodiment two; details are not repeated here.
Based on the above voice activation detection method, this embodiment also provides a voice activation detection device. Fig. 6 is a structural diagram of the voice activation detection device provided by embodiment three of the present invention. As shown in Fig. 6, the voice activation detection device provided by this embodiment comprises: a framing unit 201, adapted to divide the input voice signal into frames; a time-frequency analysis unit 202, connected to the framing unit 201 and adapted to perform time-frequency analysis on the input voice signal frame by frame, the time-frequency analysis comprising calculating the variance of the frame in the time domain and in the frequency domain respectively to obtain a value that reflects the overall rate of change of the frame in both domains; and an identifying unit 203, connected to the time-frequency analysis unit 202 and adapted to judge that the frame is a noise signal if the result of the time-frequency analysis is less than or equal to the first reference threshold, to treat the frame as an undetermined signal and judge it based on the determination result of the next frame if the result is greater than the first reference threshold and less than the second reference threshold, and to judge that the frame is a voice signal if the result is greater than or equal to the second reference threshold. The identifying unit 203 judges the undetermined signal based on the determination result of the next frame by judging the undetermined signal to be of the same signal type as the next frame.
In this embodiment, the voice activation detection device further comprises a noise prediction unit 204, which is connected to the framing unit 201 and the identifying unit 203 and is adapted to extract and analyze the first N frames of the input voice signal to obtain the first reference threshold and the second reference threshold.
In addition, the voice activation detection device further comprises a storage unit 205, which is connected to the identifying unit 203 and is adapted to save the P undetermined frames immediately preceding a determined voice signal and the Q undetermined frames immediately following the determined voice signal.
In other embodiments, the voice activation detection device may further comprise an updating unit, adapted to update the first reference threshold and the second reference threshold based on the noise frame after the identifying unit 203 judges that the frame is a noise signal. The updating unit updates the first reference threshold and the second reference threshold based on the noise frame as follows: the maximum preset value of the first reference threshold and the result of the time-frequency analysis are multiplied by their respective preset weighting coefficients and the products are added, and the sum is used to update the first reference threshold and the second reference threshold.
For the specific implementation of the voice activation detection device of this embodiment, reference may be made to the related steps of the voice activation detection methods described in this embodiment and in embodiment one; details are not repeated here.
In summary, the voice activation detection method and device provided by the embodiments of the present invention have at least the following beneficial effects:
The input voice signal is divided into frames (with smooth transitions between frames), time-frequency analysis is performed on the input voice signal frame by frame, and the result of the time-frequency analysis is compared with the preset first reference threshold and second reference threshold of the noise signal. Whether a given frame is a voice signal or a noise signal can thus be identified quickly and effectively, which reduces background noise while preserving speech quality.
Further, for several classes of special noise signal, it is judged whether the set first reference threshold is within a preset range, and a different predetermined threshold of the zero-crossing rate is set according to which preset range (that is, which noise signal type) the first reference threshold falls in. The zero-crossing rate of the input voice signal is calculated frame by frame; a frame whose calculated zero-crossing rate does not exceed the predetermined threshold is judged to be a noise signal, and any other frame is checked further by time-frequency analysis. Different noise signals are thereby checked in a targeted manner, which largely avoids false detections and missed detections and makes the identification of noise signals and voice signals more effective.
By continually and promptly updating the first reference threshold and the second reference threshold based on the noise signals already identified, the method adapts to changes in the background noise of the current environment, which makes the identification of noise signals and voice signals more accurate and effective.
In addition, by extracting and analyzing the first N frames of the input voice signal to obtain the first reference threshold and the second reference threshold, reference thresholds of the noise signal suited to the current environment can be set at the very start of a voice call, whatever that environment is. This gives a good prediction of the background noise of the current environment and makes the identification of noise signals more accurate.
Although the present invention is disclosed above by way of preferred embodiments, these are not intended to limit the invention. Any person skilled in the art may, without departing from the spirit and scope of the invention, use the methods and technical content disclosed above to make possible changes and modifications to the technical solution of the invention. Therefore, any simple modification, equivalent variation or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, falls within the scope of protection of the technical solution of the invention.