CN102314884B - Voice-activation detecting method and device - Google Patents

Publication number
CN102314884B
Authority
CN
China
Prior art keywords
reference threshold
frame
signal
voice
voice signal
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2011102352285A
Other languages
Chinese (zh)
Other versions
CN102314884A (en)
Inventor
吴飞飞
栗红霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING ZED-3 TECHNOLOGY CO., LTD.
Original Assignee
SHANGHAI GENER INFORMATION TECHNOLOGY Co Ltd
Czech Surway Technology (beijing) Co Ltd
Application filed by SHANGHAI GENER INFORMATION TECHNOLOGY Co Ltd and Czech Surway Technology (beijing) Co Ltd
Priority to CN2011102352285A
Publication of CN102314884A
Application granted
Publication of CN102314884B

Abstract

The invention relates to a voice-activation detecting method and device, wherein the voice-activation detecting method comprises the following steps of: carrying out frame separation for an input sound signal; carrying out time-frequency analysis for the input sound signal by using a frame as a unit; if a result after the time-frequency analysis is smaller than or equal to a first reference threshold value, then judging the frame as a noise signal; if the result after the time-frequency analysis is larger than the first reference threshold value and smaller than a second reference threshold value, then judging the frame as an undetermined signal, and judging the undetermined signal on the basis of the judging result of a next-frame sound signal; and if the result after the time-frequency analysis is larger than or equal to the second reference threshold value, then judging the frame as a voice signal, wherein the second reference threshold value and the first reference threshold value have a multiple relationship. Through the technical scheme, the voice signal and the noise signal in the input sound signal can be rapidly and effectively identified, and the background noise is reduced while the conversation quality is ensured.

Description

Voice-activation detecting method and device
Technical field
The present invention relates to the technical field of audio signal processing, and in particular to a voice-activation detecting method and device.
Background technology
Voice activity detection (VAD) is a technique that uses specific decision rules to identify the pauses and silent intervals that occur in speech, and thereby to detect the valid voice portions. It is typically used to encode different classes of speech segments with different numbers of bits while preserving voice quality, which lowers the coding rate of the speech. In a duplex communication system each party is active only about 35% of the time, so lowering the coding rate during silent periods has a positive effect on transmission bandwidth, power and capacity. VAD is therefore of significant practical value in voice communication.
In IP (Internet Protocol) based voice conferencing, echo cancellation and noise removal are generally performed by the terminal. Some terminals do not perform this processing, however, so the echo and noise in the conference become very loud and seriously degrade conference quality. To accommodate terminals of varying quality, the voice server (for example, a multimedia dispatcher) needs to process the echo and noise introduced by the terminals so that the conference quality reaches a usable level. VAD can distinguish the voice signal from the noise signal in the transmitted sound signal, and the noise signal can then be removed to avoid transmitting useless signal and to improve voice quality. VAD has been studied extensively, for example:
(1) "A VAD algorithm based on third-order cumulants", Wang Fan, Beijing University of Posts and Telecommunications. This algorithm can detect speech buried in noise, but because noise signals and unvoiced signals follow fairly similar distributions, voice quality drops after the algorithm is applied; this is a shortcoming that third-order cumulant theory cannot overcome.
(2) "A VAD algorithm based on higher-order cyclic cumulants", Zhu Xiaoliang, Huazhong University of Science and Technology. This algorithm models the speech signal with an MA (moving average) model and estimates the cycle frequency with the average magnitude difference function (AMDF) to reduce algorithmic complexity. It adapts well to Gaussian (white or coloured) noise and other stationary noise and detects well, but its handling of complex background noise is not ideal.
Therefore, many current methods reduce background noise at the cost of voice quality and handle complex background noise poorly; there is as yet no method that minimizes noise while preserving voice quality.
For further VAD-related art, see also the Chinese patent application with publication number CN 101320559A, which discloses a sound activation detection device and method.
Summary of the invention
The problem solved by the present invention is to provide a voice-activation detecting method and device that can quickly and effectively identify the voice signal and the noise signal in an input sound signal, so that background noise is reduced while voice quality is preserved.
To solve the above problem, the technical scheme of the present invention provides a voice-activation detecting method, comprising:
dividing the input sound signal into frames;
performing time-frequency analysis on the input sound signal frame by frame;
if the result of the time-frequency analysis is less than or equal to a first reference threshold, judging the frame to be a noise signal; if the result is greater than the first reference threshold and less than a second reference threshold, treating the frame as an undetermined signal and judging it based on the judgment result of the next frame of the sound signal; if the result is greater than or equal to the second reference threshold, judging the frame to be a voice signal; wherein the second reference threshold and the first reference threshold have a multiple relationship.
Optionally, the first reference threshold and the second reference threshold are obtained by extracting and analyzing the first N frames of the input sound signal.
Optionally, judging the undetermined signal based on the judgment result of the next frame comprises: judging the undetermined signal to be of the same signal type as the next frame of the sound signal.
Optionally, after the frame is judged to be a noise signal, the method further comprises updating the first reference threshold and the second reference threshold based on this noise frame.
Optionally, the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value; updating the first reference threshold and the second reference threshold based on this noise frame comprises: multiplying the maximum preset value and the result of the time-frequency analysis by respective preset weighting coefficients and updating the first reference threshold with the sum of the two products; and updating the second reference threshold based on the multiple relationship between the two thresholds and on the updated first reference threshold.
Optionally, the voice-activation detecting method further comprises: preserving the P frames of undetermined signal that immediately precede a frame judged to be a voice signal, and preserving the Q frames of undetermined signal that immediately follow it.
Optionally, the second reference threshold is 1.3 times the first reference threshold.
Optionally, the length of each frame of the sound signal is 8 ms.
Optionally, the time-frequency analysis comprises: computing the variance of the frame in the time domain and in the frequency domain respectively, to obtain a value reflecting its combined rate of change in the time and frequency domains.
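The decision rule of this method can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function names `classify_frame` and `resolve_undetermined` and the string labels are introduced here for illustration, the default `ratio=1.3` is the optional multiple given above, and treating an undetermined frame that has no following frame as noise is an assumption.

```python
def classify_frame(result, t1, ratio=1.3):
    """Three-way decision on the time-frequency result of one frame."""
    t2 = ratio * t1  # the second threshold is a fixed multiple of the first
    if result <= t1:
        return "noise"
    if result < t2:
        return "undetermined"
    return "voice"


def resolve_undetermined(labels, tail_label="noise"):
    """Give each undetermined frame the signal type of the frame after it."""
    resolved = list(labels)
    next_label = tail_label  # assumption: no next frame means noise
    # Walk backwards so that a run of undetermined frames inherits the
    # type of the first decided frame that follows the run.
    for i in range(len(resolved) - 1, -1, -1):
        if resolved[i] == "undetermined":
            resolved[i] = next_label
        next_label = resolved[i]
    return resolved


results = [200, 300, 400, 320, 250]  # made-up time-frequency results
labels = [classify_frame(r, t1=280) for r in results]
final = resolve_undetermined(labels)
```

With a first threshold of 280, the second and fourth frames fall between the two thresholds; the second takes the type of the voice frame after it and the fourth takes the type of the noise frame after it.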
To solve the above problem, the technical scheme of the present invention further provides a voice-activation detecting method, comprising:
dividing the input sound signal into frames;
setting a first reference threshold and a second reference threshold of the noise signal, the second reference threshold and the first reference threshold having a multiple relationship;
judging whether the first reference threshold is within a preset range; if not, performing time-frequency analysis on the input sound signal frame by frame; if so, computing the zero-crossing rate of the input sound signal frame by frame and, if the computed zero-crossing rate is greater than a preset threshold, performing the time-frequency analysis, otherwise judging the frame to be a noise signal;
if the result of the time-frequency analysis is less than or equal to the first reference threshold, judging the frame to be a noise signal; if the result is greater than the first reference threshold and less than the second reference threshold, treating the frame as an undetermined signal and judging it based on the judgment result of the next frame of the sound signal; if the result is greater than or equal to the second reference threshold, judging the frame to be a voice signal.
Optionally, the voice-activation detecting method further comprises setting the preset threshold of the zero-crossing rate based on the first reference threshold.
Optionally, the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value; the preset range comprises a first preset range and a second preset range, the first preset range being related to the maximum preset value and the second preset range being related to the minimum preset value and a middle preset value, the middle preset value being greater than the minimum preset value and less than the maximum preset value; setting the preset threshold of the zero-crossing rate based on the first reference threshold comprises: if the first reference threshold is within the first preset range, setting the preset threshold of the zero-crossing rate to a first preset threshold; if the first reference threshold is within the second preset range, setting the preset threshold of the zero-crossing rate to a second preset threshold.
Optionally, the first reference threshold and the second reference threshold are obtained by extracting and analyzing the first N frames of the input sound signal.
Optionally, judging the undetermined signal based on the judgment result of the next frame comprises: judging the undetermined signal to be of the same signal type as the next frame of the sound signal.
Optionally, after the frame is judged to be a noise signal, the voice-activation detecting method further comprises updating the first reference threshold and the second reference threshold based on this noise frame.
Optionally, the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value; updating the first reference threshold and the second reference threshold based on this noise frame comprises: multiplying the maximum preset value and the result of the time-frequency analysis by respective preset weighting coefficients and updating the first reference threshold with the sum of the two products; and updating the second reference threshold based on the multiple relationship between the two thresholds and on the updated first reference threshold.
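The optional threshold update can be illustrated with a short Python sketch. The values 240 and 350 are the minimum and maximum preset values given for the first reference threshold in Embodiment 1, and 1.3 is the multiple given there. The weights `w_max` and `w_result` are hypothetical, since the text only states that preset weighting coefficients are used, and clamping the updated value to the preset bounds is an assumed way of keeping the first reference threshold inside its stated range.

```python
MIN_PRESET = 240.0   # minimum preset value of the first reference threshold
MAX_PRESET = 350.0   # maximum preset value of the first reference threshold
RATIO = 1.3          # multiple relating the second threshold to the first


def update_thresholds(result, w_max=0.5, w_result=0.5):
    """Recompute both thresholds after a frame has been judged to be noise.

    result is the time-frequency result of that noise frame. w_max and
    w_result are hypothetical weights chosen for this sketch.
    """
    # Weighted sum of the maximum preset value and the frame's result.
    t1 = w_max * MAX_PRESET + w_result * result
    # Assumption: clamp so the first threshold stays within its bounds.
    t1 = min(max(t1, MIN_PRESET), MAX_PRESET)
    # The second threshold follows from the multiple relationship.
    return t1, RATIO * t1


t1, t2 = update_thresholds(250.0)
```

Because the second threshold is always derived from the first, only one value has to be tracked as the background noise changes.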
To solve the above problem, the technical scheme of the present invention further provides a voice-activation detecting device, comprising:
a framing unit, adapted to divide the input sound signal into frames;
a time-frequency analysis unit, adapted to perform time-frequency analysis on the input sound signal frame by frame;
a judging unit, adapted to judge the frame to be a noise signal if the result of the time-frequency analysis is less than or equal to a first reference threshold; to treat the frame as an undetermined signal, and judge it based on the judgment result of the next frame of the sound signal, if the result is greater than the first reference threshold and less than a second reference threshold; and to judge the frame to be a voice signal if the result is greater than or equal to the second reference threshold; wherein the second reference threshold and the first reference threshold have a multiple relationship.
Optionally, the voice-activation detecting device further comprises a noise prediction unit, adapted to extract and analyze the first N frames of the input sound signal to obtain the first reference threshold and the second reference threshold.
Optionally, the voice-activation detecting device further comprises an updating unit, adapted to update the first reference threshold and the second reference threshold based on the noise frame after the judging unit judges the frame to be a noise signal.
Optionally, the voice-activation detecting device further comprises a storage unit, adapted to preserve the P frames of undetermined signal that immediately precede a frame judged to be a voice signal and the Q frames of undetermined signal that immediately follow it.
To solve the above problem, the technical scheme of the present invention further provides a voice-activation detecting device, comprising:
a framing unit, adapted to divide the input sound signal into frames;
a first setting unit, adapted to set a first reference threshold and a second reference threshold of the noise signal, the second reference threshold and the first reference threshold having a multiple relationship;
a first judging unit, adapted to judge whether the first reference threshold is within a preset range;
a zero-crossing rate computing unit, adapted to compute the zero-crossing rate of the input sound signal frame by frame when the first reference threshold is judged to be within the preset range;
a second judging unit, adapted to judge whether the computed zero-crossing rate is greater than a preset threshold, and to judge the frame to be a noise signal if it is not;
a time-frequency analysis unit, adapted to perform time-frequency analysis on the input sound signal frame by frame when the first judging unit judges the first reference threshold to be outside the preset range, or when the second judging unit judges the computed zero-crossing rate to be greater than the preset threshold;
a third judging unit, adapted to judge the frame to be a noise signal if the result of the time-frequency analysis is less than or equal to the first reference threshold; to treat the frame as an undetermined signal, and judge it based on the judgment result of the next frame of the sound signal, if the result is greater than the first reference threshold and less than the second reference threshold; and to judge the frame to be a voice signal if the result is greater than or equal to the second reference threshold.
Compared with the prior art, this technical scheme has the following advantages:
The input sound signal is divided into frames (with smooth transitions between frames), time-frequency analysis is performed frame by frame, and the result is compared with the preset first and second reference thresholds of the noise signal. Whether a given frame is a voice signal or a noise signal can thus be identified quickly and effectively, so that background noise is reduced while voice quality is preserved.
Whether the set first reference threshold lies within a preset range is judged, and a different zero-crossing rate threshold is set according to which preset range (that is, which type of noise signal) the first reference threshold falls in. The zero-crossing rate of the input sound signal is computed frame by frame; a frame whose zero-crossing rate does not exceed the preset threshold is judged to be a noise signal, and otherwise the frame is checked further by time-frequency analysis. Different noise signals are thereby checked in a targeted way, which largely avoids false detections and missed detections and makes the identification of noise signals and voice signals more effective.
The first and second reference thresholds are continually updated, in good time, from the noise signals already identified. The method thus adapts to changes in the background noise of the current environment, which makes the identification of noise signals and voice signals more accurate and effective.
In addition, because the first and second reference thresholds are obtained by extracting and analyzing the first N frames of the input sound signal, reference thresholds suited to the noise of the current environment can be set at the very start of a voice call, whatever the environment. The background noise of the current environment is thereby predicted well, and noise signals are identified more accurately.
Description of drawings
Fig. 1 is a schematic flowchart of the voice-activation detecting method provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of the voice-activation detecting device provided by Embodiment 1 of the present invention;
Fig. 3 is a schematic flowchart of the voice-activation detecting method provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic structural diagram of the voice-activation detecting device provided by Embodiment 2 of the present invention;
Fig. 5 is a schematic flowchart of the voice-activation detecting method provided by Embodiment 3 of the present invention;
Fig. 6 is a schematic structural diagram of the voice-activation detecting device provided by Embodiment 3 of the present invention.
Embodiment
As stated in the background, many prior-art methods reduce background noise at the cost of voice quality and handle complex background noise poorly. In this technical scheme, various simulation tools were used to find the differences in characteristics between voice signals and noise signals. Voice smoothing (framing), time-domain zero-crossing rate calculation, time-domain variance calculation and frequency-domain variance calculation are then combined to obtain a value reflecting the combined rate of change of the input sound signal in the time and frequency domains, and VAD is performed with a method that adapts to the background noise. The voice signal and the noise signal in the input sound signal can thus be identified quickly and effectively, so that the true quality of the voice is restored while the noise signal is removed.
To make the above objects, features and advantages of the present invention clearer, specific embodiments are described in detail below with reference to the drawings. Specific details are set forth in the following description to provide a full understanding of the invention. The invention can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its spirit. The invention is therefore not limited by the specific embodiments disclosed below.
Embodiment one
Fig. 1 is a schematic flowchart of the voice-activation detecting method provided by Embodiment 1 of the present invention. As shown in Fig. 1, the voice-activation detecting method comprises the following steps:
First, step S101 is performed: the input sound signal is divided into frames.
As those skilled in the art know, the purpose of speech signal analysis is to extract and represent, conveniently and effectively, the information carried by the speech signal. It is the prerequisite and basis of speech signal processing: only after parameters that represent the characteristics of the speech signal have been obtained can they be used for efficient voice communication, speech synthesis, speech recognition and similar processing. Speech is generally divided into silent segments, unvoiced segments and voiced segments. Voiced sound is generally regarded as a train of skewed triangular pulses whose period is the pitch period, and unvoiced sound is modelled as random white noise. Because a speech signal is a non-stationary process, signal processing techniques intended for stationary signals cannot be applied to it directly. Over a short time range (for example 10 to 30 ms, or even shorter), however, its characteristics can be regarded as quasi-stationary; that is, the speech signal is short-time stationary. Using this property, techniques for stationary signals can be brought into the short-time processing of speech. For example, windowed framing can be used to divide the input sound signal (which comprises voice signal and noise signal) into multiple frames; each short-time frame is also called an analysis frame (frame for short). Framing intercepts the input sound signal with a window function of finite length to form an analysis frame; the window function sets the samples outside the region to be processed to zero to obtain the current analysis frame. Although framing can use contiguous segmentation of the input sound signal, overlapping segmentation is generally used: the previous frame and the next frame share a common overlapping part, which is called the frame shift, so that the transition between frames is smooth and continuity is maintained. The ratio of the frame shift to the frame length (the length of one frame of the sound signal) is generally taken as 0 to 1/2. In this embodiment, each frame is 8 ms long, and the zero-crossing rate calculation and the prediction and estimation of background noise in the subsequent steps are all computed on 8 ms of data. Windowed framing of the input sound signal is a conventional technique in this field and is not described further here.
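Overlapping segmentation of this kind can be sketched in Python as below. The sketch rests on several assumptions that are not in the text: a sampling rate of 8 kHz (so that an 8 ms frame holds 64 samples), a rectangular window with no tapering, an overlap of 16 samples, and discarding of any trailing samples that do not fill a whole frame. The name `split_into_frames` is illustrative, and its `overlap` parameter is the shared part between adjacent frames described above.

```python
def split_into_frames(samples, frame_len, overlap):
    """Split samples into frames of frame_len that share `overlap` samples."""
    if not 0 <= overlap <= frame_len // 2:
        raise ValueError("overlap must be between 0 and half the frame length")
    hop = frame_len - overlap  # distance between consecutive frame starts
    frames = []
    # Stop at the last start position that still yields a full frame.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


SAMPLE_RATE = 8000                    # assumed sampling rate in Hz
FRAME_LEN = SAMPLE_RATE * 8 // 1000   # 8 ms of samples at that rate
frames = split_into_frames(list(range(160)), FRAME_LEN, overlap=16)
```

With `overlap=0` the same function performs contiguous segmentation, which is the other option the description mentions.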
Step S102 is performed: the reference thresholds of the noise signal are set, comprising a first reference threshold and a second reference threshold. To identify the voice signal and the noise signal in a sound signal, the differences in characteristics between noise and voice must be analyzed, and in particular the various types of noise signal must be analyzed. This requires a large number of experiments in advance, in which each class of noise signal is analyzed and its characteristic parameters are extracted. A common approach is to apply time-domain and frequency-domain analysis to the noise signal to obtain a value reflecting its combined rate of change in the time and frequency domains, and from the statistics to derive the range of reference thresholds that quickly and effectively separates noise signals from voice signals. Thus, once the input sound signal has been divided into frames in step S101, each frame can be analyzed in the subsequent steps and the result compared with the reference thresholds, so that the frame is determined, according to the comparison result, to be a noise signal, a voice signal, or a signal that remains to be judged further. The specific judgment process is described in detail in the steps below.
Note that the reference thresholds comprise a first reference threshold and a second reference threshold: the first is used mainly to identify noise signals and the second mainly to identify voice signals. Statistics from a large number of experiments show that the second reference threshold is a certain multiple of the first, so once the first reference threshold is determined, the second can be determined as well. In this embodiment the second reference threshold is 1.3 times the first; the factor 1.3 was obtained statistically from a large number of experiments on a variety of background noises.
In addition, a maximum preset value and a minimum preset value are provided when the first reference threshold is set. The value range of the first reference threshold, expressed as an interval, is [minimum preset value, maximum preset value]; that is, the first reference threshold is greater than or equal to the minimum preset value and less than or equal to the maximum preset value. A middle preset value can also be set between the minimum and maximum preset values; its value range, expressed as an interval, is (minimum preset value, maximum preset value), that is, the middle preset value is greater than the minimum preset value and less than the maximum preset value. The choice of the maximum and minimum preset values also affects the final judgment result, so they should be set according to actual conditions when the first reference threshold is set. In this implementation, the maximum preset value of the first reference threshold is 350, the minimum preset value is 240 and the middle preset value is 280.
In this embodiment, the reference thresholds (the first and the second) are obtained by extracting and analyzing the first N frames of the input sound signal. In general, the larger N is (that is, the more frames are collected), the better the background noise of the current environment is predicted at the start of a voice call. If more frames are collected, however, the analysis takes longer, determining the reference thresholds takes a certain amount of time, and the reference thresholds of the noise signal cannot be set promptly. In a specific implementation, the value of N can therefore be chosen according to actual conditions. Obtaining the reference thresholds in this way means that, at the start of a voice call, reference thresholds adapted to the noise of the current environment are set, which predicts the background noise of the current environment well and makes the identification of noise signals more accurate.
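One possible way to derive the two thresholds from the first N frames is sketched below in Python. The text does not say how the N frames are analyzed, so averaging their time-frequency results is purely an assumption made for this sketch, as are the function name `initial_thresholds` and the clamping step; the constants 240, 350 and 1.3 are the values stated for this embodiment.

```python
MIN_PRESET = 240.0   # minimum preset value of the first reference threshold
MAX_PRESET = 350.0   # maximum preset value of the first reference threshold
RATIO = 1.3          # second threshold = 1.3 x first threshold


def initial_thresholds(results, n):
    """Estimate both thresholds from the results of the first n frames.

    results holds one time-frequency result per frame. Averaging them is
    an assumption; the source only says the frames are analyzed.
    """
    head = results[:n]
    if not head:
        raise ValueError("need at least one frame")
    t1 = sum(head) / len(head)
    # Keep the first threshold within [MIN_PRESET, MAX_PRESET].
    t1 = min(max(t1, MIN_PRESET), MAX_PRESET)
    return t1, RATIO * t1


t1, t2 = initial_thresholds([260.0, 300.0, 280.0, 900.0], n=3)
```

A larger `n` uses more of the call's opening frames, which matches the trade-off described above between prediction quality and the delay before the thresholds are available.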
In other embodiments, suitable reference thresholds can also be selected in advance according to actual conditions; for example, the reference thresholds can be set manually before the voice call. Alternatively, default reference thresholds that have already been configured can be used.
Step S103 is performed: whether the first reference threshold is within the preset range is judged. As described above, the first reference threshold is used mainly to identify noise signals. A few special classes of noise signal, however, resemble voice signals in some characteristics, so it may be difficult to tell from the reference thresholds whether a given frame is a noise signal or a voice signal. In other words, comparison with the reference thresholds alone may not determine noise signals accurately, and false detections and missed detections may result. Because different noise signals have a variety of characteristics, other characteristics of these special classes of noise can be exploited, for example the fact that noise signals of different magnitudes differ in their rate of change and amplitude statistics. A corresponding method is then used to make a preliminary judgment on the sound signal, so that part of the background noise (the special classes of noise signal) can be identified effectively.
If step S103 finds that the first reference threshold is outside the preset range, step S104 is executed: time-frequency analysis is performed on the input sound signal frame by frame. The time-frequency analysis comprises time-domain analysis and frequency-domain analysis. Specifically, the variance of a frame is computed in the time domain and in the frequency domain, yielding a value that reflects the frame's combined rate of change over both domains. The time-domain variance is obtained by applying the variance formula to the frame after framing and smoothing. For the frequency domain, a Fast Fourier Transform (FFT) is first applied to the smoothed frame, the variance of the resulting transform is computed, and the modulus of that value is taken as the rate of change in the complex frequency domain. The time-domain variance and the frequency-domain variance are then each multiplied by a weighting coefficient (the two coefficients sum to 1) and added. The final value reflects the combined rate of change of the frame over the time and frequency domains and is the result of the time-frequency analysis. The methods of time-domain and frequency-domain analysis are conventional in the art and are not described further here.
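The time-frequency analysis of step S104 can be sketched in Python as follows. This is only one reading of the description, not the patented implementation: the naive `dft` stands in for an FFT routine, the frequency-domain variance is computed on the complex transform without conjugation before its modulus is taken, and the weight `w_time` (with `1 - w_time` for the frequency term) is a hypothetical parameter, since the text only requires the two weights to sum to 1.

```python
import cmath
import math


def variance(values):
    """Population variance; also accepts complex values (non-conjugated form)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def dft(frame):
    """Naive discrete Fourier transform, a stdlib stand-in for an FFT routine."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def time_frequency_result(frame, w_time=0.5):
    """Weighted sum of the time-domain variance and the modulus of the
    frequency-domain variance; the two weights sum to 1."""
    if not 0.0 <= w_time <= 1.0:
        raise ValueError("w_time must lie in [0, 1]")
    time_var = variance(frame)
    freq_var = abs(variance(dft(frame)))  # modulus of the complex variance
    return w_time * time_var + (1.0 - w_time) * freq_var


# Illustrative 64-sample frames (the frame length is an assumption).
silence = [0.0] * 64
tone = [1000.0 * math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
```

A silent frame gives a result of zero, while the tone gives a strictly larger value, so the result grows with how strongly the frame varies in either domain.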
If step S103 finds that the first reference threshold is within the preset range, step S105 is executed: the zero-crossing rate of the input sound signal is calculated frame by frame. Zero-crossing rate calculation is also a commonly used time-domain analysis method for sound signals. As those skilled in the art know, the zero-crossing rate (short-time zero-crossing rate) is the number of times the waveform of one frame crosses the horizontal axis (zero level), and it reflects the spectral characteristics of the signal. For a continuous signal, a zero crossing means that the time-domain waveform passes through the time axis; for a discrete signal, a zero crossing occurs when adjacent samples change sign. The zero-crossing rate is thus the number of times the samples change sign. The zero-crossing rates of unvoiced and voiced sounds are each roughly Gaussian distributed and generally differ considerably, although the zero-crossing rate alone cannot fully separate unvoiced from voiced sounds. Because the zero-crossing rates of the special classes of noise signal considered in this embodiment differ markedly from those of voice signals, noise can be determined by comparing the calculated zero-crossing rate with a preset threshold. Specifically, after the zero-crossing rate is calculated in step S105, step S106 is executed to judge whether the calculated zero-crossing rate is greater than the preset threshold. If it is, step S104 is executed and time-frequency analysis is performed on the input sound signal frame by frame; otherwise, step S107 is executed and the frame is judged to be a noise signal.
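For a discrete frame, steps S105 to S107 reduce to counting sign changes and comparing the count with the preset threshold. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, and a sample equal to zero is treated as non-negative, a convention the text does not specify.

```python
def zero_crossing_count(frame):
    """Count sign changes between adjacent samples of one frame (step S105)."""
    return sum(
        1
        for prev, cur in zip(frame, frame[1:])
        if (prev >= 0) != (cur >= 0)  # assumption: zero counts as non-negative
    )


def prescreen_is_noise(frame, zcr_threshold):
    """Step S106: a count that does not exceed the threshold means noise (S107).

    A return value of False means the frame proceeds to time-frequency
    analysis (step S104).
    """
    return zero_crossing_count(frame) <= zcr_threshold


slow = [1, 1, -1, -1, 1, 1]   # 2 sign changes
fast = [1, -1] * 20           # 39 sign changes
```

With a threshold of 19, `slow` is screened out as noise while `fast` is passed on for time-frequency analysis.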
It should be noted that the choice of the preset threshold is very important when the zero-crossing rate is used: a value that is too small produces false detections, and a value that is too large produces missed detections. In this embodiment, the preset threshold of the zero-crossing rate is therefore set based on the first reference threshold, so that a suitable value is obtained. Specifically, the preset range comprises a first preset range and a second preset range. The first preset range is related to the maximum preset value of the first reference threshold, and the second preset range is related to the minimum preset value and the middle preset value of the first reference threshold. Setting the preset threshold of the zero-crossing rate based on the first reference threshold comprises: if the first reference threshold is within the first preset range, setting the preset threshold of the zero-crossing rate to a first preset threshold; if the first reference threshold is within the second preset range, setting the preset threshold of the zero-crossing rate to a second preset threshold. It should also be noted that the first preset threshold and the second preset threshold depend on the type of noise signal. As described above, several special classes of noise signal can be judged relatively easily from the calculated zero-crossing rate, but among these classes the criterion for judging noise (the preset threshold of the zero-crossing rate) differs with the type of noise signal. For example, suppose there are two special classes of noise signal. For the first class, the calculated zero-crossing rate is generally less than or equal to 19, so 19 can serve as the criterion for this class. For the second class, using 19 as the criterion may leave noise undetected, because many frames whose calculated zero-crossing rate is greater than 19 and less than or equal to 28 are in fact noise; 28 is therefore the more appropriate criterion for the second class. Conversely, using 28 as the criterion for the first class may cause frames to be wrongly judged as noise. Consequently, a different preset range for the first reference threshold indicates a different type of noise in the current sound signal, and the preset threshold of the zero-crossing rate is set differently in each case.
In a specific implementation, the first preset range is the range above the maximum preset value of the first reference threshold, that is, greater than 350. When the first reference threshold is within the first preset range, the preset threshold of the zero-crossing rate is set to the first preset threshold, which is 28. The second preset range lies between the minimum preset value and the middle preset value of the first reference threshold, that is, 240~280. When the first reference threshold is within the second preset range, the preset threshold of the zero-crossing rate is set to the second preset threshold, which is 19. For example, if step S103 finds that the first reference threshold is 360, this value is greater than 350 and the first reference threshold is within the first preset range. The frame may then be a special noise signal, and its zero-crossing rate must be calculated to determine whether it is noise. The preset threshold of the zero-crossing rate is set to 28 in this case, and if the calculated zero-crossing rate is less than or equal to 28, the frame is determined to be a noise signal. Similarly, if step S103 finds that the first reference threshold is 260, this value lies between 240 and 280 and the first reference threshold is within the second preset range. The frame may again be a special noise signal, and its zero-crossing rate must be calculated to determine whether it is noise. The preset threshold of the zero-crossing rate is set to 19 in this case, and if the calculated zero-crossing rate is less than or equal to 19, the frame is determined to be a noise signal. If step S103 finds that the first reference threshold is 300, the first reference threshold is outside the preset range. The preset threshold of the zero-crossing rate would then generally be set to 1, which means a frame could almost never be judged to be noise. In practice, therefore, the zero-crossing rate is not calculated at all in this case; step S104 is executed directly and time-frequency analysis is performed on the input sound signal frame by frame.
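The mapping from the first reference threshold to the zero-crossing preset threshold can be summarized in a short Python sketch using the example values 350, 240~280, 28 and 19 given above. The function name, the constant names, the inclusive treatment of the 240~280 boundaries and the use of `None` to mean "skip the zero-crossing calculation and go straight to step S104" are assumptions made for illustration.

```python
MAX_PRESET = 350  # maximum preset value of the first reference threshold
MID_PRESET = 280  # middle preset value
MIN_PRESET = 240  # minimum preset value


def zcr_threshold_for(first_reference_threshold):
    """Return the zero-crossing preset threshold, or None if step S105 is skipped."""
    if first_reference_threshold > MAX_PRESET:
        return 28  # first preset range -> first preset threshold
    if MIN_PRESET <= first_reference_threshold <= MID_PRESET:
        return 19  # second preset range -> second preset threshold (bounds assumed inclusive)
    return None    # outside the preset range: go directly to time-frequency analysis


# The three worked cases from the text: 360, 260 and 300.
cases = {t: zcr_threshold_for(t) for t in (360, 260, 300)}
```

Returning `None` rather than the nominal value 1 mirrors the remark that the zero-crossing rate is simply not calculated when the first reference threshold is outside the preset range.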
After the result of the time-frequency analysis is obtained in step S104, step S108 is executed to compare the result with the reference thresholds. Specifically, if the result of the time-frequency analysis is less than or equal to the first reference threshold, step S109 is executed and the frame is judged to be a noise signal. If the result is greater than the first reference threshold and less than the second reference threshold, step S111 is executed: the frame is an undetermined signal and is judged based on the judgment result of the next frame of the sound signal. If the result is greater than or equal to the second reference threshold, step S110 is executed and the frame is judged to be a voice signal.
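The three-way comparison of step S108 can be written as a small Python function. The label strings and the function name are hypothetical; the boundary handling (less than or equal to the first threshold is noise, greater than or equal to the second is voice) follows the text. The example thresholds 260 and 338 are illustrative values related by the factor 1.3 used in this document.

```python
def classify_frame(tf_result, first_threshold, second_threshold):
    """Map a time-frequency result to 'noise', 'undetermined' or 'voice'."""
    if tf_result <= first_threshold:
        return "noise"          # step S109
    if tf_result >= second_threshold:
        return "voice"          # step S110
    return "undetermined"       # step S111: decided by a later frame


labels = [classify_frame(r, 260.0, 338.0) for r in (250.0, 260.0, 300.0, 338.0)]
```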
Judging the undetermined signal based on the judgment result of the next frame in step S111 comprises: judging the undetermined signal to be of the same signal type as the next frame. Specifically, if the next frame is judged to be a voice signal, the undetermined signal is judged to be a voice signal; if the next frame is judged to be a noise signal, the undetermined signal is judged to be a noise signal; if the next frame is itself judged to be an undetermined signal, the decision is deferred again to the judgment result of the frame that follows it. For example, if frame 1 is determined to be a noise signal, it is discarded directly. If frame 2 is judged to be an undetermined signal, it is stored temporarily in a buffer to await the judgment result of frame 3. If frame 3 is judged to be a voice signal, frame 2 (the undetermined signal) is judged to be a voice signal. If frame 3 is still judged to be an undetermined signal, the judgment result of frame 4 is awaited; if frame 4 is also undetermined, the result of frame 5 is awaited, and so on until a subsequent frame is determined to be a noise signal or a voice signal. Thus, if frames 1 to n are all judged to be undetermined signals and frame n+1 is judged to be a noise signal, frames 1 to n are all judged to be noise signals; if frame n+1 is judged to be a voice signal, frames 1 to n are all judged to be voice signals.
Of course, the buffer has limited capacity and cannot hold too many undetermined frames, and the real-time requirement of sound signal processing means there is no need to keep undetermined frames from long ago. Therefore, generally only a predetermined number of undetermined frames are stored in the buffer to await further judgment based on the results for the following frames. When the number of undetermined frames in the buffer exceeds the predetermined number, the frame stored earliest is discarded; that is, undetermined frames are stored on a first-in, first-out basis. For example, with a predetermined number of 8, suppose frames 1 to 8 are all judged to be undetermined signals, so that all 8 frames are kept in the buffer. If frame 9 is judged to be a voice signal, frames 1 to 8 are all voice signals and frame 1 can serve as the start of this speech segment; if frame 9 is instead judged to be an undetermined signal, frame 1 (an undetermined frame) is discarded. Likewise, if the 10 frames following a certain voice frame are all judged to be undetermined signals, the 1st and 2nd frames after that voice frame are discarded. If the 11th frame is a noise signal, the 8 stored undetermined frames are judged to be noise signals (in practice, to keep the speech natural and the transition smooth, these 8 undetermined frames need not be discarded and may be output after speech processing), and that voice frame can serve as the end of the speech segment.
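The deferred judgment with a bounded first-in, first-out buffer can be sketched in Python as below. It is a simplified illustration: it works on a list of per-frame labels rather than on audio, the label strings and the name `resolve_undetermined` are hypothetical, and a frame evicted from a full buffer is marked "dropped" here, whereas a real implementation might still output it, as the parenthetical remark above suggests.

```python
from collections import deque


def resolve_undetermined(labels, capacity=8):
    """Give each buffered 'undetermined' frame the type of the next decided frame.

    labels: per-frame labels, each 'noise', 'voice' or 'undetermined'.
    capacity: maximum number of undetermined frames kept in the buffer.
    """
    resolved = list(labels)
    pending = deque()  # indices of undetermined frames, oldest first
    for i, label in enumerate(labels):
        if label == "undetermined":
            if len(pending) == capacity:
                # Buffer full: the oldest undetermined frame is discarded (FIFO).
                resolved[pending.popleft()] = "dropped"
            pending.append(i)
        else:
            # A decided frame settles every frame still waiting in the buffer.
            while pending:
                resolved[pending.popleft()] = label
    return resolved  # frames still pending at the end stay 'undetermined'


# Nine undetermined frames followed by a voice frame, buffer capacity 8.
example = resolve_undetermined(["undetermined"] * 9 + ["voice"])
```

In `example`, the first frame is evicted when the ninth undetermined frame arrives, and the remaining eight take the type of the tenth frame.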
In this embodiment, the voice-activation detecting method further comprises: storing the P frames of undetermined signal immediately preceding the determined voice signal and the Q frames of undetermined signal immediately following it. When the determined voice signal is processed and output, these preceding P frames and following Q frames of undetermined signal can also be processed and output, which preserves the naturalness of the speech and the smoothness of the transition. It should be noted that P and Q are the predefined maximum numbers of undetermined frames kept in the buffer. In practice, the number of undetermined frames actually stored may be smaller than P or Q. For example, with P=8 and Q=5, if frames 1 to 3 are judged to be undetermined signals and the frames that follow are all voice signals, only 3 undetermined frames are actually stored in the buffer. Similarly, if the 4 frames after a certain voice frame are all judged to be undetermined signals and the 5th frame after that voice frame is a noise signal or a voice signal, only 4 undetermined frames are actually stored in the buffer. In this embodiment, P=Q=3; the values of P and Q can of course be adjusted as required.
In particular, the above voice-activation detecting method based on adaptive background noise can be applied on a voice conference server for echo cancellation and noise removal. In a voice conference, once each input sound signal has been processed by this method, the echo and noise introduced by the terminals can be removed effectively.
Based on the above voice-activation detecting method, this embodiment also provides a voice-activation detecting device. Fig. 2 is a schematic structural diagram of the voice-activation detecting device provided by Embodiment 1 of the present invention. As shown in Fig. 2, the device comprises: a framing unit 101, adapted to divide the input sound signal into frames; a first setting unit 102, adapted to set the reference thresholds of the noise signal, the reference thresholds comprising a first reference threshold and a second reference threshold, the second reference threshold being a multiple of the first reference threshold; a first judging unit 103, connected to the first setting unit 102 and adapted to judge whether the first reference threshold set by the first setting unit 102 is within a preset range; a zero-crossing rate calculating unit 104, connected to the framing unit 101 and the first judging unit 103 and adapted to calculate the zero-crossing rate of the input sound signal frame by frame when the first judging unit 103 judges that the first reference threshold is within the preset range; a second judging unit 105, connected to the zero-crossing rate calculating unit 104 and adapted to judge whether the calculated zero-crossing rate is greater than a preset threshold and, if it is not, to judge the frame to be a noise signal; a time-frequency analysis unit 106, connected to the framing unit 101, the first judging unit 103 and the second judging unit 105 and adapted to perform time-frequency analysis on the input sound signal frame by frame when the first judging unit 103 judges that the first reference threshold is outside the preset range or when the second judging unit 105 judges that the calculated zero-crossing rate is greater than the preset threshold; and a third judging unit 107, connected to the time-frequency analysis unit 106 and adapted to judge the frame to be a noise signal if the result of the time-frequency analysis is less than or equal to the first reference threshold, to treat the frame as an undetermined signal and judge it based on the judgment result of the next frame if the result is greater than the first reference threshold and less than the second reference threshold, and to judge the frame to be a voice signal if the result is greater than or equal to the second reference threshold. The third judging unit 107 judges the undetermined signal based on the judgment result of the next frame by judging the undetermined signal to be of the same signal type as the next frame.
In this embodiment, the voice-activation detecting device further comprises a second setting unit 109, which is connected to the first setting unit 102 and the second judging unit 105 and is adapted to set the preset threshold of the zero-crossing rate based on the first reference threshold. Specifically, the preset range comprises a first preset range and a second preset range; the first preset range is related to the maximum preset value of the first reference threshold, and the second preset range is related to the minimum preset value and the middle preset value of the first reference threshold. The second setting unit 109 sets the preset threshold of the zero-crossing rate based on the first reference threshold as follows: if the first reference threshold is within the first preset range, the preset threshold of the zero-crossing rate is set to the first preset threshold; if the first reference threshold is within the second preset range, the preset threshold of the zero-crossing rate is set to the second preset threshold.
The voice-activation detecting device further comprises a noise prediction unit 108, which is connected to the framing unit 101 and the first setting unit 102 and is adapted to extract and analyze the first N frames of the input sound signal to obtain the reference thresholds (the first reference threshold and the second reference threshold) set by the first setting unit 102.
In addition, the voice-activation detecting device further comprises a storage unit 110, which is connected to the third judging unit 107 and is adapted to store the P frames of undetermined signal immediately preceding the determined voice signal and the Q frames of undetermined signal immediately following it.
For the specific implementation of the voice-activation detecting device, reference may be made to the voice-activation detecting method provided in this embodiment; details are not repeated here.
Embodiment two
Fig. 3 is a schematic flowchart of the voice-activation detecting method provided by Embodiment 2 of the present invention. As shown in Fig. 3, this embodiment differs from Embodiment 1 in that, after the frame is judged to be a noise signal in step S107 or step S109, step S112 is also executed: the reference thresholds are updated based on this noise frame. Specifically, updating the reference thresholds based on the noise frame comprises: multiplying the maximum preset value of the first reference threshold and the result of the time-frequency analysis each by a preset weighting coefficient, and using the sum to update the reference thresholds. For a frame that has been determined to be noise, the result of its time-frequency analysis reflects the characteristics of the background noise in the current environment. The time-frequency result of the noise frame is therefore multiplied by a weighting coefficient a, the maximum preset value of the first reference threshold is multiplied by the corresponding weighting coefficient b, where a+b=1, and the sum of the two products is taken as the new first reference threshold. The new second reference threshold is then obtained from the updated first reference threshold using the multiple relationship between the two thresholds. For example, suppose the currently set first reference threshold is 260 and time-frequency analysis of a frame gives a result of 250. Step S108 then finds that the result is less than the first reference threshold, so step S109 is executed, followed by step S112, in which the reference thresholds are updated based on this noise frame. As described in Embodiment 1, the maximum preset value of the first reference threshold is 350. If the weighting coefficient of the time-frequency result is 0.6, the weighting coefficient of the maximum preset value of the first reference threshold is 0.4, and the resulting value is 250*0.6+350*0.4=150+140=290. The updated first reference threshold is therefore 290, and because the second reference threshold is 1.3 times the first reference threshold in this embodiment, the updated second reference threshold is 377. Of course, this is only one way of updating the reference thresholds based on the noise frame. In other embodiments, when the result of the time-frequency analysis is found to be less than the first reference threshold, the first reference threshold may simply be replaced by that result.
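The update of step S112 is a weighted sum followed by a rescaling, and can be checked against the worked numbers with a short Python sketch. The function and parameter names are hypothetical; the default values 350, 0.6 and 1.3 are the example values from the text.

```python
def update_thresholds(tf_result, max_preset=350.0, a=0.6, ratio=1.3):
    """Update both reference thresholds from the result of a noise frame.

    a weights the time-frequency result; b = 1 - a weights the maximum
    preset value of the first reference threshold, so that a + b = 1.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    b = 1.0 - a
    first = a * tf_result + b * max_preset
    second = ratio * first  # the second threshold stays a fixed multiple of the first
    return first, second


# Worked example from the text: result 250, maximum preset value 350, a = 0.6.
new_first, new_second = update_thresholds(250.0)
```

Up to floating-point rounding, this reproduces the values 290 and 377 derived above.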
Updating the first reference threshold and the second reference threshold promptly on the basis of the noise signals already identified allows the method to adapt to changes in the background noise of the current environment, making the identification of noise signals and voice signals more accurate and effective.
For the specific implementation of the other steps of this embodiment, reference may be made to Embodiment 1; details are not repeated here.
Based on the above voice-activation detecting method, this embodiment also provides a voice-activation detecting device. Fig. 4 is a schematic structural diagram of the voice-activation detecting device provided by Embodiment 2 of the present invention. As shown in Fig. 4, the device comprises all the units of the voice-activation detecting device of Embodiment 1 and, in addition, an updating unit 111. The updating unit 111 is connected to the second judging unit 105, the third judging unit 107 and the first setting unit 102, and is adapted to update the reference thresholds based on the noise frame after the second judging unit 105 or the third judging unit 107 judges the frame to be a noise signal. The updating unit 111 updates the reference thresholds based on the noise frame as follows: the maximum preset value of the first reference threshold and the result of the time-frequency analysis are each multiplied by a preset weighting coefficient and the sum is used to update the first reference threshold; the second reference threshold is then updated from the updated first reference threshold using the multiple relationship between the second reference threshold and the first reference threshold.
For the specific implementation of the voice-activation detecting device of this embodiment, reference may be made to the voice-activation detecting method of this embodiment; details are not repeated here.
Embodiment three
Fig. 5 is a schematic flowchart of the voice-activation detecting method provided by Embodiment 3 of the present invention. As shown in Fig. 5, this embodiment differs from the voice-activation detecting methods of Embodiment 1 and Embodiment 2 in that it implements the voice-activation detecting method provided by the invention in a simpler form. Specifically, with reference to Fig. 1 or Fig. 3, the method of this embodiment does not need the step of judging whether the first reference threshold is within the preset range (step S103), and therefore does not need the zero-crossing rate calculation or the related subsequent judging steps (steps S105, S106 and S107). In addition, the step of setting the first reference threshold and the second reference threshold of the noise signal before the time-frequency analysis is not needed either: the result of the time-frequency analysis can be compared directly with a pre-stored default first reference threshold and second reference threshold.
The voice-activation detecting method provided by this embodiment comprises: step S201, dividing the input sound signal into frames; step S202, performing time-frequency analysis on the input sound signal frame by frame; and step S203, comparing the result of the time-frequency analysis with the first reference threshold and the second reference threshold. If the result of the time-frequency analysis is less than or equal to the first reference threshold, step S204 is executed and the frame is judged to be a noise signal. If the result is greater than the first reference threshold and less than the second reference threshold, step S205 is executed: the frame is an undetermined signal and is judged based on the judgment result of the next frame. If the result is greater than or equal to the second reference threshold, step S206 is executed and the frame is judged to be a voice signal.
In this embodiment, each frame of the sound signal is 8 ms long. The first reference threshold and the second reference threshold are obtained by extracting and analyzing the first N frames of the input sound signal. The second reference threshold is 1.3 times the first reference threshold. Judging the undetermined signal based on the judgment result of the next frame comprises: judging the undetermined signal to be of the same signal type as the next frame. The time-frequency analysis comprises: computing the variance of the frame in the time domain and in the frequency domain to obtain a value that reflects its combined rate of change over the time and frequency domains.
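The framing of step S201 with 8 ms frames can be illustrated with the Python sketch below. The sampling rate of 8000 Hz is an assumption (the text gives only the frame duration), the function name is hypothetical, and the inter-frame smoothing mentioned elsewhere in the description is left out for brevity; a trailing partial frame is simply discarded here.

```python
def split_frames(samples, sample_rate=8000, frame_ms=8):
    """Split a sample sequence into consecutive frames of frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000  # 64 samples at 8 kHz and 8 ms
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


# 200 samples yield three full 64-sample frames; the last 8 samples are dropped.
frames = split_frames(list(range(200)))
```

Each resulting frame would then be passed to the time-frequency analysis of step S202 and compared with the two reference thresholds in step S203.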
In addition, the voice-activation detecting method further comprises: storing the P frames of undetermined signal immediately preceding the determined voice signal and the Q frames of undetermined signal immediately following it.
In other embodiments, after the frame is judged to be a noise signal, the voice-activation detecting method may further comprise a step of updating the first reference threshold and the second reference threshold based on this noise frame. Specifically, the maximum preset value of the first reference threshold and the result of the time-frequency analysis are each multiplied by a preset weighting coefficient, and the sum is used to update the first reference threshold and the second reference threshold. For this updating step, reference may be made to the description of the voice-activation detecting method in Embodiment 2; details are not repeated here.
Based on the above voice-activation detecting method, this embodiment also provides a voice-activation detecting device. Fig. 6 is a schematic structural diagram of the voice-activation detecting device provided by Embodiment 3 of the present invention. As shown in Fig. 6, the device comprises: a framing unit 201, adapted to divide the input sound signal into frames; a time-frequency analysis unit 202, connected to the framing unit 201 and adapted to perform time-frequency analysis on the input sound signal frame by frame, the time-frequency analysis comprising computing the variance of the frame in the time domain and in the frequency domain to obtain a value that reflects its combined rate of change over the time and frequency domains; and a judging unit 203, connected to the time-frequency analysis unit 202 and adapted to judge the frame to be a noise signal if the result of the time-frequency analysis is less than or equal to the first reference threshold, to treat the frame as an undetermined signal and judge it based on the judgment result of the next frame if the result is greater than the first reference threshold and less than the second reference threshold, and to judge the frame to be a voice signal if the result is greater than or equal to the second reference threshold. The judging unit 203 judges the undetermined signal based on the judgment result of the next frame by judging the undetermined signal to be of the same signal type as the next frame.
In this embodiment, the voice-activation detecting device further comprises a noise prediction unit 204, which is connected to the framing unit 201 and the decision unit 203 and is adapted to extract and analyze the first N frames of the input sound signal to obtain the first reference threshold and the second reference threshold.
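This passage does not state how the first N frames are analyzed. As a non-limiting sketch, one plausible reading is assumed here: the first reference threshold is the mean time-frequency result of the first N frames, limited to the range between the minimum and maximum preset values mentioned in the claims, and the second reference threshold is 1.3 times the first. The function name `initial_thresholds`, the use of the mean, and the clamping are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def initial_thresholds(scores: Sequence[float], n: int,
                       min_preset: float, max_preset: float,
                       ratio: float = 1.3) -> Tuple[float, float]:
    """Estimate both reference thresholds from the first n frame results."""
    head = list(scores[:n])
    if not head:
        raise ValueError("at least one frame result is required")
    # Assumption: the opening frames of a call are mostly background noise,
    # so their mean result is used as the noise reference.
    t1 = sum(head) / len(head)
    # Assumption: keep the first threshold inside [min_preset, max_preset].
    t1 = min(max(t1, min_preset), max_preset)
    return t1, ratio * t1
```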
In addition, the voice-activation detecting device further comprises a storage unit 205, which is connected to the decision unit 203 and is adapted to save the P consecutive frames of undetermined signal immediately preceding a determined voice signal and the Q consecutive frames of undetermined signal immediately following it.
In other embodiments, the voice-activation detecting device may further comprise an updating unit, adapted to update the first reference threshold and the second reference threshold based on the noise frame after the decision unit 203 judges a frame to be a noise signal. Specifically, the updating unit multiplies the maximum preset value of the first reference threshold and the result of the time-frequency analysis each by a preset weighting coefficient, and uses the sum of the two products to update the first reference threshold and the second reference threshold.
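As a non-limiting illustration, the weighted update performed by the updating unit can be sketched as below. The weighting coefficients `w_max` and `w_score` and their default values are hypothetical, since no values are given in this passage. Clamping the result to the range between the minimum and maximum preset values is an assumption based on the range stated in the claims, and the second threshold is derived from the first through the 1.3 multiple relationship.

```python
from typing import Tuple


def update_thresholds(score: float, min_preset: float, max_preset: float,
                      w_max: float = 0.1, w_score: float = 0.9,
                      ratio: float = 1.3) -> Tuple[float, float]:
    """Update both reference thresholds from one frame judged to be noise.

    score is the time-frequency result of that noise frame.
    """
    # Weighted sum of the maximum preset value and the frame's result.
    t1 = w_max * max_preset + w_score * score
    # Assumption: the first threshold must stay within its preset range.
    t1 = min(max(t1, min_preset), max_preset)
    # The second threshold follows from the fixed multiple relationship.
    return t1, ratio * t1
```

Calling this function once per noise frame lets the thresholds track gradual changes in the background noise level.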
For the implementation of the voice-activation detecting device of this embodiment, refer to the relevant steps of the voice-activation detecting method described in this embodiment and in embodiment one; they are not repeated here.
In summary, the voice-activation detecting method and device provided by the embodiments of the present invention have at least the following beneficial effects:
The input sound signal is divided into frames (with a smooth transition between frames), time-frequency analysis is performed on the input sound signal frame by frame, and the result of the time-frequency analysis is compared with the preset first and second reference thresholds of the noise signal. Whether a given frame is a voice signal or a noise signal can thus be identified quickly and effectively, so that background noise is reduced while call quality is preserved.
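As a non-limiting sketch of the framing and time-frequency analysis steps, the following Python code uses only the standard library. Several points are assumptions rather than details taken from the text: the 8 kHz sample rate, the use of non-overlapping frames (the smooth transition between frames is not modelled), the use of the magnitude spectrum for the frequency-domain variance, and the combination of the two variances by simple addition. The function names are hypothetical. The 8 ms frame length is taken from the claims.

```python
import cmath
from statistics import pvariance
from typing import List, Sequence


def split_frames(samples: Sequence[float], sample_rate: int = 8000,
                 frame_ms: int = 8) -> List[List[float]]:
    """Cut the signal into consecutive non-overlapping frames."""
    n = sample_rate * frame_ms // 1000  # 64 samples per frame at 8 kHz
    return [list(samples[i:i + n])
            for i in range(0, len(samples) - n + 1, n)]


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitudes of a naive DFT for bins 0 .. n/2 (fine for short frames)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def time_frequency_score(frame: Sequence[float]) -> float:
    """Variance in the time domain plus variance in the frequency domain.

    Adding the two variances is an assumption; the text only says that
    the result reflects the combined rate of change in both domains.
    """
    return pvariance(frame) + pvariance(magnitude_spectrum(frame))
```

Each frame's result from `time_frequency_score` would then be compared with the first and second reference thresholds.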
Further, for several special classes of noise signal, it is first judged whether the first reference threshold that has been set falls within a preset range. A predetermined zero-crossing-rate threshold is then set according to which preset range the first reference threshold falls in (different preset ranges corresponding to different types of noise signal), and the zero-crossing rate of the input sound signal is calculated frame by frame. A frame whose calculated zero-crossing rate is less than the predetermined threshold is judged to be a noise signal; otherwise the frame is checked further by time-frequency analysis. Different noise signals are thereby checked in a targeted manner, which largely avoids false detections and missed detections and makes the identification of noise signals and voice signals more effective.
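The zero-crossing-rate pre-check can be illustrated, in a non-limiting way, by the sketch below. The preset ranges and the zero-crossing-rate thresholds attached to them are passed in as data because no concrete values are given in this passage; the function names and the representation of ranges as `((low, high), threshold)` pairs are hypothetical. Following the wording of the claims, a frame is passed on to time-frequency analysis only when its zero-crossing rate is greater than the predetermined threshold.

```python
from typing import Optional, Sequence, Tuple

Range = Tuple[Tuple[float, float], float]  # ((low, high), zcr_threshold)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def zcr_threshold_for(t1: float, ranges: Sequence[Range]) -> Optional[float]:
    """Pick the zero-crossing-rate threshold for the range containing t1."""
    for (low, high), threshold in ranges:
        if low <= t1 <= high:
            return threshold
    return None  # t1 is outside every preset range


def precheck_is_noise(frame: Sequence[float], t1: float,
                      ranges: Sequence[Range]) -> bool:
    """Return True if the pre-check alone judges the frame to be noise.

    False means the frame must still go through time-frequency analysis.
    """
    threshold = zcr_threshold_for(t1, ranges)
    if threshold is None:
        return False
    return zero_crossing_rate(frame) <= threshold
```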
Based on the noise signals already identified, the first reference threshold and the second reference threshold are updated continuously and in a timely manner, so that the method adapts to changes in the background noise of the current environment and the identification of noise signals and voice signals becomes more accurate and effective.
In addition, by extracting and analyzing the first N frames of the input sound signal to obtain the first reference threshold and the second reference threshold, reference thresholds of the noise signal suited to the current environment can be set at the very start of a voice call, whatever environment the call takes place in. The background noise of the current environment is thus predicted well, and noise signals are identified more accurately.
Although the present invention is disclosed above by way of preferred embodiments, they are not intended to limit the present invention. Any person skilled in the art may, without departing from the spirit and scope of the present invention, use the methods and technical content disclosed above to make possible changes and modifications to the technical solution of the present invention. Therefore, any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, falls within the scope of protection of the technical solution of the present invention.

Claims (32)

1. A voice-activation detecting method, characterized by comprising:
dividing an input sound signal into frames;
performing time-frequency analysis on the input sound signal frame by frame, the time-frequency analysis comprising: taking the variance of the frame in the time domain and in the frequency domain respectively, to obtain a value reflecting the combined rate of change of the frame in the time and frequency domains;
if the result of the time-frequency analysis is less than or equal to a first reference threshold, judging the frame to be a noise signal; if the result of the time-frequency analysis is greater than or equal to a second reference threshold, judging the frame to be a voice signal; if the result of the time-frequency analysis is greater than the first reference threshold and less than the second reference threshold, treating the frame as an undetermined signal and judging it based on the judging result of the next frame; wherein the second reference threshold is 1.3 times the first reference threshold.
2. The voice-activation detecting method according to claim 1, characterized in that the first reference threshold and the second reference threshold are obtained by extracting and analyzing the first N frames of the input sound signal.
3. The voice-activation detecting method according to claim 1, characterized in that judging the undetermined signal based on the judging result of the next frame comprises: assigning the undetermined signal the same signal type as the next frame.
4. The voice-activation detecting method according to claim 1, characterized by further comprising, after the frame is judged to be a noise signal, updating the first reference threshold and the second reference threshold based on that noise frame.
5. The voice-activation detecting method according to claim 4, characterized in that the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value; and updating the first reference threshold and the second reference threshold based on the noise frame comprises: multiplying the maximum preset value and the result of the time-frequency analysis each by a preset weighting coefficient and updating the first reference threshold with the sum of the two products; and updating the second reference threshold based on the multiple relationship between the second reference threshold and the first reference threshold and on the updated first reference threshold.
6. The voice-activation detecting method according to claim 1, characterized by further comprising: saving the P consecutive frames of undetermined signal immediately preceding a determined voice signal and the Q consecutive frames of undetermined signal immediately following it.
7. The voice-activation detecting method according to claim 1, characterized in that each frame of the sound signal is 8 ms long.
8. A voice-activation detecting method, characterized by comprising:
dividing an input sound signal into frames;
setting a first reference threshold and a second reference threshold of the noise signal, the second reference threshold being 1.3 times the first reference threshold;
judging whether the first reference threshold is within a preset range; if not, performing time-frequency analysis on the input sound signal frame by frame; if so, calculating the zero-crossing rate of the input sound signal frame by frame and, if the calculated zero-crossing rate is greater than a predetermined threshold, performing the time-frequency analysis, otherwise judging the frame to be a noise signal; the time-frequency analysis comprising: taking the variance of the frame in the time domain and in the frequency domain respectively, to obtain a value reflecting the combined rate of change of the frame in the time and frequency domains;
if the result of the time-frequency analysis is less than or equal to the first reference threshold, judging the frame to be a noise signal; if the result of the time-frequency analysis is greater than or equal to the second reference threshold, judging the frame to be a voice signal; if the result of the time-frequency analysis is greater than the first reference threshold and less than the second reference threshold, treating the frame as an undetermined signal and judging it based on the judging result of the next frame.
9. The voice-activation detecting method according to claim 8, characterized by further comprising setting the predetermined threshold of the zero-crossing rate based on the first reference threshold.
10. The voice-activation detecting method according to claim 9, characterized in that the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value; the preset range comprises a first preset range and a second preset range, the first preset range being related to the maximum preset value and the second preset range being related to the minimum preset value and an intermediate preset value, the intermediate preset value being greater than the minimum preset value and less than the maximum preset value; and setting the predetermined threshold of the zero-crossing rate based on the first reference threshold comprises: if the first reference threshold is within the first preset range, setting the predetermined threshold of the zero-crossing rate to a first predetermined threshold; if the first reference threshold is within the second preset range, setting the predetermined threshold of the zero-crossing rate to a second predetermined threshold.
11. The voice-activation detecting method according to claim 8, characterized in that the first reference threshold and the second reference threshold are obtained by extracting and analyzing the first N frames of the input sound signal.
12. The voice-activation detecting method according to claim 8, characterized in that judging the undetermined signal based on the judging result of the next frame comprises: assigning the undetermined signal the same signal type as the next frame.
13. The voice-activation detecting method according to claim 8, characterized by further comprising, after the frame is judged to be a noise signal, updating the first reference threshold and the second reference threshold based on that noise frame.
14. The voice-activation detecting method according to claim 13, characterized in that the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value; and updating the first reference threshold and the second reference threshold based on the noise frame comprises: multiplying the maximum preset value and the result of the time-frequency analysis each by a preset weighting coefficient and updating the first reference threshold with the sum of the two products; and updating the second reference threshold based on the multiple relationship between the second reference threshold and the first reference threshold and on the updated first reference threshold.
15. The voice-activation detecting method according to claim 8, characterized by further comprising: saving the P consecutive frames of undetermined signal immediately preceding a determined voice signal and the Q consecutive frames of undetermined signal immediately following it.
16. The voice-activation detecting method according to claim 8, characterized in that each frame of the sound signal is 8 ms long.
17. A voice-activation detecting device, characterized by comprising:
a framing unit, adapted to divide an input sound signal into frames;
a time-frequency analysis unit, adapted to perform time-frequency analysis on the input sound signal frame by frame, the time-frequency analysis taking the variance of the frame in the time domain and in the frequency domain respectively, to obtain a value reflecting the combined rate of change of the frame in the time and frequency domains;
a decision unit, adapted to judge the frame to be a noise signal if the result of the time-frequency analysis is less than or equal to a first reference threshold; to judge the frame to be a voice signal if the result of the time-frequency analysis is greater than or equal to a second reference threshold; and, if the result of the time-frequency analysis is greater than the first reference threshold and less than the second reference threshold, to treat the frame as an undetermined signal and judge it based on the judging result of the next frame; wherein the second reference threshold is 1.3 times the first reference threshold.
18. The voice-activation detecting device according to claim 17, characterized by further comprising a noise prediction unit, adapted to extract and analyze the first N frames of the input sound signal to obtain the first reference threshold and the second reference threshold.
19. The voice-activation detecting device according to claim 17, characterized in that the decision unit assigns the undetermined signal the same signal type as the next frame.
20. The voice-activation detecting device according to claim 17, characterized by further comprising an updating unit, adapted to update the first reference threshold and the second reference threshold based on the noise frame after the decision unit judges the frame to be a noise signal.
21. The voice-activation detecting device according to claim 20, characterized in that the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value; and the updating unit multiplies the maximum preset value and the result of the time-frequency analysis each by a preset weighting coefficient, updates the first reference threshold with the sum of the two products, and updates the second reference threshold based on the multiple relationship between the second reference threshold and the first reference threshold and on the updated first reference threshold.
22. The voice-activation detecting device according to claim 17, characterized by further comprising a storage unit, adapted to save the P consecutive frames of undetermined signal immediately preceding a determined voice signal and the Q consecutive frames of undetermined signal immediately following it.
23. The voice-activation detecting device according to claim 17, characterized in that each frame of the sound signal is 8 ms long.
24. A voice-activation detecting device, characterized by comprising:
a framing unit, adapted to divide an input sound signal into frames;
a first setting unit, adapted to set a first reference threshold and a second reference threshold of the noise signal, the second reference threshold being 1.3 times the first reference threshold;
a first decision unit, adapted to judge whether the first reference threshold is within a preset range;
a zero-crossing rate calculation unit, adapted to calculate the zero-crossing rate of the input sound signal frame by frame when the first reference threshold is judged to be within the preset range;
a second decision unit, adapted to judge whether the calculated zero-crossing rate is greater than a predetermined threshold and, if it is not, to judge the frame to be a noise signal;
a time-frequency analysis unit, adapted to perform time-frequency analysis on the input sound signal frame by frame when the first reference threshold is judged to be outside the preset range or when the calculated zero-crossing rate is judged to be greater than the predetermined threshold, the time-frequency analysis taking the variance of the frame in the time domain and in the frequency domain respectively, to obtain a value reflecting the combined rate of change of the frame in the time and frequency domains;
a third decision unit, adapted to judge the frame to be a noise signal if the result of the time-frequency analysis is less than or equal to the first reference threshold; to judge the frame to be a voice signal if the result of the time-frequency analysis is greater than or equal to the second reference threshold; and, if the result of the time-frequency analysis is greater than the first reference threshold and less than the second reference threshold, to treat the frame as an undetermined signal and judge it based on the judging result of the next frame.
25. The voice-activation detecting device according to claim 24, characterized by further comprising a second setting unit, adapted to set the predetermined threshold of the zero-crossing rate based on the first reference threshold.
26. The voice-activation detecting device according to claim 25, characterized in that the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value; the preset range comprises a first preset range and a second preset range, the first preset range being related to the maximum preset value and the second preset range being related to the minimum preset value and an intermediate preset value, the intermediate preset value being greater than the minimum preset value and less than the maximum preset value; if the first reference threshold is within the first preset range, the second setting unit sets the predetermined threshold of the zero-crossing rate to a first predetermined threshold; if the first reference threshold is within the second preset range, the second setting unit sets the predetermined threshold of the zero-crossing rate to a second predetermined threshold.
27. The voice-activation detecting device according to claim 24, characterized by further comprising a noise prediction unit, adapted to extract and analyze the first N frames of the input sound signal to obtain the first reference threshold and the second reference threshold set by the first setting unit.
28. The voice-activation detecting device according to claim 24, characterized in that the third decision unit assigns the undetermined signal the same signal type as the next frame.
29. The voice-activation detecting device according to claim 24, characterized by further comprising an updating unit, adapted to update the first reference threshold and the second reference threshold based on the noise frame after the second decision unit or the third decision unit judges the frame to be a noise signal.
30. The voice-activation detecting device according to claim 29, characterized in that the first reference threshold is greater than or equal to a minimum preset value and less than or equal to a maximum preset value; and the updating unit multiplies the maximum preset value and the result of the time-frequency analysis each by a preset weighting coefficient, updates the first reference threshold with the sum of the two products, and updates the second reference threshold based on the multiple relationship between the second reference threshold and the first reference threshold and on the updated first reference threshold.
31. The voice-activation detecting device according to claim 24, characterized by further comprising a storage unit, adapted to save the P consecutive frames of undetermined signal immediately preceding a determined voice signal and the Q consecutive frames of undetermined signal immediately following it.
32. The voice-activation detecting device according to claim 24, characterized in that each frame of the sound signal is 8 ms long.
CN2011102352285A 2011-08-16 2011-08-16 Voice-activation detecting method and device Active CN102314884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102352285A CN102314884B (en) 2011-08-16 2011-08-16 Voice-activation detecting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102352285A CN102314884B (en) 2011-08-16 2011-08-16 Voice-activation detecting method and device

Publications (2)

Publication Number Publication Date
CN102314884A CN102314884A (en) 2012-01-11
CN102314884B true CN102314884B (en) 2013-01-02

Family

ID=45427993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102352285A Active CN102314884B (en) 2011-08-16 2011-08-16 Voice-activation detecting method and device

Country Status (1)

Country Link
CN (1) CN102314884B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103674235B (en) * 2014-01-03 2015-09-09 哈尔滨工业大学 Based on the single-frequency alarm sound characteristic detection method of Short Time Fourier Transform
CN104091603B (en) * 2014-05-23 2017-06-09 普强信息技术(北京)有限公司 Endpoint detection system and its computational methods based on fundamental frequency
CN105530390B (en) * 2014-09-30 2018-07-31 华为技术有限公司 The method in Conference server and its echo source in detection meeting
CN104538041B (en) * 2014-12-11 2018-07-03 深圳市智美达科技有限公司 abnormal sound detection method and system
CN105810214B (en) * 2014-12-31 2019-11-05 展讯通信(上海)有限公司 Voice-activation detecting method and device
CN105261368B (en) * 2015-08-31 2019-05-21 华为技术有限公司 A kind of voice awakening method and device
CN107305774B (en) * 2016-04-22 2020-11-03 腾讯科技(深圳)有限公司 Voice detection method and device
CN106534461B (en) * 2016-11-04 2019-07-26 惠州Tcl移动通信有限公司 The noise reduction system and its noise-reduction method of earphone
KR102643501B1 (en) * 2016-12-26 2024-03-06 현대자동차주식회사 Dialogue processing apparatus, vehicle having the same and dialogue processing method
CN108447505B (en) * 2018-05-25 2019-11-05 百度在线网络技术(北京)有限公司 Audio signal zero-crossing rate processing method, device and speech recognition apparatus
CN110648660A (en) * 2018-06-27 2020-01-03 深圳联友科技有限公司 Voice activation method of BS (base station) end
CN109215647A (en) * 2018-08-30 2019-01-15 出门问问信息科技有限公司 Voice awakening method, electronic equipment and non-transient computer readable storage medium
CN110491403B (en) * 2018-11-30 2022-03-04 腾讯科技(深圳)有限公司 Audio signal processing method, device, medium and audio interaction equipment
CN110634497B (en) * 2019-10-28 2022-02-18 普联技术有限公司 Noise reduction method and device, terminal equipment and storage medium
CN112017639B (en) * 2020-09-10 2023-11-07 歌尔科技有限公司 Voice signal detection method, terminal equipment and storage medium
WO2023092399A1 (en) * 2021-11-25 2023-06-01 华为技术有限公司 Speech recognition method, speech recognition apparatus, and system
CN114724576B (en) * 2022-06-09 2022-10-04 广州市保伦电子有限公司 Method, device and system for updating threshold in howling detection in real time
CN115995231B (en) * 2023-03-21 2023-06-16 北京探境科技有限公司 Voice wakeup method and device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5828993A (en) * 1995-09-26 1998-10-27 Victor Company Of Japan, Ltd. Apparatus and method of coding and decoding vocal sound data based on phoneme
CN1285945A (en) * 1998-01-07 2001-02-28 艾利森公司 System and method for encoding voice while suppressing acoustic background noise
CN1363923A (en) * 2001-11-02 2002-08-14 北京阜国数字技术有限公司 Blocks length selection method based on adaptive threshold and typical sample predication
CN1624766A (en) * 2000-08-21 2005-06-08 康奈克森特系统公司 Method for noise robust classification in speech coding

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3593201B2 (en) * 1996-01-12 2004-11-24 ユナイテッド・モジュール・コーポレーション Audio decoding equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JP特开平9-200055A 1997.07.31

Also Published As

Publication number Publication date
CN102314884A (en) 2012-01-11

Similar Documents

Publication Publication Date Title
CN102314884B (en) Voice-activation detecting method and device
CN103854662B (en) Adaptive voice detection method based on multiple domain Combined estimator
CN110473539B (en) Method and device for improving voice awakening performance
CN108922513B (en) Voice distinguishing method and device, computer equipment and storage medium
CN110517670A (en) Promote the method and apparatus for waking up performance
CN105118502A (en) End point detection method and system of voice identification system
CN110047470A (en) A kind of sound end detecting method
CN105118522B (en) Noise detection method and device
CN107305774A (en) Speech detection method and device
CN106328151B (en) ring noise eliminating system and application method thereof
CN110060665A (en) Word speed detection method and device, readable storage medium storing program for executing
CN104021789A (en) Self-adaption endpoint detection method using short-time time-frequency value
CN102714034B (en) Signal processing method, device and system
EP2927906B1 (en) Method and apparatus for detecting voice signal
CN111429932A (en) Voice noise reduction method, device, equipment and medium
CN107331386B (en) Audio signal endpoint detection method and device, processing system and computer equipment
CN111540342B (en) Energy threshold adjusting method, device, equipment and medium
CN103440872A (en) Transient state noise removing method
CN104091603A (en) Voice activity detection system based on fundamental frequency and calculation method thereof
CN108305639A (en) Speech-emotion recognition method, computer readable storage medium, terminal
CN105679312A (en) Phonetic feature processing method of voiceprint identification in noise environment
US20190057705A1 (en) Methods and apparatus to identify a source of speech captured at a wearable electronic device
CN111223492A (en) Echo path delay estimation method and device
CN108682432A (en) Speech emotion recognition device
Labied et al. An overview of automatic speech recognition preprocessing techniques

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent for invention or patent application
CB02 Change of applicant information

Address after: 100080, Beijing, Haidian, Haidian District South Road, 21, Zhongguancun intellectual property building (former sea building), block B, 6

Applicant after: Czech surway Technology (Beijing) Co. Ltd.

Co-applicant after: Shanghai Gener Information Technology Co., Ltd.

Address before: 100080, Beijing City, Haidian District, No. 52 West Fourth Ring Road, SMIC building, 11 floor, 1102

Applicant before: Czech surway Technology (Beijing) Co. Ltd.

Co-applicant before: Shanghai Gener Information Technology Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Free format text: FORMER OWNER: SHANGHAI GENER INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20150320

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150320

Address after: 100080, No. 21 Haidian South Road, Beijing, block B, 6, Haidian District

Patentee after: Czech surway Technology (Beijing) Co. Ltd.

Address before: 100080, Beijing, Haidian, Haidian District South Road, 21, Zhongguancun intellectual property building (former sea building), block B, 6

Patentee before: Czech surway Technology (Beijing) Co. Ltd.

Patentee before: Shanghai Gener Information Technology Co., Ltd.

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and device for voice activity detection (VAD) and encoder

Effective date of registration: 20150724

Granted publication date: 20130102

Pledgee: Beijing technology intellectual property financing Company limited by guarantee

Pledgor: Czech surway Technology (Beijing) Co. Ltd.

Registration number: 2015990000598

PLDC Enforcement, change and cancellation of contracts on pledge of patent right or utility model
PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20150819

Granted publication date: 20130102

Pledgee: Beijing technology intellectual property financing Company limited by guarantee

Pledgor: Czech surway Technology (Beijing) Co. Ltd.

Registration number: 2015990000598

PLDC Enforcement, change and cancellation of contracts on pledge of patent right or utility model
C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: 100080, No. 21 Haidian South Road, Beijing, block B, 6, Haidian District

Patentee after: BEIJING ZED-3 TECHNOLOGY CO., LTD.

Address before: 100080, No. 21 Haidian South Road, Beijing, block B, 6, Haidian District

Patentee before: Czech surway Technology (Beijing) Co. Ltd.

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and device for voice activity detection (VAD) and encoder

Effective date of registration: 20161229

Granted publication date: 20130102

Pledgee: Beijing ustron Tongsheng financing Company limited by guarantee

Pledgor: BEIJING ZED-3 TECHNOLOGY CO., LTD.

Registration number: 2016990001186

PLDC Enforcement, change and cancellation of contracts on pledge of patent right or utility model
PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20181218

Granted publication date: 20130102

Pledgee: Beijing ustron Tongsheng financing Company limited by guarantee

Pledgor: BEIJING ZED-3 TECHNOLOGY CO., LTD.

Registration number: 2016990001186

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and device for voice activity detection (VAD) and encoder

Effective date of registration: 20181219

Granted publication date: 20130102

Pledgee: Beijing ustron Tongsheng financing Company limited by guarantee

Pledgor: BEIJING ZED-3 TECHNOLOGY CO., LTD.

Registration number: 2018990001231

CP02 Change in the address of a patent holder

Address after: 1110-08, 10th floor, No.8, Haidian North 2nd Street, Haidian District, Beijing 100080

Patentee after: BEIJING JIESIRUI TECHNOLOGY Co.,Ltd.

Address before: 100080, No. 21 Haidian South Road, Beijing, block B, 6, Haidian District

Patentee before: BEIJING JIESIRUI TECHNOLOGY Co.,Ltd.

CP02 Change in the address of a patent holder