CN106847270A

CN106847270A - A dual-threshold speech endpoint detection method for place names

Info

Publication number: CN106847270A
Application number: CN201611135819.4A
Authority: CN
Inventors: 谢巍; 董万里
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2016-12-09
Filing date: 2016-12-09
Publication date: 2017-06-13
Anticipated expiration: 2036-12-09
Also published as: CN106847270B

Abstract

The invention discloses a dual-threshold speech endpoint detection method for place names. Starting from the first frame, the energy of each speech frame is compared with a lowest energy threshold and a highest energy threshold, and its zero-crossing rate is compared with a zero-crossing-rate threshold, which determines how the next frame is to be examined. When the signal may be entering the speech state, an additional variable is used to retain the softly pronounced portion that occurs before the speech segment. The invention takes the characteristics of isolated-word place-name speech into account and improves the traditional dual-threshold method, ensuring that very short, softly pronounced sounds and the front part of the speech signal are not judged to be noise. This avoids losing speech, improves the accuracy of endpoint detection and its adaptability to on-site application environments, and lowers the requirements placed on the environment.

Description

A dual-threshold speech endpoint detection method for place names

Technical field

The invention belongs to the field of speech endpoint detection, and in particular relates to a dual-threshold speech endpoint detection method for place names.

Background technology

With rapid economic development and the increasingly prominent trend of globalization, the modern logistics industry has developed at an unprecedented pace in developed countries and has produced large economic and social benefits. Logistics resources include transport, storage, sorting, packaging and delivery; these resources are spread across many sectors, including manufacturing, agriculture and distribution.

Sorting is at present done largely by hand. Because workers spend long periods in a noisy working environment, they inevitably become mentally and physically fatigued, and the monotony and repetitiveness of the task can also make them inattentive. This lowers sorting accuracy and leads to more irreparable sorting errors. Manual inspection of products on the line therefore cannot meet the demands of modern industry.

Speech recognition, as an important interface for human-computer interaction, has already changed many aspects of daily life, from voice control systems in smart homes to in-vehicle speech recognition systems. Integrating speech recognition into the logistics sorting process is therefore an inevitable requirement for the development of logistics lines.

Within speech recognition, endpoint detection is a particularly important step, and its quality directly affects the final recognition result. The traditional endpoint detection method based on short-time energy and zero-crossing rate works only in a fairly ideal environment, and for isolated-word place-name speech its endpoint detection accuracy is low.

Summary of the invention

The object of the invention is to overcome the shortcomings and deficiencies of the prior art and to provide a dual-threshold speech endpoint detection method for place names that improves the accuracy of endpoint detection.

A dual-threshold speech endpoint detection method for place names comprises the following: starting from the first frame, the energy of each speech frame is compared with the lowest energy threshold and the highest energy threshold, and its zero-crossing rate is compared with the zero-crossing-rate threshold, which determines the appropriate way to examine the next frame; when the signal may be entering the speech state, an additional variable is used to retain the softly pronounced portion that occurs before the speech segment.

The specific steps are as follows:

1. Receive the pre-processed place-name speech signal; for each frame, compare its energy with the lowest energy threshold and the highest energy threshold, and compare its zero-crossing rate with the zero-crossing-rate threshold;

2. When the energy of the i-th frame < the lowest energy threshold, set the state variable to 0 and the speech-length counter to 0, indicating that the signal is still in a silent segment, and return to step 1 to examine the next frame;

When the highest energy threshold > the energy of the i-th frame > the lowest energy threshold, and the zero-crossing rate > the zero-crossing-rate threshold, set the state variable to 1, indicating that the signal may be in a speech segment; add 1 to the speech-length counter, add 1 to the variable that records the length of the possible speech segment, and return to step 1 to examine the next frame;

3. If the state variable is already 1, screen the frames that may belong to a speech segment according to a given criterion, to further distinguish noise from speech;

4. When the energy of the i-th frame > the highest energy threshold, set the state variable to 2, indicating that a speech segment has been entered; add 1 to the speech-length counter and examine the next frame according to step 5;

5. Determine whether the energy of the current frame > the lowest energy threshold, or the zero-crossing rate of the current frame > the zero-crossing-rate threshold;

If so, the signal is still in the speech segment and is not silent: keep the state variable at 2, add 1 to the speech-length counter, and examine the next frame according to step 5;

If not, the signal is passing from the speech segment into a silent segment: add 1 to the silence length and examine the silence length further. Once all valid speech has been found, set the state parameter to 3 and end the process.

Preferably, if the state variable is already 1 and the energy of the speech frame is below the lowest energy threshold, determine whether the variable recording the length of the possible speech segment exceeds a given threshold. If so, the current stretch is noise: discard the preceding tentative speech, set the state variable, the speech-length counter and the possible-speech-length variable to 0, and return to step 1 to examine the next frame. If not, the signal may still be in a speech segment: keep the state variable at 1, add 1 to the speech-length counter and to the possible-speech-length variable, and return to step 1 to examine the next frame.

Further, the given threshold is equal to 6.

Preferably, the silence length is examined further as follows: determine whether the silence length < the maximum silence length;

If so, keep the state variable at 2, add 1 to the speech-length counter, and examine the next frame according to step 5;

If not, determine whether the speech-length counter < the minimum speech length. If it is, everything detected so far is noise: set the state variable, the silence length and the speech-length counter to 0, and continue detection. If it is not, the speech segment has been found and is regarded as valid speech: set the state parameter to 3 and end the process.

Preferably, in the initial state the state variable is 0, the speech-length counter is 0, the variable that counts the length of the possible speech segment while entry into speech is not yet confirmed is 0, and the silence length is 0.

Preferably, the lowest energy threshold is 0.01, the highest energy threshold is 0.1, and the zero-crossing-rate threshold is 100.

Preferably, the maximum silence length is 10 and the minimum speech length is 5.

Preferably, the pre-processing includes pre-emphasis and framing.

Specifically, pre-emphasis is implemented by a digital filter that boosts high frequencies at 6 dB/octave; this high-pass filter satisfies H(z) = 1 - μz^-1 with μ = 0.97. The speech signal is divided into frames with a frame length of 256 and a frame shift of 128.

Compared with the prior art, the invention has the following advantages and beneficial effects:

The invention takes the characteristics of isolated-word place-name speech into account and improves the traditional dual-threshold method. It adds the variable slience1, which counts the length of the possible speech segment while entry into speech is not yet confirmed, and it tunes the endpoint detection parameters. This ensures that very short, softly pronounced sounds and the front part of an intermittent place-name utterance are not judged to be noise, which avoids losing speech, improves the accuracy of endpoint detection and its adaptability to on-site application environments, and lowers the requirements that endpoint detection places on the environment.

Brief description of the drawings

Fig. 1 is a flow diagram of the method of the embodiment.

Specific embodiment

The invention is described in further detail below with reference to an embodiment and the accompanying drawing, but embodiments of the invention are not limited thereto.

During endpoint detection of a place-name utterance, the signal may first be in a speech segment, then pass through a silent segment, and only then enter the normal speech segment. The traditional endpoint detection method treats the stretch before the normal speech segment as noise and cuts it off, so part of the speech is lost. For example, in the pronunciation of "Shijiazhuang" (石家庄), the first syllable "shi" is very soft and very short and is hard to detect.

The dual-threshold place-name endpoint detection method of this embodiment is based on an improved use of short-time average energy and zero-crossing rate. By adding the variable slience1, which counts the length of the possible speech segment while entry into speech is not yet confirmed, the speech that precedes the normal speech segment is preserved as a valid fragment even in the situation described above, which improves the effectiveness of endpoint detection.

Before endpoint detection, the place-name speech signal is pre-processed, which includes pre-emphasis and framing.

Because the average power spectrum of a speech signal is affected by glottal excitation and mouth-nose radiation, its high-frequency end falls off at about 6 dB/octave above roughly 80 Hz. When the spectrum of the speech signal is computed, the higher the frequency, the smaller the corresponding component, so the spectrum of the high-frequency part is harder to obtain than that of the low-frequency part; pre-emphasis is therefore applied to the speech signal. The central idea of pre-emphasis is to exploit the difference between signal and noise characteristics to process the signal effectively. Its purpose is to boost the high-frequency part so that the signal spectrum becomes flatter and stays flat over the whole band from low to high frequencies, allowing the spectrum to be computed with the same signal-to-noise ratio, which facilitates spectral analysis and vocal-tract parameter analysis. Pre-emphasis is implemented by a digital filter that boosts high frequencies at 6 dB/octave. This embodiment uses a high-pass filter satisfying H(z) = 1 - μz^-1 with μ = 0.97.
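The filter H(z) = 1 - μz^-1 corresponds to the difference equation y[n] = x[n] - μ·x[n-1]. The following is a minimal sketch of that filter in plain Python; the function name `pre_emphasis` and the choice to pass the first sample through unchanged are assumptions made for illustration, not details taken from the embodiment.

```python
def pre_emphasis(x, mu=0.97):
    """Apply y[n] = x[n] - mu * x[n-1] to a sequence of samples.

    Assumption: the first sample has no predecessor and is passed through.
    """
    if not x:
        return []
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]


if __name__ == "__main__":
    # A constant (zero-frequency) input is strongly attenuated after the
    # first sample, which is the expected high-pass behaviour.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```

A slowly varying input is suppressed while rapid sample-to-sample changes are kept, which is the high-frequency boost the paragraph describes.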

In addition, taken as a whole, the characteristics of a speech signal and the parameters that describe its essential features change over time. The signal is nevertheless short-time stationary: over a short interval (typically 10 ms to 30 ms) it can be regarded as an approximately unchanging stationary process.

Most current speech processing techniques therefore process the signal frame by frame on a short-time basis and then extract feature parameters from each frame. To make the transition between frames smooth and preserve continuity, overlapping framing is generally used, so that each frame overlaps the next; the offset between the starts of consecutive frames is called the frame shift. The frame length and the frame shift must both be chosen when framing. With a larger frame length there are fewer frames, less computation and faster processing, but the endpoint detection error tends to increase; with a smaller frame length there are more frames, more computation and slower processing. Typically there are about 33 to 100 frames per second, and the frame shift is usually 1/3 to 2/3 of the frame length. In this embodiment the speech signal is framed with a frame length of 256 and a frame shift of 128, both measured in samples.
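Overlapping framing with a frame length of 256 samples and a frame shift of 128 samples can be sketched as follows. The function name `split_frames` is hypothetical, and dropping a trailing partial frame (rather than zero-padding it) is an assumption; the embodiment does not say how the tail of the signal is handled.

```python
def split_frames(x, frame_len=256, frame_shift=128):
    """Split samples into overlapping frames.

    Frame k starts at sample k * frame_shift and holds frame_len samples.
    Assumption: samples that do not fill a whole final frame are dropped.
    """
    if len(x) < frame_len:
        return []
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    return [x[k * frame_shift:k * frame_shift + frame_len]
            for k in range(n_frames)]


if __name__ == "__main__":
    frames = split_frames(list(range(640)))
    # Each frame has 256 samples; consecutive frames start 128 samples apart.
    print(len(frames), frames[0][0], frames[1][0])
```

With these values each frame shares its second half with the first half of the next frame.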

After the place-name speech signal has been pre-processed, endpoint detection can be carried out as shown in Fig. 1. The specific steps are as follows:

In the initial state, the state variable status = 0, the speech-length counter count = 0, the variable slience1, which counts the length of the possible speech segment while entry into speech is not yet confirmed, is 0, and the silence length slience = 0.

S1. Receive the pre-processed place-name speech signal; for each frame, compare its energy amp[i] with the lowest energy threshold amp2 and the highest energy threshold amp1, and compare its zero-crossing rate zcr[i] with the zero-crossing-rate threshold zcr. Here the lowest energy threshold amp2 is 0.01, the highest energy threshold amp1 is 0.1, and the zero-crossing-rate threshold zcr is 100.
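The embodiment does not give formulas for amp[i] and zcr[i]. As an illustration only, the sketch below assumes that the frame energy is the sum of squared samples and that the zero-crossing rate is the number of sign changes within the frame (a count, which is the reading under which a threshold of 100 is meaningful for 256-sample frames). Both definitions and both function names are assumptions.

```python
def short_time_energy(frame):
    """Frame energy, assumed here to be the sum of squared samples."""
    return sum(s * s for s in frame)


def zero_crossings(frame):
    """Number of sign changes between adjacent samples in the frame.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


if __name__ == "__main__":
    frame = [0.5, -0.5, 0.5, -0.5]
    print(short_time_energy(frame), zero_crossings(frame))
```

Applying these two functions to every frame gives the sequences amp and zcr that the steps compare against amp1, amp2 and the zero-crossing-rate threshold.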

The thresholds amp1, amp2 and zcr are set for a speech signal that has been normalized. Suppose the speech signal is x = [x_1, x_2, ..., x_n]. After normalization, all values of x lie in [-1, 1], and the thresholds given here are set on that basis.
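The normalization formula itself is not reproduced in this text. One common normalization that yields values in [-1, 1] is division by the peak absolute value; the sketch below assumes that form, so both the formula and the name `normalize_peak` are assumptions rather than the embodiment's stated method.

```python
def normalize_peak(x):
    """Scale samples so that the largest absolute value becomes 1.

    Assumption: peak normalization, x_i / max_j |x_j|. An all-zero or
    empty signal is returned unchanged to avoid dividing by zero.
    """
    peak = max((abs(s) for s in x), default=0.0)
    if peak == 0:
        return list(x)
    return [s / peak for s in x]


if __name__ == "__main__":
    print(normalize_peak([2.0, -4.0, 1.0]))
```

Any scaling that maps the signal into [-1, 1] would make fixed thresholds such as 0.01 and 0.1 comparable across recordings; peak scaling is only one such choice.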

The procedure examines each frame of the speech signal in turn and sets the state variable status according to the result for that frame, which determines how the next frame is to be judged.

S2. When the energy of the i-th frame amp[i] < the lowest energy threshold amp2, set the state variable status to 0 and the speech-length counter count to 0, indicating that the signal is still in a silent segment, and return to step S1 to examine the next frame;

S3. When the highest energy threshold amp1 > the energy of the i-th frame amp[i] > the lowest energy threshold amp2, and the zero-crossing rate zcr[i] > the zero-crossing-rate threshold zcr, set status to 1, indicating that the signal may be in a speech segment; add 1 to count, add 1 to the possible-speech-length variable slience1, and return to step S1 to examine the next frame.

S4. If status = 1 has already been entered and the energy of the next frame is below the lowest energy threshold amp2, determine whether slience1 > 6. If so, the current stretch is noise: discard the preceding tentative speech, set status = 0, count = 0 and slience1 = 0, and return to step S1 to examine the next frame. If not, the signal may still be in a speech segment: keep status = 1, add 1 to count and to slience1, and return to step S1 to examine the next frame.
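The decision in step S4 can be written as a small pure function. This is a sketch under the step's stated rule (limit 6); the function name and the tuple return convention are hypothetical.

```python
def tentative_low_energy_step(count, slience1, limit=6):
    """Handle a low-energy frame while status == 1 (step S4).

    Returns the new (status, count, slience1).
    If the tentative stretch is already longer than `limit` frames it is
    treated as noise and everything is reset; otherwise the frame is kept
    as part of a possible speech segment.
    """
    if slience1 > limit:
        return 0, 0, 0
    return 1, count + 1, slience1 + 1


if __name__ == "__main__":
    print(tentative_low_energy_step(count=2, slience1=2))  # frame kept
    print(tentative_low_energy_step(count=7, slience1=7))  # reset as noise
```

Keeping the low-energy frame instead of resetting immediately is what lets a short, soft leading syllable survive until a high-energy frame confirms the speech segment.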

S5. When the energy of the i-th frame amp[i] > the highest energy threshold amp1, set status to 2, indicating that a speech segment has been entered; add 1 to count and examine the next frame according to step S6.

S6. Determine whether the energy of the current frame amp[i] > the lowest energy threshold amp2, or the zero-crossing rate of the current frame zcr[i] > the zero-crossing-rate threshold zcr.

If so, the signal is still in the speech segment and is not silent: keep status at 2, add 1 to count, and examine the next frame according to step S6.

If not, the signal is passing from the speech segment into a silent segment: add 1 to the silence length slience, which is used later to decide whether the speech has ended, and perform step S9.

S9. Determine whether the silence length slience < the maximum silence length maxslience, where maxslience = 10;

If so, the signal may still be in a speech segment: speech has already occurred, and the current silent stretch has not yet reached the maximum silence length, so the speech may not have ended and more signal may follow. Keep status at 2, add 1 to count, and examine the next frame according to step S6.

If not, determine whether the speech-length counter count < the minimum speech length minlen, where minlen = 5. If count < minlen holds, everything detected so far is noise, because a normal speech signal should be longer than the minimum speech length minlen and anything shorter is judged to be noise: set status to 0, slience to 0 and count to 0, and continue detection. If count < minlen does not hold, the speech segment has been found and is regarded as valid speech, so the whole process can end: set status to 3 and end the process.

In this embodiment all range tests are expressed as greater than or less than; equality is not mentioned and may be grouped with the greater-than case.

The above embodiment is a preferred implementation of the invention, but implementations of the invention are not limited by it. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the invention is an equivalent replacement and falls within the scope of protection of the invention.
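Steps S1 to S9 can be read as a single state machine over the per-frame sequences amp and zcr. The sketch below is one possible reading, not a definitive implementation. Several points are assumptions because the steps do not settle them: a frame in state 0 or 1 that matches neither S3 nor S5 is treated like a low-energy frame; slience1 is also cleared when S9 resets; the returned pair is the first and last frame index of the count frames ending just before the terminating frame (trailing silence is not trimmed); and a signal that ends while still in state 2 is accepted if it is long enough. The function and parameter names other than amp1, amp2, maxslience and minlen are hypothetical.

```python
def detect_endpoints(amp, zcr, amp1=0.1, amp2=0.01, zcr_thr=100,
                     tentative_limit=6, maxslience=10, minlen=5):
    """Return (first_frame, last_frame) of the detected speech, or None."""
    status = count = slience1 = slience = 0
    for i, (e, z) in enumerate(zip(amp, zcr)):
        if status in (0, 1):
            if e > amp1:                          # S5: confirmed speech
                status = 2
                count += 1
            elif e > amp2 and z > zcr_thr:        # S3: possible speech
                status = 1
                count += 1
                slience1 += 1
            elif status == 1 and slience1 <= tentative_limit:
                count += 1                        # S4: keep tentative frame
                slience1 += 1
            else:                                 # S2, or S4 judged as noise
                status = count = slience1 = 0
        else:                                     # status == 2
            if e > amp2 or z > zcr_thr:           # S6: still in speech
                count += 1
            else:
                slience += 1
                if slience < maxslience:          # S9: pause may be internal
                    count += 1
                elif count < minlen:              # S9: too short, noise
                    status = count = slience = slience1 = 0
                else:                             # S9: speech found (status 3)
                    return i - count, i - 1
    if status == 2 and count >= minlen:           # assumption: signal ran out
        return len(amp) - count, len(amp) - 1
    return None


if __name__ == "__main__":
    # 3 silent frames, 2 weak frames, 1 quiet frame, 8 loud frames, silence.
    amp = [0.0] * 3 + [0.05] * 2 + [0.0] + [0.5] * 8 + [0.0] * 12
    zcr = [0] * 3 + [120] * 2 + [0] + [50] * 8 + [0] * 12
    print(detect_endpoints(amp, zcr))
```

In the example input the two weak frames at indices 3 and 4 are followed by one quiet frame before the loud segment begins; the detected segment starts at frame 3, so the soft onset is retained rather than cut off.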

Claims

1. A dual-threshold speech endpoint detection method for place names, characterized by comprising the following: starting from the first frame, the energy of each speech frame is compared with the lowest energy threshold and the highest energy threshold, and its zero-crossing rate is compared with the zero-crossing-rate threshold, which determines the appropriate way to examine the next frame; and when the signal may be entering the speech state, an additional variable is used to retain the softly pronounced portion that occurs before the speech segment.

2. The dual-threshold speech endpoint detection method for place names according to claim 1, characterized in that the specific steps are as follows:

(1) Receive the pre-processed place-name speech signal; for each frame, compare its energy with the lowest energy threshold and the highest energy threshold, and compare its zero-crossing rate with the zero-crossing-rate threshold;

(2) When the energy of the i-th frame < the lowest energy threshold, set the state variable to 0 and the speech-length counter to 0, indicating that the signal is still in a silent segment, and return to step (1) to examine the next frame;

When the highest energy threshold > the energy of the i-th frame > the lowest energy threshold, and the zero-crossing rate > the zero-crossing-rate threshold, set the state variable to 1, indicating that the signal may be in a speech segment; add 1 to the speech-length counter, add 1 to the variable that records the length of the possible speech segment, and return to step (1) to examine the next frame;

(3) If the state variable is already 1, screen the frames that may belong to a speech segment according to a given criterion, to further distinguish noise from speech;

(4) When the energy of the i-th frame > the highest energy threshold, set the state variable to 2, indicating that a speech segment has been entered; add 1 to the speech-length counter and examine the next frame according to step (5);

(5) Determine whether the energy of the current frame > the lowest energy threshold, or the zero-crossing rate of the current frame > the zero-crossing-rate threshold;

If so, the signal is still in the speech segment and is not silent: keep the state variable at 2, add 1 to the speech-length counter, and examine the next frame according to step (5);

If not, the signal is passing from the speech segment into a silent segment: add 1 to the silence length and examine the silence length further; once all valid speech has been found, set the state parameter to 3 and end the process.

3. The dual-threshold speech endpoint detection method for place names according to claim 2, characterized in that in step (3), if the state variable is already 1 and the energy of the speech frame is below the lowest energy threshold, it is determined whether the variable recording the length of the possible speech segment exceeds a given threshold; if so, the current stretch is noise: the preceding tentative speech is discarded, the state variable, the speech-length counter and the possible-speech-length variable are set to 0, and the method returns to step (1) to examine the next frame; if not, the signal may still be in a speech segment: the state variable is kept at 1, 1 is added to the speech-length counter and to the possible-speech-length variable, and the method returns to step (1) to examine the next frame.

4. The dual-threshold speech endpoint detection method for place names according to claim 3, characterized in that the given threshold is equal to 6.

5. The dual-threshold speech endpoint detection method for place names according to claim 2, characterized in that in step (5) the silence length is examined further as follows: determine whether the silence length < the maximum silence length;

If so, keep the state variable at 2, add 1 to the speech-length counter, and examine the next frame according to step (5);

If not, determine whether the speech-length counter < the minimum speech length; if it is, everything detected so far is noise: set the state variable, the silence length and the speech-length counter to 0, and continue detection; if it is not, the speech segment has been found and is regarded as valid speech: set the state parameter to 3 and end the process.

6. The dual-threshold speech endpoint detection method for place names according to claim 2, characterized in that in the initial state the state variable is 0, the speech-length counter is 0, the variable that counts the length of the possible speech segment while entry into speech is not yet confirmed is 0, and the silence length is 0.

7. The dual-threshold speech endpoint detection method for place names according to claim 2, characterized in that the lowest energy threshold is 0.01, the highest energy threshold is 0.1, and the zero-crossing-rate threshold is 100.

8. The dual-threshold speech endpoint detection method for place names according to claim 5, characterized in that the maximum silence length is 10 and the minimum speech length is 5.

9. The dual-threshold speech endpoint detection method for place names according to claim 2, characterized in that the pre-processing includes pre-emphasis and framing.

10. The dual-threshold speech endpoint detection method for place names according to claim 9, characterized in that pre-emphasis is implemented by a digital filter that boosts high frequencies at 6 dB/octave, this high-pass filter satisfying H(z) = 1 - μz^-1 with μ = 0.97; and the speech signal is divided into frames with a frame length of 256 and a frame shift of 128.