CN104091593A - Speech endpoint detection algorithm adopting perceptual speech spectrum structure boundary parameters - Google Patents

Publication number
CN104091593A
Authority
CN
China
Prior art keywords
voice
noise
parameter
spectrum
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410175090.8A
Other languages
Chinese (zh)
Other versions
CN104091593B (en)
Inventor
吴迪
赵鹤鸣
陶智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Cheng Bang Energy Conservation Science & Technology Co Ltd
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201410175090.8A
Publication of CN104091593A
Application granted
Publication of CN104091593B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Circuit For Audible Band Transducer (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the field of speech recognition and discloses a speech endpoint detection algorithm that adopts a perceptual spectrogram structure boundary (PSSB) parameter. After speech enhancement based on auditory perception characteristics is applied to the noisy speech, the time-frequency spectrogram of the enhanced speech undergoes two-dimensional enhancement that exploits the difference between the continuous distribution of the speech signal and the random distribution of the residual noise, so that the spectrogram structure of the continuously distributed clean speech is further highlighted. The PSSB parameter is then derived by two-dimensional boundary detection on the enhanced spectrogram structure and used for endpoint detection. Experimental results show that, in white noise at signal-to-noise ratios from -10 dB to 10 dB, the endpoint detection algorithm using the PSSB parameter detects speech endpoints more effectively. At the very low signal-to-noise ratio of -10 dB, the proposed method still achieves 75.2% accuracy.

Description

Speech endpoint detection algorithm adopting a perceptual spectrogram structure boundary parameter
Technical field
The invention belongs to the field of speech recognition and relates to a speech endpoint detection algorithm, in particular to a speech endpoint detection algorithm that adopts a perceptual spectrogram structure boundary parameter.
Background
Endpoint detection is the basis of speech recognition and speaker recognition. Accurate and effective endpoint detection can greatly improve the recognition rate of both kinds of system. In a high signal-to-noise-ratio (SNR) laboratory environment, traditional endpoint detection algorithms detect speech endpoints well. In a low-SNR environment, however, the performance of most endpoint detection algorithms declines sharply.
In recent years, many researchers have studied noise-robust endpoint detection. Ganapathiraju et al. (A. Ganapathiraju, et al. Comparison of Energy-Based Endpoint Detectors for Speech Signal Processing. In Proc. IEEE Publications, 1996; 500-503) combined short-time energy with the short-time zero-crossing rate (Energy and Zero-Crossing Rate, EZCR) for endpoint detection. This method is more robust than the traditional energy method, but it no longer works at lower SNRs. Chen Zhenbiao et al. (Chen Zhenbiao, Xu Bo. Research on an optimized speech endpoint detection algorithm based on sub-band energy features. Acta Acustica, 2005; 30 (2): 171-176) studied sub-band amplitude (Sub-Band Amplitude, SBA) and energy according to the frequency-domain energy distribution of speech, and combined several sub-band energies that are more discriminative and noise-robust with the optimal edge detection commonly used in image processing, which clearly improved endpoint detection under complex noise. Zhang et al. (Xueying Zhang, et al. A Speech Endpoint Detection Method Based on Wavelet Coefficient Variance and Sub-Band Amplitude Variance. In Proc. IEEE ICICIC, 2006; 105-109) proposed a wavelet coefficient (Wavelet Coefficient, WC) method that performs endpoint detection by wavelet analysis. Because the method analyses the signal at each scale, it can distinguish speech segments from noise segments to some extent. Wu et al. (Bing-Fei Wu, Kun-Ching Wang. Robust Endpoint Detection Algorithm Based on the Adaptive Band-Partitioning Spectral Entropy in Adverse Environments. IEEE Transactions on Speech and Audio Processing, 2005; 13 (5): 762-775) applied adaptive band-partitioning spectral entropy (ABSE) to endpoint detection. The method separates the sub-band signals of speech from noise well and achieves good endpoint detection accuracy in noisy environments. Li et al. (Q. Li, et al. A Robust real-time endpoint detector with energy normalization for ASR in adverse environments. International Conference on Acoustics Speech and Signal Processing, 2001; 574-577) borrowed optimal edge detection from image processing for speech endpoint detection, using a filter together with three-state decision logic, so that the threshold need not be adjusted for different SNRs. Incorporating an image processing algorithm in this way helps endpoint detection considerably. However, none of the above methods achieves high endpoint detection accuracy in low-SNR environments.
Summary of the invention
Technical problem to be solved: in low-SNR environments, conventional endpoint detection methods have very low accuracy.
Technical solution: based on the different characteristics of speech and noise signals in the two-dimensional time-frequency space at low SNR, and in combination with a speech enhancement algorithm based on auditory perception characteristics, the perceptual spectrogram structure boundary parameter PSSB (Perception Spectrogram Structure Boundary) is proposed and used for endpoint detection. First, speech enhancement based on auditory masking is applied to the low-SNR speech. Compared with traditional speech enhancement algorithms, this method better preserves the speech components perceivable by the human ear. On this basis, the noisy speech is enhanced in two dimensions, taking into account the continuous distribution of the clean-speech spectrogram along the time axis, so that the spectrogram structure of speech is further highlighted while that of noise is suppressed. Finally, the two-dimensional boundary of the continuously distributed clean-speech spectrogram structure is found, and the PSSB parameter is proposed for endpoint detection.
1. Speech enhancement based on auditory perception characteristics
In low-SNR environments, most endpoint detection algorithms cannot detect speech endpoints well and may fail completely, yet humans can identify speech segments even in fairly strong noise. The auditory perception characteristics of the human ear play an important role in noisy environments. Exploiting the auditory masking property of human auditory perception suppresses noise to some extent while retaining more of the speech components. The PSSB parameter proposed by the invention therefore first applies speech enhancement based on auditory masking, suppressing noise as far as possible while protecting the speech. The most important step of this enhancement method is computing the masking threshold. The masking threshold calculation and the speech enhancement system are as follows:
(1) Bark-domain power spectrum
The speech signal $x(n)$ is transformed by the fast Fourier transform (FFT) into a frequency-domain signal, whose power spectrum $P(k)$ is:
(1)
The Bark power spectrum is:
$$B_i = \sum_{k=b_{li}}^{b_{hi}} P(k) \tag{2}$$
where $B_i$ is the energy of the $i$-th Bark band, $b_{li}$ is the lowest frequency of the $i$-th band, and $b_{hi}$ is the highest frequency of the $i$-th band.
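As a minimal sketch of equation (2), the band energies can be computed by summing FFT power over each Bark band. Equation (1) is not legible in this text, so the code assumes the usual definition $P(k) = |X(k)|^2$. The function name `bark_band_power` and the band edges are hypothetical and chosen only for illustration.

```python
import numpy as np

def bark_band_power(frame, band_edges):
    """Sum the FFT power spectrum over each Bark band, following Eq. (2).

    band_edges holds (b_low, b_high) FFT-bin index pairs, both inclusive.
    The edges are supplied by the caller; none are given in the source.
    """
    spectrum = np.fft.rfft(frame)
    power = np.abs(spectrum) ** 2  # assumed form of Eq. (1): P(k) = |X(k)|^2
    return np.array([power[lo:hi + 1].sum() for lo, hi in band_edges])

# Hypothetical band edges for a 256-sample frame (illustrative only).
example_edges = [(0, 3), (4, 7), (8, 128)]
example_bands = bark_band_power(np.ones(256), example_edges)
```

A constant frame puts all of its power in bin 0, so only the first band of `example_bands` is non-zero.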
(2) Spread Bark-domain power spectrum
A spreading function $S_{ij}$ is introduced. It is a matrix that satisfies the condition:
(3)
and is defined as follows:
(4)
where the argument of the spreading function is the difference between the Bark band numbers of the two bands.
$$C_i = \sum_{j=1}^{j_{\max}} S_{ij} \cdot B_i, \quad i = 1, 2, \ldots, i_{\max} \tag{5}$$
(3) Offset function of the masking energy and calculation of the masking threshold
(6)
$$T_i = 10^{\log_{10}(C_i) - (O_i/10)} \tag{7}$$
The coefficient in (6) takes a value between 0 and 1 that is determined by the speech content. $T_i$ is the masking threshold of the $i$-th Bark band. It is subsequently re-indexed by $b$, where $b$ has the same meaning as $i$ above.
The threshold is compared with the absolute threshold of hearing (the threshold in quiet):
(8)
and the larger of the two is taken as the final masking threshold, where the Bark-indexed threshold is the corresponding Bark masking curve.
(4) Spectral subtraction and adjustment of the subtraction parameters
The gain function used by the spectral subtraction algorithm is:
$$H(k) = \begin{cases} \left(1 - \alpha \cdot \left[\dfrac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \left[\dfrac{|D(k)|}{|Y(k)|}\right]^{\gamma} < \dfrac{1}{\alpha + \beta} \\ \left(\beta \cdot \left[\dfrac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \text{else} \end{cases} \tag{9}$$
First, the noise masking threshold in each Bark band is computed for every speech frame. The adaptive subtraction parameters $\alpha$ and $\beta$ are then obtained from the masking threshold. If the masking threshold is high, the residual noise is naturally masked and inaudible to the human ear, so the subtraction parameters take their minimum values. If the masking threshold is low, the residual noise strongly affects the ear and must be reduced. For each frame $m$, the minimum of the masking threshold corresponds to the maximum of the subtraction parameters of that frame. The subtraction parameters are applied according to the following relation:
(10)
where $\alpha_{\min}$, $\alpha_{\max}$ and $\beta_{\min}$, $\beta_{\max}$ are the minimum and maximum values of the parameters $\alpha$ and $\beta$, respectively. When the masking threshold is at its minimum, the parameters take their maximum values. When it is at its maximum, they take their minimum values. The minimum and maximum of the masking threshold in the relation are obtained frame by frame. In the experiments, the parameters were set as follows:
(5) Real-time noise power spectrum estimation
Speech enhancement requires a noise spectrum estimation method with very good real-time performance. A noise power spectrum estimation method based on constrained-variance spectral smoothing and minimum tracking is adopted. The core of the algorithm is a constrained-variance smoothing filter, which controls the variance of the smoothed short-time power spectrum and makes minimum tracking more accurate. The noise spectrum estimated in this way follows sudden noise changes promptly, introduces no obvious delay in the noise spectrum, and is more accurate than the estimates of other methods.
(6) Speech enhancement system
The adaptive subtraction parameters are obtained from the masking threshold. The speech enhancement system is shown in Fig. 1.
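The gain function of equation (9) can be sketched as below. This is only an illustration: the function name and the values of `alpha`, `beta` and `gamma` in the example are hypothetical, since the experimental settings are not legible in this text, and the adaptive choice of the subtraction parameters from the masking threshold in (10) is not reproduced.

```python
import numpy as np

def spectral_subtraction_gain(noise_mag, noisy_mag, alpha, beta, gamma=2.0):
    """Piecewise gain H(k) of Eq. (9) for magnitude spectra |D(k)| and |Y(k)|."""
    noise_mag = np.asarray(noise_mag, dtype=float)
    noisy_mag = np.asarray(noisy_mag, dtype=float)
    # [|D(k)| / |Y(k)|]^gamma, guarded against division by zero.
    ratio = (noise_mag / np.maximum(noisy_mag, 1e-12)) ** gamma
    # First branch of Eq. (9). The clamp only protects the bins that end up
    # in the second branch, where 1 - alpha * ratio could be negative.
    subtract = np.maximum(1.0 - alpha * ratio, 0.0) ** (1.0 / gamma)
    # Second branch: spectral floor controlled by beta.
    floor = (beta * ratio) ** (1.0 / gamma)
    return np.where(ratio < 1.0 / (alpha + beta), subtract, floor)

# Hypothetical parameter values, for illustration only.
example_gain = spectral_subtraction_gain([1.0, 1.0], [2.0, 1.0], alpha=2.0, beta=0.01)
```

In the example, the first bin has enough speech energy to fall in the subtraction branch, while the second bin is noise-dominated and receives the floor gain.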
2. Two-dimensional enhancement of speech
After low-SNR speech passes through speech enhancement, spectral subtraction attenuates noise and speech at the same time. However, voiced segments contain high-energy structures such as formants, so in the two-dimensional time-frequency domain the low-frequency region of the speech spectrogram retains a relatively high SNR even in noise. These high-energy speech structures are usually distributed continuously in time. Therefore, by finding these continuously distributed high-energy regions in the two-dimensional spectrogram, and from them the adjoining unvoiced segments, the start and end points of the speech can be obtained. In this method, boundary detection is the algorithm used to find continuously distributed two-dimensional structures.
However, whether or not the low-SNR speech has been enhanced, noise (musical residual noise after enhancement) leaves boundaries of noise spectrogram structures in boundary detection. The spectrogram structure of clean speech is disturbed and confused by that of noise, which greatly interferes with locating the clean-speech structure, as shown in Figs. 2 and 3.
Fig. 2 is the spectrogram of speech with white noise at -5 dB. The continuously distributed black horizontal stripes are the speech signal (in the high band, the lower-energy speech is masked by noise, so the formant structure of the high-frequency region is not visible in the spectrogram), and the black flake-like background is white noise. Fig. 3 is the spectrogram after speech enhancement. The noise is greatly weakened, but musical noise of varying strength remains. The invention divides this residual noise into stronger-energy residual noise and weaker-energy residual noise, as in Fig. 3. Both greatly interfere with locating the speech endpoints. Therefore, before the endpoints are determined, the invention applies two-dimensional enhancement to the speech based on the difference between the spectrogram structures of residual noise and clean speech. The enhancement comprises a two-dimensional noise erosion algorithm and a two-dimensional speech dilation algorithm.
Two-dimensional noise erosion algorithm
Among enhancement algorithms for two-dimensional data, erosion can weaken or eliminate specific two-dimensional structures. In the spectrogram after speech enhancement, the weaker-energy residual noise (the dim flake-like structures) is usually randomly distributed, as shown in Fig. 3, and has small size and energy. Although these structures are not as strong as the white noise in Fig. 3, they still interfere with finding the spectrogram structure boundary of clean speech. To address this, the invention proposes a two-dimensional noise erosion algorithm to weaken such two-dimensional structures.
The two-dimensional noise erosion of the speech spectrogram is determined as follows. First, the short-time Fourier transform is applied to the speech, and the spectrum of each frame is computed by:
(11)
The formula involves the $m$-th frame of the speech signal, the spectrum of the $m$-th frame, the frame length $N$ (which is also the number of points of the short-time Fourier transform), and a Hamming window. The power spectrum of each frame can be expressed as:
(12)
which is defined as the spectrogram of the speech signal.
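The framing and power-spectrum steps can be sketched as follows. Equations (11) and (12) are not legible in this text, so the code assumes a standard short-time Fourier transform with a Hamming window. The defaults of 256 samples per frame and a shift of 128 are the values reported in the experiments, and the function name is hypothetical.

```python
import numpy as np

def power_spectrogram(x, frame_len=256, hop=128):
    """Hamming-windowed short-time power spectrum, one row per frame."""
    x = np.asarray(x, dtype=float)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Cut overlapping frames and apply the window to each of them.
    frames = np.stack([x[m * hop:m * hop + frame_len] * window
                       for m in range(n_frames)])
    spectra = np.fft.rfft(frames, n=frame_len, axis=1)
    return np.abs(spectra) ** 2  # shape: (frames, frequency bins)

# One second of a 1 kHz tone sampled at 16 kHz as a quick check.
t = np.arange(16000) / 16000.0
example_spec = power_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
```

In this sketch time runs along axis 0 and frequency along axis 1, so the tone shows up as a single bright column of the array.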
The two-dimensional noise erosion of the spectrogram is defined as:
(13)
The formula involves a structuring element, the domain of the spectrogram, and the domain of the structuring element. The translated coordinates must lie within the domain of the spectrogram, and the offsets must lie within the domain of the structuring element. Two-dimensional noise erosion has a double effect on the signal: (1) if all elements of the structuring element are positive, the output tends to be weaker than the original signal; (2) noise spectrogram structures in the input that resemble the structuring element are weakened, to a degree that depends on the shape of the noise structure and the shape of the structuring element.
Within the spectrogram structure of speech, erosion attenuates noise and speech simultaneously. The aim of the proposed two-dimensional noise erosion algorithm is to attenuate noise relatively more while retaining speech better. To match the structural form of the weaker-energy residual noise, the structuring element of the two-dimensional noise erosion algorithm is defined as:
(14)
This structuring element is close to the spectrogram structure of the weaker-energy residual noise (small points). Eroding the spectrogram with it therefore weakens this noise to some extent.
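The erosion step can be illustrated with the textbook greyscale erosion, which takes the minimum of the signal minus the structuring element over the element's support. Equations (13) and (14) are not legible in this text, so this definition and the flat 3x3 element below are assumptions rather than the patent's exact formulas, and `grey_erode` is a hypothetical name.

```python
import numpy as np

def grey_erode(spec, element):
    """Greyscale erosion: minimum of (spec - element) over the element support."""
    spec = np.asarray(spec, dtype=float)
    rows, cols = element.shape
    pr, pc = rows // 2, cols // 2
    # Replicate the border so the output keeps the input shape.
    padded = np.pad(spec, ((pr, pr), (pc, pc)), mode="edge")
    out = np.full(spec.shape, np.inf)
    for s in range(rows):
        for t in range(cols):
            shifted = padded[s:s + spec.shape[0], t:t + spec.shape[1]]
            out = np.minimum(out, shifted - element[s, t])
    return out

# Hypothetical flat 3x3 element: an isolated point is removed entirely.
speck = np.zeros((5, 5))
speck[2, 2] = 1.0
eroded_speck = grey_erode(speck, np.zeros((3, 3)))
```

With this element, a structure survives erosion only where it fills the whole 3x3 neighbourhood, which is why small isolated points are suppressed more than extended structures.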
Two-dimensional speech dilation algorithm
After the two-dimensional noise erosion, the weaker-energy residual noise is well suppressed. However, the stronger-energy residual noise (as in Fig. 3) is close to clean speech in energy, and excessive erosion would also weaken the two-dimensional structure of clean speech. Dilation strengthens spectrogram structures that are similar to the structuring element and relatively weakens dissimilar ones. The invention therefore proposes a two-dimensional speech dilation algorithm that exploits the difference between the stronger-energy residual noise and the structure of clean speech. The structuring element is defined to resemble continuously distributed clean speech, so that this noise structure is relatively suppressed.
For the result of the two-dimensional noise erosion, the two-dimensional speech dilation is defined by:
(15)
The formula involves a structuring element, the domain of the eroded spectrogram, and the domain of the structuring element. In theory, the structuring element can be thought of as being translated over every position of the spectrogram. At each position its values are added to the values of the two-dimensional signal and the maximum is taken. Dilation has a double effect on the speech signal: (1) if all elements of the structuring element are positive, the output tends to be stronger than the original signal; (2) whether a given structure in the input spectrogram is relatively strengthened depends on the values and shape of the structuring element used for dilation.
While strengthening speech structure, dilation also strengthens the corresponding noise structure. The aim of the proposed two-dimensional speech dilation algorithm is to strengthen speech structure as much as possible while relatively suppressing noise structure. The spectrogram structure of voiced clean speech is usually a strip stretched along the time axis, whereas that of the stronger-energy residual noise is usually a square or circle of varying size, as shown in Fig. 3. The structuring element is therefore defined as an elongated shape stretched along the time axis, which strengthens all similar structures and relatively weakens noise structures of different shape.
The structuring element of the two-dimensional speech dilation algorithm is thus defined as the following shape:
(16)
This is a horizontal structuring element stretched along the time direction. All structures similar to it are strengthened. Because the spectrogram structure of clean speech is usually continuous in time, it resembles this element and is strengthened, whereas the spectrogram structure of the stronger-energy residual noise, usually large dots or square points, is relatively weakened.
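The dilation step described above (add the structuring element to the signal at every offset and take the maximum) can be sketched as follows. Equations (15) and (16) are not legible in this text, so the element length of 5 frames and the flat values are hypothetical. Time is assumed to run along axis 0, and the sketch assumes a symmetric element, for which the reflection used in the formal definition makes no difference.

```python
import numpy as np

def grey_dilate(spec, element):
    """Greyscale dilation: maximum of (spec + element) over the element support."""
    spec = np.asarray(spec, dtype=float)
    rows, cols = element.shape
    pr, pc = rows // 2, cols // 2
    # Replicate the border so the output keeps the input shape.
    padded = np.pad(spec, ((pr, pr), (pc, pc)), mode="edge")
    out = np.full(spec.shape, -np.inf)
    for s in range(rows):
        for t in range(cols):
            shifted = padded[s:s + spec.shape[0], t:t + spec.shape[1]]
            out = np.maximum(out, shifted + element[s, t])
    return out

# Hypothetical flat element spanning 5 frames in time and 1 frequency bin.
time_element = np.zeros((5, 1))
point = np.zeros((7, 5))
point[3, 2] = 1.0
stretched = grey_dilate(point, time_element)
```

The single point is stretched along the time axis only, which mirrors the intent of favouring strip-like structures over compact ones.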
3. Perceptual spectrogram structure boundary (PSSB) parameter and endpoint detection algorithm
3.1 Perceptual spectrogram structure boundary (PSSB) parameter
The invention enhances the noisy speech in two dimensions, taking into account the continuous distribution of the clean-speech spectrogram along the time axis, so that the spectrogram structure of speech is further highlighted while that of noise is suppressed. The invention then finds the boundary of the continuously distributed clean-speech spectrogram structure and proposes the perceptual spectrogram structure boundary parameter PSSB for endpoint detection.
To obtain the PSSB parameter, the boundary information of the spectrogram structure is solved first. Boundary detection is an important method for finding the boundary of a two-dimensional structure. The boundary of a two-dimensional signal can be represented by a gradient determined from continuous first derivatives. The invention uses the neighbourhood model in (17) to approximate the gradient of the two-dimensionally enhanced speech.
(17)
The model is built around its central point, and the gradient of the central neighbourhood can be expressed as:
(18)
whose two components are determined by (19) and (20):
(19)
(20)
The result is the boundary, which describes the boundary information of the continuously distributed speech signal in the noisy speech spectrogram.
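Equations (17) to (20) are not legible in this text. As a stand-in, the sketch below uses the Sobel operator, one common way to approximate a first-derivative gradient over a 3x3 neighbourhood. Whether the patent uses Sobel weights is an assumption, and `sobel_boundary` is a hypothetical name.

```python
import numpy as np

def sobel_boundary(spec):
    """Gradient magnitude from 3x3 Sobel differences (assumed neighbourhood model)."""
    p = np.pad(np.asarray(spec, dtype=float), 1, mode="edge")
    # Weighted difference between the right and left columns of each neighbourhood.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    # Weighted difference between the bottom and top rows of each neighbourhood.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.sqrt(gx ** 2 + gy ** 2)

# A step between columns 2 and 3 produces a boundary on both sides of the step.
step = np.zeros((4, 6))
step[:, 3:] = 1.0
example_boundary = sobel_boundary(step)
```

Flat regions give a zero response, so only the edges of continuous structures remain in the boundary map.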
Analysis of the boundary and of the speech spectrogram shows that, at low SNR, the signal and spectrogram features in the high-frequency region of speech are masked by noise, whereas in the low-frequency region the spectrogram structure of voiced speech still has very high energy relative to the noise and yields a solvable spectrogram boundary. The lower the frequency, the more pronounced this is, because the energy of voiced speech is concentrated in the first few formants at low and middle frequencies. Therefore, after the spectrogram boundary has been obtained, a weighted sum of the boundary is taken along the frequency axis of each frame, with higher weights for the low-frequency region, to obtain the perceptual spectrogram structure boundary parameter PSSB.
The proposed PSSB parameter is given by:
(21)
where the left-hand side is the PSSB parameter of the $m$-th frame and $M$ is the total number of frames.
The PSSB parameter reflects well the relative content of voiced speech in a frame and is robust to noise.
3.2 Speech endpoint detection
In speech, voiced segments usually have a long continuous duration, whereas unvoiced segments occur in two patterns: (1) in the middle of a speech segment; (2) at the start of a speech segment.
Experiments show that unvoiced sounds in the middle of a speech segment are reliably identified as speech (PSSB parameter greater than the threshold 0.5). This is because unvoiced sounds inside a word are usually short and the invention uses frames with 50% overlap. The spectrogram analysis thus combines the unvoiced sound inside a word with the adjacent voiced sound, so the unvoiced frame carries information from the neighbouring voiced frames.
However, as the SNR decreases, particularly below 0 dB, the PSSB of unvoiced sounds at the start of a speech segment becomes less distinctive (smaller values). Dividing endpoints with a single fixed threshold would sharply degrade the detection of unvoiced sounds. Although the PSSB of unvoiced sounds is relatively small, it usually remains distinguishable (small but non-zero). The invention therefore uses a detection method based on the continuity of speech, treating voiced segments and the unvoiced segments at the endpoints differently. The endpoint detection method is as follows:
(1) First, detect a segment in which the PSSB parameter exceeds threshold a for m consecutive frames. This segment is the detected voiced segment.
(2) Starting from this segment, all adjoining frames that remain continuously greater than or equal to threshold b are included in the speech segment. Threshold b is small. In the experiments, values of b from 0.01 to 0.05 all gave good recognition results. This allows unvoiced segments with small PSSB values to be identified.
(3) The start and end of this speech segment are the speech endpoints.
In experimental tests with white noise, the system performs well with a = 0.5, b = 0.01 and m = 20.
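The three steps above can be sketched as follows, taking a per-frame PSSB sequence as input. The function name is hypothetical, and merging two voiced cores that grow into the same segment is an implementation choice the text does not spell out. The defaults are the values reported for white noise.

```python
import numpy as np

def detect_endpoints(pssb, a=0.5, b=0.01, m=20):
    """Return (start_frame, end_frame) pairs found by the two-threshold rule."""
    pssb = np.asarray(pssb, dtype=float)
    n = len(pssb)
    segments = []
    i = 0
    while i < n:
        if pssb[i] <= a:
            i += 1
            continue
        # Step 1: find a run of frames above threshold a.
        j = i
        while j < n and pssb[j] > a:
            j += 1
        if j - i >= m:  # long enough to count as a voiced core
            start, end = i, j - 1
            # Step 2: grow over adjoining frames that stay >= threshold b.
            while start > 0 and pssb[start - 1] >= b:
                start -= 1
            while end < n - 1 and pssb[end + 1] >= b:
                end += 1
            # Step 3: record the endpoints, merging cores that share a segment.
            if segments and start <= segments[-1][1]:
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
        i = j
    return segments

# Synthetic PSSB track: weak onset, voiced core, weak tail, and a short burst.
track = np.zeros(100)
track[10:15] = 0.9   # too short to be a voiced core
track[40:45] = 0.05  # unvoiced onset
track[45:80] = 0.8   # voiced core
track[80:86] = 0.02  # weak tail
example_segments = detect_endpoints(track)
```

The short burst is rejected because it lasts fewer than m frames, while the weak onset and tail are kept because they adjoin the voiced core and stay above b.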
The block diagram of the endpoint detection algorithm of the invention is shown in Fig. 4.
Beneficial effects:
The experiments were designed for different SNR environments. The low-SNR input speech is sampled at 16 kHz with 16-bit quantization. A Hamming window is used, with a frame length of 256 and a frame shift of 128. The speech is taken from the TIMIT speech database and the white noise from the NOISEX-92 noise database. Fig. 5 is the waveform of an example utterance from the database (artists), and Fig. 6 is the low-SNR waveform after white noise is added to bring the SNR to -10 dB.
In Fig. 5, the speech starts at frame 40 and ends at frame 87. When white noise is added to bring the SNR to -10 dB, the speech signal is completely submerged in the white noise, and traditional endpoint detection algorithms cannot effectively extract the endpoints from such a signal.
Fig. 7 is the spectrogram of the clean speech example (artists), Fig. 8 is the spectrogram of the low-SNR speech, and Fig. 9 is the spectrogram after speech enhancement based on auditory masking.
Fig. 8 shows that at -10 dB most of the spectrogram structure is flooded by noise, and only the formant structure in the low-frequency region can still be distinguished from the noise. Fig. 9 shows that speech enhancement weakens noise and speech together, but randomly distributed musical noise remains. This is determined by the inherent characteristics of spectral-subtraction algorithms.
If the spectrogram boundary were taken directly from Fig. 9, noise and speech would still be hard to distinguish. Two-dimensional enhancement of the speech spectrogram is therefore also required, as shown in Figs. 10 and 11.
Fig. 10 is the result of applying the two-dimensional noise erosion algorithm to Fig. 9. Compared with Fig. 9, all residual noise other than the stronger-energy residual noise and the low-frequency formant structure of the speech is suppressed to some extent. Fig. 11 is the result of applying the two-dimensional speech dilation algorithm to the spectrogram structure in Fig. 10. The randomly distributed stronger-energy noise structures are relatively weakened, and the spectrogram structure of the speech is relatively strengthened.
Boundary detection is then applied to Fig. 11, giving Fig. 12. Between frame 40 and frame 85, the low-frequency spectrogram boundary structure of the speech is well resolved. However, because a small amount of two-dimensional noise structure remains, many high-frequency noise boundary structures appear in the non-speech regions, which is undesirable. The PSSB parameter therefore gives higher weight to boundary structures in the low-frequency region, so that speech and noise are well separated, as in Fig. 13.
Fig. 13 is the PSSB parameter computed from Fig. 12. Even at -10 dB, the PSSB parameter of the speech signal remains highly distinctive along the time axis. For endpoint detection, a continuity check is applied to the PSSB parameter: if the PSSB values are continuously greater than 0 and more than 20 consecutive frames exceed the threshold 0.5, the run of frames whose PSSB is continuously greater than 0 is judged to be a speech segment.
In the experiments, the endpoint detection algorithm of the invention (PSSB) is compared with four other endpoint detection algorithms in terms of accuracy. The four methods are: 1. energy and short-time zero-crossing rate (EZCR); 2. sub-band amplitude (SBA); 3. wavelet coefficient (WC); 4. adaptive band-partitioning spectral entropy (ABSE). Seventy words from the TIMIT speech database are used as the objects of endpoint detection, and detection is performed 3 times for each word. White noise from the NOISEX-92 noise database is added with suitable weights to obtain speech at different SNRs. An endpoint detection with an error of less than 4 frames is counted as a correct result. Endpoint detection accuracy is defined as the number of correct results divided by the total number of speech segments used for endpoint detection. Table 1 and Fig. 14 show the endpoint detection accuracy of the algorithms at different SNRs.
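The accuracy criterion can be sketched as below. The text does not say whether the 4-frame tolerance applies to the start point, the end point or both, so requiring both to be within tolerance is an assumption, and the function name is hypothetical.

```python
def endpoint_accuracy(detected, reference, tolerance=4):
    """Fraction of segments whose start and end errors are both below tolerance."""
    correct = sum(
        1
        for (det_start, det_end), (ref_start, ref_end) in zip(detected, reference)
        if abs(det_start - ref_start) < tolerance
        and abs(det_end - ref_end) < tolerance
    )
    return correct / len(reference)

# Two utterances: the first is within tolerance, the second starts 10 frames early.
example_accuracy = endpoint_accuracy([(40, 85), (10, 50)], [(40, 87), (20, 50)])
```

Each detected pair is compared with the reference pair at the same position in the list, so the two lists are expected to be aligned per utterance.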
Table 1. Endpoint detection accuracy (%) at different signal-to-noise ratios
An asterisk (*) in Table 1 indicates that the algorithm failed under that condition, in which case its accuracy is taken as zero. Table 1 and Fig. 14 show that at 10 dB the three traditional methods EZCR, SBA and WC have endpoint detection accuracies below 86%. When the SNR falls below 0 dB, these three methods fail completely, which shows that they are not robust to noise. The ABSE method is relatively accurate because it also analyses the high-energy components of clean speech for endpoint detection. The PSSB-based method of the invention achieves a higher endpoint detection rate than ABSE and still reaches 75.2% accuracy at -10 dB.
Description of the drawings:
Figure 1 is the speech enhancement system based on auditory characteristics;
Figure 2 is the spectrogram of speech containing -5 dB white noise;
Figure 3 is the spectrogram after speech enhancement;
Figure 4 is the endpoint detection algorithm using the PSSB parameter;
Figure 5 is the clean speech;
Figure 6 is the low-SNR speech at -10 dB;
Figure 7 is the spectrogram of the clean speech signal;
Figure 8 is the spectrogram of the -10 dB low-SNR speech signal;
Figure 9 is the speech enhancement result;
Figure 10 is the spectrogram after the two-dimensional noise erosion algorithm;
Figure 11 is the spectrogram after the two-dimensional speech dilation algorithm;
Figure 12 is the spectrogram boundary;
Figure 13 is the PSSB parameter and the endpoint detection;
Figure 14 is the comparison of endpoint detection results.
Embodiment
Embodiment 1
Step 1: speech enhancement based on auditory perception characteristics. Speech enhancement based on the auditory masking property is used to suppress noise as far as possible while protecting the speech. In this enhancement method, the masking threshold calculation and the speech enhancement system are as follows:
I. Bark-domain power spectrum
The speech signal x(n) is transformed by the fast Fourier transform (FFT) into the frequency-domain signal $X(k)$, and the power spectrum of the signal is:
$P(k) = |X(k)|^2$ (1)
The Bark power spectrum is:
$B_i = \sum_{k=b_{li}}^{b_{hi}} P(k)$ (2)
where $B_i$ is the energy of the i-th Bark band, $b_{li}$ is the lowest frequency of the i-th band, and $b_{hi}$ is the highest frequency of the i-th band;
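A minimal sketch of the Bark power spectrum of formula (2) follows. It takes the power spectrum as the squared magnitude of the FFT, and it uses a commonly tabulated set of critical-band edges; the band edges, the function name and the 16 kHz default are assumptions of this example, since the band limits are not listed in the text.

```python
import numpy as np

# Commonly tabulated critical-band edges in Hz. These values are an
# assumption of this sketch; the text does not list b_li and b_hi.
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500]

def bark_power_spectrum(frame, sample_rate=16000, edges_hz=BARK_EDGES_HZ):
    """Return B_i, the power summed over each Bark band, for one frame."""
    spectrum = np.fft.rfft(frame)          # X(k) for non-negative frequencies
    power = np.abs(spectrum) ** 2          # P(k), squared magnitude
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    bands = []
    for low, high in zip(edges_hz[:-1], edges_hz[1:]):
        in_band = (freqs >= low) & (freqs < high)   # bins from b_li to b_hi
        bands.append(power[in_band].sum())          # formula (2)
    return np.array(bands)
```

With these edges a 16 kHz signal yields 22 band energies per frame, and a 1 kHz tone falls into the band spanning 920 Hz to 1080 Hz.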
II. Spread Bark-domain power spectrum
Introduce the spreading function $S_{ij}$, which is a matrix satisfying the condition:
(3)
$S_{ij}$ is defined as follows:
(4)
where the argument is the difference between the band numbers of the two frequency bands;
$C_i = \sum_{j=1}^{j_{max}} S_{ij} \cdot B_j, \quad i = 1, 2, \ldots, i_{max}$ (5)
III. Offset function of the masking energy and masking threshold calculation
(6)
$T_i = 10^{\log_{10}(C_i) - (O_i/10)}$ (7)
The weighting coefficient in the offset function $O_i$ takes a value between 0 and 1 and is determined by the speech content; $T_i$ is the masking threshold of the i-th Bark band, hereafter indexed by b, where b has the same meaning as i above;
This threshold is compared with the absolute threshold of hearing in quiet:
(8)
and the larger of the two is taken as the final masking threshold, evaluated on the masking curve of the corresponding Bark band;
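The chain from band energies to the final masking threshold can be sketched as below. The spreading matrix, the offset values and the threshold in quiet are passed in by the caller, because formulas (3), (4), (6) and (8) are not reproduced here; the function name and the small numerical floor are assumptions. Formula (5) is read as a matrix-vector product of the spreading matrix with the vector of band energies.

```python
import numpy as np

def masking_threshold(bark_power, spread, offset_db, quiet_threshold):
    """Spread the Bark energies, apply the offset, compare with quiet threshold.

    bark_power:      band energies B, shape (n_bands,)
    spread:          spreading matrix S, shape (n_bands, n_bands), caller-supplied
    offset_db:       offset O_i in dB per band, caller-supplied
    quiet_threshold: threshold in quiet per band, same power units as B
    """
    bark_power = np.asarray(bark_power, dtype=float)
    # Formula (5) read as a matrix-vector product: C = S B.
    spread_power = np.asarray(spread, dtype=float) @ bark_power
    # Formula (7): T_i = 10 ** (log10(C_i) - O_i / 10).
    # The floor of 1e-12 only avoids log10(0) and is not part of the method.
    log_c = np.log10(np.maximum(spread_power, 1e-12))
    threshold = 10.0 ** (log_c - np.asarray(offset_db, dtype=float) / 10.0)
    # The final threshold is the larger of T_i and the threshold in quiet.
    return np.maximum(threshold, quiet_threshold)
```

With an identity spreading matrix and a 10 dB offset, each threshold is simply one tenth of the band energy unless the threshold in quiet is larger.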
IV. Spectral subtraction and adjustment of the subtraction parameters
The gain function used by the spectral subtraction algorithm is:
$$H(k) = \begin{cases} \left(1 - \alpha \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma} < \frac{1}{\alpha + \beta} \\ \left(\beta \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \text{else} \end{cases} \qquad (9)$$
First, the noise masking threshold of each Bark band is computed for every speech frame, and the adaptive subtraction parameters $\alpha$ and $\beta$ are then obtained from the noise masking threshold. If the masking threshold is high, the residual noise is naturally masked and cannot be heard, so the subtraction parameters take their minimum values. When the masking threshold is low, the residual noise strongly affects the listener and must be reduced. For each frame m, the minimum of the masking threshold is associated with the maximum values of that frame's subtraction parameters. The subtraction parameters follow the relation:
(10)
Here the adaptation is bounded by the minimum and maximum values of $\alpha$ and by the minimum and maximum values of $\beta$. Each parameter takes its maximum value when the masking threshold is at its minimum, and its minimum value when the masking threshold is at its maximum, where the minimum and maximum of the masking threshold are obtained frame by frame. In the experiments, the parameters were set as follows:
V. Real-time noise power spectrum estimation: the noise power spectrum is estimated with the method based on constrained variance spectral smoothing and minima tracking.
VI. Speech enhancement system: the adaptive subtraction parameters $\alpha$ and $\beta$ are obtained from the masking threshold;
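The gain of formula (9) and the threshold-driven adaptation of the subtraction parameters can be sketched as follows. The function names are hypothetical, the default exponent of 2 is only illustrative, and the linear interpolation in `adapt_parameter` is an assumption: the text states only that a high masking threshold calls for the minimum parameter values and a low threshold for stronger noise reduction.

```python
import numpy as np

def adapt_parameter(threshold, t_min, t_max, p_min, p_max):
    """Map a masking threshold to a subtraction parameter (assumed linear)."""
    threshold = np.asarray(threshold, dtype=float)
    if t_max == t_min:
        return np.full_like(threshold, p_max)
    # weight is 1 at the lowest threshold and 0 at the highest threshold.
    weight = np.clip((t_max - threshold) / (t_max - t_min), 0.0, 1.0)
    return p_min + (p_max - p_min) * weight

def spectral_subtraction_gain(noise_mag, noisy_mag, alpha, beta, gamma=2.0):
    """Gain H(k) of formula (9) for one frame.

    noise_mag: |D(k)|, estimated noise magnitude spectrum
    noisy_mag: |Y(k)|, noisy speech magnitude spectrum
    alpha, beta: subtraction parameters (scalars or per-bin arrays)
    """
    noise_mag = np.asarray(noise_mag, dtype=float)
    noisy_mag = np.maximum(np.asarray(noisy_mag, dtype=float), 1e-12)
    ratio = (noise_mag / noisy_mag) ** gamma
    # Upper branch of (9); the clip only protects bins where the lower
    # branch is selected anyway, so it does not change the result.
    subtract = np.maximum(1.0 - alpha * ratio, 0.0) ** (1.0 / gamma)
    # Lower branch of (9).
    floor = (beta * ratio) ** (1.0 / gamma)
    return np.where(ratio < 1.0 / (alpha + beta), subtract, floor)
```

The enhanced magnitude spectrum is then the product of the gain and the noisy magnitude spectrum, bin by bin.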
Step 2: two-dimensional enhancement of the speech;
2.1 Two-dimensional noise erosion algorithm
The two-dimensional noise erosion of the speech spectrogram is determined as follows. First, a short-time Fourier transform is applied to the speech, and the spectrum of each frame is computed by:
(11)
where the input is the m-th frame of the speech signal and the output is the spectrum of the m-th frame; N is the frame length, which is also the number of points of the short-time Fourier transform; the window is a Hamming window. The power spectrum of each frame can be expressed as:
(12)
This power spectrum, taken over all frames, is defined as the spectrogram of the speech signal;
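As an illustration of the framing and windowing described above, the spectrogram can be computed as below. The function name is hypothetical; the frame length of 256 and frame shift of 128 are the values quoted for the experiments, and taking the power as the squared magnitude of the one-sided spectrum is an assumption of this sketch.

```python
import numpy as np

def power_spectrogram(signal, frame_len=256, hop=128):
    """Return P[m, k]: power of frame m at frequency bin k."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hamming(frame_len)                 # Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    spec = np.empty((n_frames, frame_len // 2 + 1))
    for m in range(n_frames):
        frame = signal[m * hop: m * hop + frame_len] * window
        # N-point transform of the windowed frame, then squared magnitude.
        spec[m] = np.abs(np.fft.rfft(frame, n=frame_len)) ** 2
    return spec
```

The rows of the returned array are frames and the columns are frequency bins, which is the layout assumed by the two-dimensional operations that follow.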
The two-dimensional noise erosion of the spectrogram is defined as:
(13)
where the quantities involved are the structuring element, the domain of definition of the spectrogram, and the domain of definition of the structuring element; the translated coordinates must lie within the domain of the spectrogram, and the offsets must lie within the domain of the structuring element;
To match the structural form of the low-energy residual noise in the spectrogram, the structuring element of the two-dimensional noise erosion algorithm is defined as:
(14)
2.2 Two-dimensional speech dilation algorithm
For the result of the two-dimensional noise erosion, the two-dimensional speech dilation algorithm is defined by:
(15)
where the quantities involved are the structuring element, the domain of definition of the eroded spectrogram, and the domain of definition of the structuring element;
Accordingly, the structuring element of the two-dimensional speech dilation algorithm is defined with the following shape:
(16)
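Erosion followed by dilation can be illustrated with standard grey-scale morphology on the spectrogram array. This sketch uses a flat rectangular structuring element of odd size, which is an assumption: the structuring elements of formulas (14) and (16) are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def _sliding_extreme(image, shape, reducer, pad_value):
    """Apply min or max over a flat rectangular neighbourhood."""
    image = np.asarray(image, dtype=float)
    rows, cols = shape
    top, left = rows // 2, cols // 2
    # Padding with +inf (erosion) or -inf (dilation) means that positions
    # outside the spectrogram never influence the result.
    padded = np.pad(image,
                    ((top, rows - 1 - top), (left, cols - 1 - left)),
                    constant_values=pad_value)
    shifted = [padded[i:i + image.shape[0], j:j + image.shape[1]]
               for i in range(rows) for j in range(cols)]
    return reducer(np.stack(shifted), axis=0)

def erode(image, shape=(3, 3)):
    """Flat grey-scale erosion: neighbourhood minimum."""
    return _sliding_extreme(image, shape, np.min, np.inf)

def dilate(image, shape=(3, 3)):
    """Flat grey-scale dilation: neighbourhood maximum."""
    return _sliding_extreme(image, shape, np.max, -np.inf)
```

Calling `dilate(erode(spectrogram))` removes isolated time-frequency points smaller than the structuring element while restoring the extent of larger connected regions, which is the behaviour the two steps rely on to separate scattered residual noise from continuously distributed speech.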
Step 3: perceptual speech spectrum structure boundary (PSSB) parameter and endpoint detection algorithm
3.1 Perceptual speech spectrum structure boundary (PSSB) parameter
The invention uses the neighbourhood model in formula (17) to approximate the gradient of the result of the two-dimensional speech enhancement;
(17)
The central element is the centre point of this neighbourhood model, and the gradient at the centre of the neighbourhood can be expressed as:
(18)
whose two components are determined by formula (19) and formula (20):
(19)
(20)
The result is the boundary, which describes the boundary information of the continuously distributed speech signal in the noisy speech spectrogram.
The perceptual speech spectrum structure boundary parameter PSSB is proposed as follows:
(21)
where the left-hand side is the PSSB parameter of the m-th frame, and M is the total number of frames;
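The idea of turning a spectrogram boundary into one value per frame can be illustrated as below. Everything specific in this sketch is an assumption, since formulas (17) to (21) are not reproduced here: the Sobel operator stands in for the 3x3 neighbourhood gradient, the gradient magnitude stands in for the boundary, and the per-frame value is the boundary summed over frequency and normalised by its maximum over all frames. It should not be read as the patent's exact definition of PSSB.

```python
import numpy as np

def sobel_gradient_magnitude(image):
    """Gradient magnitude from 3x3 Sobel differences (assumed operator)."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    padded = np.pad(image, 1, mode="edge")

    def win(i, j):
        # View of the padded image shifted by (i - 1, j - 1).
        return padded[i:i + rows, j:j + cols]

    # Difference along the frame (time) axis.
    g_time = (win(2, 0) + 2 * win(2, 1) + win(2, 2)) \
           - (win(0, 0) + 2 * win(0, 1) + win(0, 2))
    # Difference along the frequency axis.
    g_freq = (win(0, 2) + 2 * win(1, 2) + win(2, 2)) \
           - (win(0, 0) + 2 * win(1, 0) + win(2, 0))
    return np.hypot(g_time, g_freq)

def boundary_per_frame(enhanced_spectrogram):
    """One boundary value per frame, normalised to the range 0 to 1."""
    boundary = sobel_gradient_magnitude(enhanced_spectrogram)
    per_frame = boundary.sum(axis=1)      # sum over frequency bins
    peak = per_frame.max()
    return per_frame / peak if peak > 0 else per_frame
```

Frames lying inside a uniform region, whether silence or the interior of a flat patch, receive values near zero, while frames containing edges of spectral structure receive large values.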
3.2 Speech endpoint detection
A detection method based on the continuous distribution of speech is adopted, so that voiced segments and the unvoiced segments at the endpoints are treated differently. The specific endpoint detection method is as follows:
(1) First detect the segments in which the PSSB parameter is greater than threshold a over m consecutive frames; these segments are the detected voiced segments;
(2) Taking such a segment as the base, all adjoining segments that are continuously greater than or equal to threshold b are also classified as speech. The threshold b is chosen small; in the experiments, values of b from 0.01 to 0.05 all gave good recognition results. In this way, unvoiced segments with small PSSB values can also be identified;
(3) The start point and end point of this speech segment are the speech endpoints.
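The three steps above can be sketched as a two-threshold scan over the per-frame PSSB values. The function name is hypothetical and the defaults are illustrative: 0.5 and 20 frames echo the values mentioned for the continuity check in the description, b = 0.02 lies in the quoted range of 0.01 to 0.05, and reading "m consecutive frames" as "at least m frames" is an assumption.

```python
def detect_endpoints(pssb, a=0.5, b=0.02, min_frames=20):
    """Return a list of (start, end) frame indices, end inclusive."""
    if b > a:
        raise ValueError("threshold b is expected to be no larger than a")
    n = len(pssb)
    segments = []
    i = 0
    while i < n:
        if pssb[i] <= a:
            i += 1
            continue
        # Step (1): find a run of frames above threshold a.
        j = i
        while j + 1 < n and pssb[j + 1] > a:
            j += 1
        if j - i + 1 < min_frames:
            i = j + 1          # run too short to count as a voiced segment
            continue
        # Step (2): extend in both directions while values stay >= b.
        start, end = i, j
        while start > 0 and pssb[start - 1] >= b:
            start -= 1
        while end + 1 < n and pssb[end + 1] >= b:
            end += 1
        # Step (3): the extended run's first and last frames are the endpoints.
        segments.append((start, end))
        i = end + 1
    return segments
```

Short bursts above threshold a that do not last long enough are ignored, which is how isolated residual noise peaks are kept from being reported as speech.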
The experiments are designed for environments with different signal-to-noise ratios. The input low-SNR speech is sampled at 16 kHz with 16-bit quantisation. A Hamming window is used, with a frame length of 256 and a frame shift of 128. The speech is taken from the TIMIT speech database, and the white noise from the NoiseX-92 noise database.
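Test material at a given SNR can be produced by scaling the noise before adding it to the clean speech. The sketch below is one common way to do this and is an assumption about the mixing procedure, which the text does not spell out; the function name is hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as the speech")
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Choose the weight so that speech_power / (scale**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Calling the function with SNR values between -10 and 10 on the same clean utterance yields the set of conditions compared in the experiments.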

Claims (5)

1. A speech endpoint detection algorithm using the perceptual speech spectrum structure boundary parameter, characterized in that the steps of the algorithm are as follows: (1) speech enhancement based on auditory perception characteristics; (2) two-dimensional enhancement of the speech, comprising a two-dimensional noise erosion algorithm and a two-dimensional speech dilation algorithm; (3) perceptual speech spectrum structure boundary (PSSB) parameter and speech endpoint detection.
2. The speech endpoint detection algorithm using the perceptual speech spectrum structure boundary parameter according to claim 1, characterized in that the steps of the algorithm are as follows:
Step 1: speech enhancement based on auditory perception characteristics; speech enhancement based on the auditory masking property is used to suppress noise as far as possible while protecting the speech; in this enhancement method, the masking threshold calculation and the speech enhancement system are as follows:
I. Bark-domain power spectrum
The speech signal x(n) is transformed by the fast Fourier transform (FFT) into the frequency-domain signal $X(k)$, and the power spectrum of the signal is:
$P(k) = |X(k)|^2$ (1)
The Bark power spectrum is:
$B_i = \sum_{k=b_{li}}^{b_{hi}} P(k)$ (2)
where $B_i$ is the energy of the i-th Bark band, $b_{li}$ is the lowest frequency of the i-th band, and $b_{hi}$ is the highest frequency of the i-th band;
II. Spread Bark-domain power spectrum
Introduce the spreading function $S_{ij}$, which is a matrix satisfying the condition:
(3)
$S_{ij}$ is defined as follows:
(4)
where the argument is the difference between the band numbers of the two frequency bands;
$C_i = \sum_{j=1}^{j_{max}} S_{ij} \cdot B_j, \quad i = 1, 2, \ldots, i_{max}$ (5)
III. Offset function of the masking energy and masking threshold calculation
(6)
$T_i = 10^{\log_{10}(C_i) - (O_i/10)}$ (7)
The weighting coefficient in the offset function $O_i$ takes a value between 0 and 1 and is determined by the speech content; $T_i$ is the masking threshold of the i-th Bark band, hereafter indexed by b, where b has the same meaning as i above;
This threshold is compared with the absolute threshold of hearing in quiet:
(8)
and the larger of the two is taken as the final masking threshold, evaluated on the masking curve of the corresponding Bark band;
IV. Spectral subtraction and adjustment of the subtraction parameters
The gain function used by the spectral subtraction algorithm is:
$$H(k) = \begin{cases} \left(1 - \alpha \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma} < \frac{1}{\alpha + \beta} \\ \left(\beta \cdot \left[\frac{|D(k)|}{|Y(k)|}\right]^{\gamma}\right)^{1/\gamma}, & \text{else} \end{cases} \qquad (9)$$
First, the noise masking threshold of each Bark band is computed for every speech frame, and the adaptive subtraction parameters $\alpha$ and $\beta$ are then obtained from the noise masking threshold; if the masking threshold is high, the residual noise is naturally masked and cannot be heard, so the subtraction parameters take their minimum values; when the masking threshold is low, the residual noise strongly affects the listener and must be reduced; for each frame m, the minimum of the masking threshold is associated with the maximum values of that frame's subtraction parameters; the subtraction parameters follow the relation:
(10)
Here the adaptation is bounded by the minimum and maximum values of $\alpha$ and by the minimum and maximum values of $\beta$; each parameter takes its maximum value when the masking threshold is at its minimum, and its minimum value when the masking threshold is at its maximum, where the minimum and maximum of the masking threshold are obtained frame by frame; in the experiments, the parameters were set as follows:
V. Real-time noise power spectrum estimation: the noise power spectrum is estimated with the method based on constrained variance spectral smoothing and minima tracking;
VI. Speech enhancement system: the adaptive subtraction parameters $\alpha$ and $\beta$ are obtained from the masking threshold;
Step 2: two-dimensional enhancement of the speech;
2.1 Two-dimensional noise erosion algorithm
The two-dimensional noise erosion of the speech spectrogram is determined as follows; first, a short-time Fourier transform is applied to the speech, and the spectrum of each frame is computed by:
(11)
where the input is the m-th frame of the speech signal and the output is the spectrum of the m-th frame; N is the frame length, which is also the number of points of the short-time Fourier transform; the window is a Hamming window; the power spectrum of each frame can be expressed as:
(12)
This power spectrum, taken over all frames, is defined as the spectrogram of the speech signal;
The two-dimensional noise erosion of the spectrogram is defined as:
(13)
where the quantities involved are the structuring element, the domain of definition of the spectrogram, and the domain of definition of the structuring element; the translated coordinates must lie within the domain of the spectrogram, and the offsets must lie within the domain of the structuring element;
To match the structural form of the low-energy residual noise in the spectrogram, the structuring element of the two-dimensional noise erosion algorithm is defined as:
(14)
2.2 Two-dimensional speech dilation algorithm
For the result of the two-dimensional noise erosion, the two-dimensional speech dilation algorithm is defined by:
(15)
where the quantities involved are the structuring element, the domain of definition of the eroded spectrogram, and the domain of definition of the structuring element;
Accordingly, the structuring element of the two-dimensional speech dilation algorithm is defined with the following shape:
(16)
Step 3: perceptual speech spectrum structure boundary (PSSB) parameter and endpoint detection algorithm
3.1 Perceptual speech spectrum structure boundary (PSSB) parameter
The invention uses the neighbourhood model in formula (17) to approximate the gradient of the result of the two-dimensional speech enhancement;
(17)
The central element is the centre point of this neighbourhood model, and the gradient at the centre of the neighbourhood can be expressed as:
(18)
whose two components are determined by formula (19) and formula (20):
(19)
(20)
The result is the boundary, which describes the boundary information of the continuously distributed speech signal in the noisy speech spectrogram;
The perceptual speech spectrum structure boundary parameter PSSB is proposed as follows:
(21)
where the left-hand side is the PSSB parameter of the m-th frame, and M is the total number of frames;
3.2 Speech endpoint detection
A detection method based on the continuous distribution of speech is adopted, so that voiced segments and the unvoiced segments at the endpoints are treated differently; the specific endpoint detection method is as follows:
(1) First detect the segments in which the PSSB parameter is greater than threshold a over m consecutive frames; these segments are the detected voiced segments;
(2) Taking such a segment as the base, all adjoining segments that are continuously greater than or equal to threshold b are also classified as speech; the threshold b is chosen small; in the experiments, values of b from 0.01 to 0.05 all gave good recognition results; in this way, unvoiced segments with small PSSB values can also be identified;
(3) The start point and end point of this speech segment are the speech endpoints.
3. The speech endpoint detection algorithm using the perceptual speech spectrum structure boundary parameter according to claim 2, characterized in that: the experiments are designed for environments with different signal-to-noise ratios; the input low-SNR speech is sampled at 16 kHz with 16-bit quantisation.
4. The speech endpoint detection algorithm using the perceptual speech spectrum structure boundary parameter according to claim 2, characterized in that: a Hamming window is used, with a frame length of 256 and a frame shift of 128.
5. The speech endpoint detection algorithm using the perceptual speech spectrum structure boundary parameter according to claim 2, characterized in that: the speech is taken from the TIMIT speech database, and the white noise from the NoiseX-92 noise database.
CN201410175090.8A 2014-04-29 2014-04-29 Speech endpoint detection algorithm adopting perceptual speech spectrum structure boundary parameters Expired - Fee Related CN104091593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410175090.8A CN104091593B (en) 2014-04-29 2014-04-29 Speech endpoint detection algorithm adopting perceptual speech spectrum structure boundary parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410175090.8A CN104091593B (en) 2014-04-29 2014-04-29 Speech endpoint detection algorithm adopting perceptual speech spectrum structure boundary parameters

Publications (2)

Publication Number Publication Date
CN104091593A true CN104091593A (en) 2014-10-08
CN104091593B CN104091593B (en) 2017-02-15

Family

ID=51639303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410175090.8A Expired - Fee Related CN104091593B (en) 2014-04-29 2014-04-29 Speech endpoint detection algorithm adopting perceptual speech spectrum structure boundary parameters

Country Status (1)

Country Link
CN (1) CN104091593B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599269A (en) * 2009-07-02 2009-12-09 中国农业大学 Sound end detecting method and device
CN102982801A (en) * 2012-11-12 2013-03-20 中国科学院自动化研究所 Phonetic feature extracting method for robust voice recognition
CN103489446A (en) * 2013-10-10 2014-01-01 福州大学 Twitter identification method based on self-adaption energy detection under complex environment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
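Nima Derakhshan et al.: "Noise power spectrum estimation using constrained variance spectral smoothing and minima tracking", Speech Communication *
Wu Di: "Speech enhancement based on auditory characteristics and spectrogram characteristics", China Doctoral and Master's Theses Full-text Database (Master's), Science and Technology Information Series *
Xiao Chunzhi et al.: "A speech enhancement algorithm based on spectrogram analysis", 《电声技术》 (Audio Engineering) *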

Also Published As

Publication number Publication date
CN104091593B (en) 2017-02-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: Huang Xujiang; Wu Di
Inventor before: Wu Di; Zhao Heming; Tao Zhi
TR01 Transfer of patent right
Effective date of registration: 20180321
Address after: Room 202, room two, No. 868, West Ring Road, Jiangsu, Jiangsu
Patentee after: Suzhou Cheng Bang energy conservation science & Technology Co., Ltd.
Address before: 215000 Suzhou Industrial Park, Jiangsu Road, No. 199
Patentee before: Soochow University
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20170215
Termination date: 20180429