Summary of the invention
Technical problem to be solved: in low signal-to-noise ratio (SNR) environments, conventional endpoint detection methods have low detection accuracy.
Technical scheme: based on the different characteristics of speech and noise in the two-dimensional time-frequency space at low SNR, and in combination with a speech enhancement algorithm based on auditory perception properties, the invention proposes the perception spectrogram structure boundary parameter PSSB (Perception Spectrogram Structure Boundary) and uses it for endpoint detection. First, the low-SNR speech is enhanced on the basis of auditory masking. Compared with traditional speech enhancement algorithms, this method more effectively retains the speech components that the human ear can perceive. On this basis, taking into account the continuous distribution of the clean-speech spectrogram along the time axis, the noisy speech is enhanced in two dimensions, so that the spectrogram structure of the speech is further highlighted while that of the noise is suppressed. Finally, the two-dimensional boundary of the continuously distributed clean-speech spectrogram structure is found, and the PSSB parameter is derived from it for endpoint detection.
1. Speech enhancement based on auditory perception properties
In low-SNR environments, most endpoint detection algorithms detect speech endpoints poorly or fail completely, whereas humans can still identify speech segments in strong noise. In noisy environments the auditory perception properties of the human ear play an important role. Exploiting auditory masking, one of these properties, makes it possible to suppress noise to some extent while retaining more of the speech components. The PSSB parameter proposed by the present invention first applies speech enhancement based on auditory masking, suppressing noise as far as possible while protecting the speech. The most important step in this enhancement method is computing the masking threshold. The computation of the masking threshold and the speech enhancement system are as follows:
(1) Bark-domain power spectrum
The speech signal x(n) is transformed by a fast Fourier transform (FFT) into a frequency-domain signal, whose power spectrum is:
(1)
The Bark power spectrum is obtained by accumulating this power spectrum within each Bark band. Its terms denote, respectively, the energy of the i-th Bark band, the lowest frequency of band i, and the highest frequency of band i.
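The band accumulation described above can be sketched as follows. This is a minimal illustration, not the patent's exact formula: the Hz-to-Bark mapping (a Zwicker-style approximation), the choice of 1-Bark-wide bands, and the names `hz_to_bark` and `bark_band_power` are all assumptions introduced here.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Approximate Hz -> Bark mapping (assumed; the text does not give one)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_band_power(power_spectrum, sample_rate):
    """Sum a one-sided power spectrum inside each 1-Bark-wide band."""
    power_spectrum = np.asarray(power_spectrum, dtype=float)
    # Centre frequency of every FFT bin from 0 Hz up to Nyquist.
    freqs = np.linspace(0.0, sample_rate / 2.0, len(power_spectrum))
    # Band number of every bin: the integer part of its Bark value.
    band_index = np.floor(hz_to_bark(freqs)).astype(int)
    n_bands = band_index.max() + 1
    # Accumulate bin powers per band (lowest to highest bin of each band).
    return np.bincount(band_index, weights=power_spectrum, minlength=n_bands)
```

With a 16 kHz sampling rate and a 256-point frame, the one-sided spectrum has 129 bins, and the total energy is preserved by the grouping.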
(2) Spread Bark-domain power spectrum
A spreading function is introduced. It is a matrix that satisfies the condition:
(3)
and is defined as follows:
(4)
where the argument of the spreading function is the difference between the Bark band numbers of two bands.
(3) Calculation of the masking-energy offset function and of the masking threshold
(6)
The coefficient in this formula takes a value between 0 and 1 and is determined by the speech content. The result is the masking threshold of the i-th Bark band; it is subsequently indexed by b, where b has the same meaning as i above.
This threshold is compared with the absolute threshold of hearing (the threshold in quiet):
(8)
and the larger of the two is taken as the final masking threshold. For this comparison the absolute threshold of hearing is expressed as its corresponding Bark-domain masking curve.
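The chain from Bark power to final masking threshold can be sketched as follows. Because formulas (3) to (8) are not reproduced in this text, the sketch assumes a commonly used model: a Schroeder-type spreading function and an offset of the form a(14.5 + i) + (1 - a)5.5 dB, where a is the coefficient between 0 and 1. These formulas and the names `spreading_matrix` and `masking_threshold` are assumptions, not the patent's confirmed definitions.

```python
import numpy as np

def spreading_matrix(n_bands):
    """Assumed Schroeder-type spreading function, as a matrix of power ratios."""
    idx = np.arange(n_bands)
    delta = idx[:, None] - idx[None, :]   # difference of Bark band numbers
    sf_db = (15.81 + 7.5 * (delta + 0.474)
             - 17.5 * np.sqrt(1.0 + (delta + 0.474) ** 2))
    return 10.0 ** (sf_db / 10.0)

def masking_threshold(bark_power, coefficient, quiet_threshold):
    """Spread the Bark power, subtract the offset, then floor at the threshold in quiet."""
    bark_power = np.asarray(bark_power, dtype=float)
    spread = spreading_matrix(len(bark_power)) @ bark_power
    band = np.arange(1, len(bark_power) + 1)
    # Assumed offset in dB; `coefficient` lies between 0 and 1.
    offset_db = coefficient * (14.5 + band) + (1.0 - coefficient) * 5.5
    threshold = spread * 10.0 ** (-offset_db / 10.0)
    # The larger of the two values is the final masking threshold.
    return np.maximum(threshold, quiet_threshold)
```

The last line is the only step taken directly from the text: the computed threshold is compared with the threshold in quiet and the maximum is kept.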
(4) Spectral subtraction and adjustment of the subtraction parameters
The gain function used by the spectral subtraction algorithm is as follows:
First, the noise masking threshold is computed in each Bark band of every speech frame; the adaptive subtraction parameters are then obtained from this threshold. If the masking threshold is high, the residual noise is naturally masked and cannot be heard, so the subtraction parameters take their minimum values. If the masking threshold is low, the residual noise strongly affects the listener and must be reduced. For each frame m, the minimum of the masking threshold therefore corresponds to the maximum of the two per-frame subtraction parameters. The subtraction parameters obey the following relation:
(10)
Each subtraction parameter is bounded by its own minimum and maximum values. When the masking threshold is at its minimum, the parameter takes its maximum value; when the masking threshold is at its maximum, the parameter takes its minimum value. The minimum and maximum of the masking threshold used in the formula are obtained frame by frame. In the experiments, the parameters were set as follows:
(5) Real-time noise power spectrum estimation
Speech enhancement requires a noise spectrum estimation method with very good real-time behaviour. The invention uses a noise power spectrum estimator based on variance-constrained spectral smoothing and minimum tracking. The core of this algorithm is a variance-constrained smoothing filter, which controls the variance of the smoothed short-time power spectrum and thereby makes the minimum tracking more accurate. The noise spectrum estimated in this way follows sudden noise changes promptly, shows no obvious delay, and is more accurate than the noise spectrum obtained by other methods.
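The general idea of smoothing followed by minimum tracking can be illustrated with a deliberately simplified sketch. It substitutes a fixed first-order recursive smoother for the variance-constrained smoothing filter described above, so it is not the estimator used by the invention; the function name and the values of `smoothing` and `window` are assumptions.

```python
import numpy as np

def track_noise_floor(power_frames, smoothing=0.85, window=50):
    """Smooth each frequency bin over time, then take a sliding minimum."""
    power_frames = np.asarray(power_frames, dtype=float)  # (frames, bins)
    smoothed = np.empty_like(power_frames)
    smoothed[0] = power_frames[0]
    for t in range(1, len(power_frames)):
        # Fixed smoothing constant (the patent constrains the variance instead).
        smoothed[t] = (smoothing * smoothed[t - 1]
                       + (1.0 - smoothing) * power_frames[t])
    noise = np.empty_like(smoothed)
    for t in range(len(smoothed)):
        # Minimum of the smoothed power over the most recent `window` frames.
        noise[t] = smoothed[max(0, t - window + 1):t + 1].min(axis=0)
    return noise
```

A short high-energy burst (speech) raises the smoothed power but not the tracked minimum, which is why the minimum serves as a noise estimate.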
(6) Speech enhancement system
The adaptive subtraction parameters are obtained from the masking threshold. The speech enhancement system is shown in Figure 1.
2. Two-dimensional enhancement of speech
After low-SNR speech has been enhanced, both the noise and the speech have been attenuated by the spectral subtraction. However, because the voiced segments of speech contain high-energy structures such as formants, the low-frequency region of the speech spectrogram still has a relatively high SNR in the two-dimensional time-frequency plane, even in noise. These high-energy structures are usually continuous in time. Therefore, if the continuously distributed high-energy regions are found in the two-dimensional spectrogram, and the unvoiced segments connected to them are found from there, the start and end points of the speech can be obtained. In this method, boundary detection is the algorithm used to find continuously distributed two-dimensional data structures.
However, whether or not the low-SNR speech has been enhanced, the noise (which becomes musical residual noise after enhancement) leaves boundaries of noise spectrogram structures in the boundary detection result. The spectrogram structure of the clean speech is disturbed and obscured by that of the noise, which greatly interferes with finding the clean-speech structure, as shown in Figures 2 and 3.
Fig. 2 is the spectrogram of speech with white noise at -5 dB. In the figure, the continuous dark horizontal stripes are the speech signal (in the high band the lower-energy speech is masked by noise, so the formant structure of the high-frequency region cannot be seen), and the dark speckled background is the white noise. Fig. 3 is the spectrogram after speech enhancement: the noise is greatly weakened, but musical noise of varying strength remains. The invention divides this residual noise into stronger-energy residual noise and weaker-energy residual noise, as marked in Fig. 3. Both greatly interfere with finding the speech endpoints. Therefore, before the endpoints are determined, the invention applies a two-dimensional enhancement that exploits the difference between the spectrogram structure of the residual noise and that of the clean speech. It consists of a two-dimensional noise erosion algorithm and a two-dimensional speech dilation algorithm.
2.1 Two-dimensional noise erosion algorithm
In the enhancement of two-dimensional data, an erosion algorithm can weaken or eliminate specific two-dimensional structures. In the spectrogram obtained after speech enhancement, the weaker-energy residual noise (the faint speckle structures) is usually randomly distributed, as shown in Figure 3, and has small size and low energy. Although these structures are not as strong as the white noise in Fig. 3, they still interfere with finding the spectrogram structure boundary of the clean speech. Based on these features, the invention proposes a two-dimensional noise erosion algorithm to weaken such two-dimensional structures.
The two-dimensional noise erosion of the speech spectrogram is determined by the following procedure. First, a short-time Fourier transform is applied to the speech, and the spectrum of each frame is computed by:
(11)
Here the input is the m-th frame of the speech signal and the output is the spectrum of the m-th frame; N is both the frame length and the number of points of the short-time Fourier transform, and the window function is a Hamming window. The power spectrum of each frame can be expressed as:
(12)
and this power spectrum is defined as the spectrogram of the speech signal.
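The spectrogram defined above can be computed with a short sketch. The frame length of 256 and frame shift of 128 are the values reported for the experiments in this document; the function name `spectrogram` and the use of a one-sided FFT are choices made here for illustration.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Hamming-windowed short-time power spectrum, shape (frames, bins)."""
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames (50% overlap by default).
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # N-point FFT of each windowed frame; keep the non-negative frequencies.
    spectrum = np.fft.rfft(frames * window, n=frame_len, axis=1)
    return np.abs(spectrum) ** 2
```

The signal must contain at least one full frame. A one-second signal at 16 kHz yields 124 frames of 129 frequency bins each.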
The two-dimensional noise erosion of the spectrogram is defined as:
(13)
where the second operand is the structuring element, and the two domains in the formula are the domain of the spectrogram and the domain of the structuring element. The translated coordinates must lie within the domain of the spectrogram, and the offsets must lie within the domain of the structuring element. Applying two-dimensional noise erosion to a signal has two effects: (1) if all elements of the structuring element are positive, the output signal tends to be weaker than the original signal; (2) if a noise structure in the input spectrogram resembles the structuring element, it is weakened, and the degree of weakening depends on the shape of the noise structure and the shape of the structuring element.
Applied to the spectrogram, an erosion algorithm attenuates noise and speech at the same time. The purpose of the two-dimensional noise erosion algorithm proposed by the invention is to attenuate the noise relatively more while preserving the speech better. To match the structural form of the weaker-energy residual noise, the structuring element of the two-dimensional noise erosion algorithm is defined as:
(14)
This structuring element is close to the spectrogram structure of the weaker-energy residual noise (small dots). Eroding the spectrogram with it therefore weakens this noise to some extent.
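A grey-scale erosion of the kind described above can be sketched as follows. It uses the standard morphological definition (minimum of the signal minus the structuring element over the element's domain); the 3x3 flat element in the usage note is only a placeholder, because the element of formula (14) is not reproduced in this text, and `grey_erode` is a hypothetical name.

```python
import numpy as np

def grey_erode(image, element):
    """Grey-scale erosion: min over the element's domain of image - element."""
    image = np.asarray(image, dtype=float)
    element = np.asarray(element, dtype=float)   # odd-sized in both directions
    eh, ew = element.shape
    ph, pw = eh // 2, ew // 2
    # Repeat edge values so the translated element stays inside the domain.
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="edge")
    out = np.full(image.shape, np.inf)
    for i in range(eh):
        for j in range(ew):
            shifted = padded[i:i + image.shape[0], j:j + image.shape[1]]
            out = np.minimum(out, shifted - element[i, j])
    return out
```

With a flat element such as `np.zeros((3, 3))`, an isolated one-pixel spike is removed completely, while a 3x3 block of high values keeps its centre. This mirrors the intended effect: small dot-like residual noise is weakened more than larger speech structures.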
2.2 Two-dimensional speech dilation algorithm
After the two-dimensional noise erosion algorithm, the weaker-energy residual noise is well suppressed. However, the stronger-energy residual noise (see Fig. 3) is close to the clean speech in energy, so excessive erosion would also weaken the two-dimensional structure of the clean speech. A dilation algorithm enhances spectrogram structures that resemble the structuring element and relatively weakens those that do not. The invention therefore proposes a two-dimensional speech dilation algorithm that exploits the difference between the stronger-energy residual noise and the clean-speech structure. The structuring element is defined to resemble the continuously distributed clean speech, so that this noise structure is relatively suppressed.
For the result of the two-dimensional noise erosion, the two-dimensional speech dilation algorithm is defined by:
(15)
where the second operand is the structuring element, and the two domains in the formula are the domain of the eroded spectrogram and the domain of the structuring element. In effect, the structuring element is translated to every position in the spectrogram, its values are added to the values of the two-dimensional signal, and the maximum is taken. Applying two-dimensional speech dilation to the signal has two effects: (1) if all elements of the structuring element are positive, the output signal tends to be stronger than the original signal; (2) whether a given structure in the input spectrogram is relatively enhanced depends on the values and shape of the structuring element used for dilation.
While enhancing speech structures, a dilation algorithm also enhances the corresponding noise structures. The purpose of the two-dimensional speech dilation algorithm proposed by the invention is to enhance the speech structure as much as possible while relatively suppressing the noise structure. The spectrogram structure of voiced clean speech usually consists of strips stretched along the time axis, whereas that of the stronger-energy residual noise usually consists of squares or circles of various sizes, as shown in Figure 3. The structuring element is therefore defined as an elongated shape stretched along the time axis, which enhances all similar structures and relatively weakens noise structures of a different shape.
The structuring element of the two-dimensional speech dilation algorithm is therefore defined with the following shape:
(16)
This is a horizontal structuring element stretched along the time direction. All structures similar to it are enhanced. Because the spectrogram structure of clean speech is usually continuous in time and therefore resembles this element, the clean-speech structure is enhanced, whereas the spectrogram structure of the stronger-energy residual noise, which usually consists of large round or square dots, is relatively weakened.
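The dilation step can be sketched in the same way as a standard grey-scale dilation (maximum of the signal plus the structuring element). The array layout (rows are frequency bins, columns are frames) and the 1x5 flat horizontal element in the usage note are assumptions; the element of formula (16) is not reproduced in this text, and `grey_dilate` is a hypothetical name.

```python
import numpy as np

def grey_dilate(image, element):
    """Grey-scale dilation: max over the element's domain of image + element."""
    image = np.asarray(image, dtype=float)       # rows: frequency, cols: time
    element = np.asarray(element, dtype=float)   # odd-sized in both directions
    eh, ew = element.shape
    ph, pw = eh // 2, ew // 2
    # Pad with -inf so positions outside the domain never win the maximum.
    padded = np.pad(image, ((ph, ph), (pw, pw)),
                    mode="constant", constant_values=-np.inf)
    flipped = element[::-1, ::-1]                # dilation reflects the element
    out = np.full(image.shape, -np.inf)
    for i in range(eh):
        for j in range(ew):
            shifted = padded[i:i + image.shape[0], j:j + image.shape[1]]
            out = np.maximum(out, shifted + flipped[i, j])
    return out
```

With `np.zeros((1, 5))` as the element, every value is spread along the time axis only, so strip-like structures that are already continuous in time are reinforced while the frequency extent of each structure is unchanged.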
3. Perception spectrogram structure boundary (PSSB) parameter and endpoint detection algorithm
3.1 Perception spectrogram structure boundary (PSSB) parameter
Taking into account the continuous distribution of the clean-speech spectrogram along the time axis, the invention enhances the noisy speech in two dimensions, so that the spectrogram structure of the speech is further highlighted while that of the noise is suppressed. The invention then finds the boundary of the continuously distributed clean-speech spectrogram structure and proposes the perception spectrogram structure boundary parameter PSSB for endpoint detection.
To obtain the PSSB parameter, the boundary information of the spectrogram structure is computed first. Boundary detection is an important method for finding the boundary of a two-dimensional structure. The boundary of a two-dimensional signal can be represented by the gradient determined from its first-order derivatives. The invention uses the neighbourhood model of formula (17) to approximate the gradient of the two-dimensional enhancement result.
(17)
The middle element is the central point of this neighbourhood model. The gradient at the centre of the neighbourhood can be expressed as:
(18)
where the two gradient components are given by formula (19) and formula (20):
(19)
(20)
The result is the boundary of the enhanced spectrogram; it describes the boundary information of the continuously distributed speech signal in the noisy spectrogram.
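A gradient-based boundary detector over a 3x3 neighbourhood can be sketched as follows. Formulas (17) to (20) are not reproduced in this text, so the sketch assumes Sobel-style difference masks and the sum of absolute component values as the gradient magnitude; both are assumptions, and `sobel_boundary` is a hypothetical name.

```python
import numpy as np

def sobel_boundary(image):
    """Approximate gradient magnitude |Gx| + |Gy| on a 3x3 neighbourhood."""
    image = np.asarray(image, dtype=float)
    padded = np.pad(image, 1, mode="edge")
    rows, cols = image.shape

    def z(di, dj):
        # Neighbour at offset (di - 1, dj - 1) from every centre point.
        return padded[di:di + rows, dj:dj + cols]

    # Assumed Sobel masks: difference of bottom and top rows, right and left columns.
    gx = (z(2, 0) + 2 * z(2, 1) + z(2, 2)) - (z(0, 0) + 2 * z(0, 1) + z(0, 2))
    gy = (z(0, 2) + 2 * z(1, 2) + z(2, 2)) - (z(0, 0) + 2 * z(1, 0) + z(2, 0))
    return np.abs(gx) + np.abs(gy)
```

Flat regions give zero response, and a step between two regions gives a response only along the step, which is the behaviour needed to outline continuous spectrogram structures.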
Analysis of this boundary and of the speech spectrogram shows that, at low SNR, the signal and spectrogram features in the high-frequency region of speech are masked by noise, whereas in the low-frequency region the spectrogram structure of voiced segments still has much higher energy than the noise and has a boundary that can be found. The lower the frequency, the more pronounced this effect is, because the energy of voiced speech is concentrated mainly in the first few formants at low and middle frequencies. Therefore, once the boundary of the speech spectrogram has been obtained, a weighted sum of the boundary is computed along the frequency axis of each frame, with higher weights in the low-frequency region. This yields the perception spectrogram structure boundary parameter PSSB.
The proposed perception spectrogram structure boundary parameter PSSB is given by:
(21)
where the left-hand side is the PSSB parameter of the m-th frame and M is the total number of frames.
The PSSB parameter reflects well the relative amount of voiced speech in a frame and is robust to noise.
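The weighting idea can be sketched as follows. Formula (21) is not reproduced in this text, so the linearly decreasing weights and the normalization by the largest value over all M frames are assumptions chosen only to illustrate "higher weight at low frequency"; `pssb` is a hypothetical name.

```python
import numpy as np

def pssb(boundary):
    """Per-frame weighted sum of boundary values, emphasizing low frequencies.

    boundary: array of shape (frequency bins, frames), bin 0 = lowest frequency.
    """
    boundary = np.asarray(boundary, dtype=float)
    n_bins = boundary.shape[0]
    # Assumed weights: 1 at the lowest bin, falling linearly to 0 at the highest.
    weights = np.linspace(1.0, 0.0, n_bins)
    per_frame = weights @ boundary
    # Assumed normalization over all frames so that values lie in [0, 1].
    peak = per_frame.max()
    return per_frame / peak if peak > 0 else per_frame
```

Under this normalization a fixed threshold such as 0.5 has the same meaning for every utterance, which is consistent with the fixed thresholds used in the endpoint detection step.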
3.2 Speech endpoint detection
In speech, voiced segments usually last for a relatively long continuous time. Unvoiced segments occur in two ways: (1) in the middle of a speech segment; (2) at the beginning of a speech segment.
Experiments show that unvoiced sounds in the middle of a speech segment are reliably identified as speech (their PSSB parameter is greater than the threshold 0.5). This is because unvoiced sounds in the middle of a word are usually short, and the invention uses frames with 50% overlap. As a result, an unvoiced sound in the middle of a word is analysed together with the neighbouring voiced sound, so the unvoiced frame carries information from the adjacent voiced frames.
However, as the SNR decreases, particularly below 0 dB, the PSSB feature of unvoiced sounds at the beginning of a speech segment becomes weaker (its value is small). If the endpoints were determined with a single fixed threshold, detection of unvoiced sounds would degrade sharply. Nevertheless, although the PSSB value of an unvoiced sound is small, it is usually still distinguishable (small but non-zero). The invention therefore uses a detection method based on the continuity of speech, which treats voiced segments and the unvoiced segments at the endpoints differently. The endpoint detection method is as follows:
(1) First, detect a segment in which the PSSB parameter is greater than threshold a for m consecutive frames. This segment is the detected voiced segment.
(2) Starting from this segment, all frames that are contiguous with it and whose PSSB value stays greater than or equal to threshold b are also assigned to the speech segment. Threshold b is chosen small; in the experiments, values of b from 0.01 to 0.05 all gave good recognition results. In this way, unvoiced segments with small PSSB values are also identified.
(3) The start point and end point of this speech segment are the speech endpoints.
Experimental tests show that, for white noise, the system performs well with a=0.5, b=0.01 and m=20.
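The three steps above can be sketched as follows, using the reported values a=0.5, b=0.01 and m=20 as defaults. The function name `detect_endpoints` and the reading of "m consecutive frames" as "at least m frames" are assumptions.

```python
def detect_endpoints(pssb_values, a=0.5, b=0.01, m=20):
    """Return (start_frame, end_frame) pairs of detected speech segments."""
    n = len(pssb_values)
    segments = []
    i = 0
    while i < n:
        if pssb_values[i] <= a:
            i += 1
            continue
        # Step 1: measure the run of frames above threshold a.
        j = i
        while j < n and pssb_values[j] > a:
            j += 1
        if j - i < m:
            i = j                      # run too short to be a voiced segment
            continue
        # Step 2: extend in both directions while PSSB stays >= b.
        start, end = i, j - 1
        while start > 0 and pssb_values[start - 1] >= b:
            start -= 1
        while end < n - 1 and pssb_values[end + 1] >= b:
            end += 1
        # Step 3: the ends of the extended segment are the endpoints.
        segments.append((start, end))
        i = end + 1
    return segments
```

The low threshold b only matters next to a confirmed voiced segment, so isolated low-level noise frames elsewhere never start a segment on their own.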
The block diagram of the endpoint detection algorithm of the present invention is shown in Figure 4.
Beneficial effects:
The experiments were carried out in environments with different SNRs. The low-SNR input speech is sampled at 16 kHz with 16-bit quantization. A Hamming window is used, with a frame length of 256 and a frame shift of 128. The speech is taken from the TIMIT speech database, and the white noise from the NoiseX-92 noise database. Fig. 5 is the waveform of one speech example (artists) from the database, and Fig. 6 is the low-SNR speech waveform obtained by adding white noise so that the SNR reaches -10 dB.
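Mixing noise into speech at a prescribed SNR can be sketched as follows. This is a generic illustration of the procedure, not the exact tool used in the experiments; the function name `add_noise_at_snr` and the use of mean power over the whole utterance are assumptions.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches snr_db."""
    speech = np.asarray(speech, dtype=float)
    # Use only as much noise as there is speech.
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Target noise power = speech power / 10^(SNR/10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

At -10 dB the scaled noise carries ten times the power of the speech, which is why the waveform of the speech is no longer visible in the mixture.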
In Fig. 5, the speech starts at frame 40 and ends at frame 87. When white noise is added so that the SNR reaches -10 dB, the speech signal is completely submerged in the noise. Traditional endpoint detection algorithms cannot extract the speech endpoints from such a signal effectively.
Fig. 7 is the spectrogram of the clean speech example (artists), Fig. 8 is the spectrogram of the corresponding low-SNR speech, and Fig. 9 is the spectrogram after speech enhancement based on auditory masking.
As Fig. 8 shows, at -10 dB most of the spectrogram structure of the speech is submerged in noise; only the formant structure in the low-frequency region can still be distinguished from the noise. As Fig. 9 shows, after speech enhancement both the noise and the speech have been weakened, but randomly distributed musical noise remains. This is an inherent property of spectral subtraction algorithms.
If the spectrogram boundary were computed directly from Fig. 9, noise and speech would still be difficult to distinguish. A two-dimensional enhancement of the speech spectrogram is therefore required, as shown in Figures 10 and 11.
Figure 10 is the result of applying the two-dimensional noise erosion algorithm to Fig. 9. Compared with Fig. 9, all residual noise is suppressed to some extent, except for the stronger-energy residual noise and the formant structure of the speech at low frequency. Figure 11 is the result of applying the two-dimensional speech dilation algorithm to the spectrogram of Figure 10. The randomly distributed stronger-energy noise structures are relatively weakened, and the spectrogram structure of the speech is relatively enhanced.
Boundary detection is then applied to Figure 11, giving Figure 12. Between frame 40 and frame 85, the spectrogram boundary structure of the speech in the low-frequency region is recovered well. However, because a small amount of two-dimensional noise structure remains, the boundaries of many high-frequency noise structures also appear in non-speech regions, which is undesirable. For this reason, the PSSB parameter gives a higher weight to boundary structure in the low-frequency region, so that speech and noise are clearly separated, as shown in Figure 13.
Figure 13 shows the PSSB parameter computed from Figure 12. Clearly, even at -10 dB the PSSB parameter of the speech signal still has a very distinctive profile along the time axis. For endpoint detection, a continuity check is applied to the PSSB parameter: if the PSSB value stays above 0 continuously, and within that stretch it stays above the threshold 0.5 for more than 20 consecutive frames, the whole stretch with PSSB continuously above 0 is judged to be a speech segment.
In the experiments, the endpoint detection algorithm of the present invention (PSSB) is compared with four other endpoint detection algorithms in terms of accuracy. The four methods are: 1, energy and short-time zero-crossing rate (EZCR); 2, sub-band amplitude method (SBA); 3, wavelet coefficient method (WC); 4, sub-band spectral entropy method (ABSE). The invention takes 70 words from the TIMIT speech database as the objects of endpoint detection, and endpoint detection is performed 3 times on each word. White noise from the NoiseX-92 noise database is added with suitable weights to obtain speech at different SNRs. An endpoint detection whose error is less than 4 frames is counted as a correct result. Endpoint detection accuracy is defined as: accuracy = number of correct results / total number of speech segments. Table 1 and Figure 14 show the endpoint detection accuracy of the algorithms at different SNRs.
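The accuracy measure defined above can be sketched as follows. The sketch assumes that a result is correct only when both the start and the end of a segment are within the 4-frame tolerance, which the text does not state explicitly; `endpoint_accuracy` is a hypothetical name.

```python
def endpoint_accuracy(detected, reference, tolerance=4):
    """Fraction of segments whose endpoints are off by less than `tolerance` frames.

    detected, reference: equally long lists of (start_frame, end_frame) pairs.
    """
    correct = sum(
        1
        for (d_start, d_end), (r_start, r_end) in zip(detected, reference)
        # Assumed: both endpoints must be within the tolerance.
        if abs(d_start - r_start) < tolerance and abs(d_end - r_end) < tolerance
    )
    return correct / len(reference)
```

For instance, a detection of (40, 85) against a reference of (40, 87) counts as correct, whereas a start point that is 10 frames off does not.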
Table 1. Endpoint detection accuracy (%) at different SNRs
An asterisk (*) in Table 1 indicates that the algorithm fails under that condition, in which case its accuracy is taken as zero. Table 1 and Figure 14 show that at 10 dB the three traditional methods EZCR, SBA and WC have endpoint detection accuracies below 86%. When the SNR falls below 0 dB, these three methods fail completely, which shows that they are not robust to noise. The accuracy of the ABSE method is relatively high, because it also analyses the high-energy components of the clean speech to perform endpoint detection. The method of the present invention, which uses the PSSB parameter, achieves a higher endpoint recognition rate than ABSE, and at -10 dB it still reaches a correct recognition rate of 75.2%.
Embodiment 1
Step 1: speech enhancement based on auditory perception properties; speech enhancement based on auditory masking is used to suppress noise as far as possible while protecting the speech; in this speech enhancement method, the computation of the masking threshold and the speech enhancement system are as follows:
I. Bark-domain power spectrum
The speech signal x(n) is transformed by a fast Fourier transform (FFT) into a frequency-domain signal, whose power spectrum is:
(1)
The Bark power spectrum is obtained by accumulating this power spectrum within each Bark band; its terms denote, respectively, the energy of the i-th Bark band, the lowest frequency of band i, and the highest frequency of band i;
II. Spread Bark-domain power spectrum
A spreading function is introduced; it is a matrix that satisfies the condition:
(3)
and is defined as follows:
(4)
where the argument of the spreading function is the difference between the Bark band numbers of two bands;
III. Calculation of the masking-energy offset function and of the masking threshold
(6)
The coefficient in this formula takes a value between 0 and 1 and is determined by the speech content; the result is the masking threshold of the i-th Bark band, which is subsequently indexed by b, where b has the same meaning as i above;
This threshold is compared with the absolute threshold of hearing (the threshold in quiet):
(8)
and the larger of the two is taken as the final masking threshold; for this comparison the absolute threshold of hearing is expressed as its corresponding Bark-domain masking curve;
IV. Spectral subtraction and adjustment of the subtraction parameters
The gain function used by the spectral subtraction algorithm is as follows:
First, the noise masking threshold is computed in each Bark band of every speech frame; the adaptive subtraction parameters are then obtained from this threshold: if the masking threshold is high, the residual noise is naturally masked and cannot be heard, so the subtraction parameters take their minimum values; if the masking threshold is low, the residual noise strongly affects the listener and must be reduced; for each frame m, the minimum of the masking threshold therefore corresponds to the maximum of the two per-frame subtraction parameters; the subtraction parameters obey the following relation:
(10)
Each subtraction parameter is bounded by its own minimum and maximum values; when the masking threshold is at its minimum, the parameter takes its maximum value; when the masking threshold is at its maximum, the parameter takes its minimum value; the minimum and maximum of the masking threshold used in the formula are obtained frame by frame; in the experiments, the parameters were set as follows:
V. Real-time noise power spectrum estimation; a noise power spectrum estimator based on variance-constrained spectral smoothing and minimum tracking is used;
VI. Speech enhancement system; the adaptive subtraction parameters are obtained from the masking threshold;
Step 2: two-dimensional enhancement of speech;
2.1 Two-dimensional noise erosion algorithm
The two-dimensional noise erosion of the speech spectrogram is determined by the following procedure; first, a short-time Fourier transform is applied to the speech, and the spectrum of each frame is computed by:
(11)
Here the input is the m-th frame of the speech signal and the output is the spectrum of the m-th frame; N is both the frame length and the number of points of the short-time Fourier transform; the window function is a Hamming window; the power spectrum of each frame can be expressed as:
(12)
and this power spectrum is defined as the spectrogram of the speech signal;
The two-dimensional noise erosion of the spectrogram is defined as:
(13)
where the second operand is the structuring element, and the two domains in the formula are the domain of the spectrogram and the domain of the structuring element; the translated coordinates must lie within the domain of the spectrogram, and the offsets must lie within the domain of the structuring element;
To match the structural form of the weaker-energy residual noise, the structuring element of the two-dimensional noise erosion algorithm is defined as:
(14)
2.2 Two-dimensional speech dilation algorithm
For the result of the two-dimensional noise erosion, the two-dimensional speech dilation algorithm is defined by:
(15)
where the second operand is the structuring element, and the two domains in the formula are the domain of the eroded spectrogram and the domain of the structuring element;
The structuring element of the two-dimensional speech dilation algorithm is defined with the following shape:
(16)
Step 3: perception spectrogram structure boundary (PSSB) parameter and endpoint detection algorithm
3.1 Perception spectrogram structure boundary (PSSB) parameter
The invention uses the neighbourhood model of formula (17) to approximate the gradient of the two-dimensional enhancement result;
(17)
The middle element is the central point of this neighbourhood model; the gradient at the centre of the neighbourhood can be expressed as:
(18)
where the two gradient components are given by formula (19) and formula (20):
(19)
(20)
The result is the boundary of the enhanced spectrogram; it describes the boundary information of the continuously distributed speech signal in the noisy spectrogram.
The proposed perception spectrogram structure boundary parameter PSSB is given by:
(21)
where the left-hand side is the PSSB parameter of the m-th frame and M is the total number of frames;
3.2 Speech endpoint detection
A detection method based on the continuity of speech is used, which treats voiced segments and the unvoiced segments at the endpoints differently; the endpoint detection method is as follows:
(1) first, detect a segment in which the PSSB parameter is greater than threshold a for m consecutive frames; this segment is the detected voiced segment;
(2) starting from this segment, all frames that are contiguous with it and whose PSSB value stays greater than or equal to threshold b are also assigned to the speech segment; threshold b is chosen small, and in the experiments values of b from 0.01 to 0.05 all gave good recognition results; in this way, unvoiced segments with small PSSB values are also identified;
(3) the start point and end point of this speech segment are the speech endpoints.
The experiments were carried out in environments with different SNRs; the low-SNR input speech is sampled at 16 kHz with 16-bit quantization; a Hamming window is used, with a frame length of 256 and a frame shift of 128; the speech is taken from the TIMIT speech database, and the white noise from the NoiseX-92 noise database.