CN104091593B - Speech endpoint detection algorithm adopting perceptual speech spectrum structure boundary parameters

Info

Publication number: CN104091593B
Application number: CN201410175090.8A
Authority: CN
Inventors: 吴迪; 赵鹤鸣; 陶智
Original assignee: Suzhou University
Current assignee: Suzhou Cheng Bang Energy Conservation Science & Technology Co Ltd
Priority date: 2014-04-29
Filing date: 2014-04-29
Publication date: 2017-02-15
Anticipated expiration: 2034-04-29
Also published as: CN104091593A

Abstract

The invention belongs to the field of speech recognition and discloses a speech endpoint detection algorithm that uses a perceptual speech spectrum structure boundary (PSSB) parameter. After speech enhancement based on auditory perception characteristics is applied to the noisy speech, the time-frequency speech spectrum of the enhanced speech is enhanced in two dimensions, exploiting the difference between the continuous distribution of the speech signal and the random distribution of the residual noise, so that the spectral structure of the continuously distributed clean speech is further highlighted. The PSSB parameter is derived by two-dimensional boundary detection on the enhanced speech spectral structure and is used for endpoint detection. Experimental results show that, under white noise at signal-to-noise ratios from -10 dB to 10 dB, the endpoint detection algorithm using the PSSB parameter detects speech endpoints more effectively. Even at a very low signal-to-noise ratio of -10 dB, the proposed method still achieves 75.2% accuracy.

Description

Speech endpoint detection algorithm adopting perceptual speech spectrum structure boundary parameters

Technical Field

The invention belongs to the field of speech recognition and relates to a speech endpoint detection algorithm, in particular to one that uses a perceptual speech spectrum structure boundary parameter.

Background

Endpoint detection is the basis of speech recognition and speaker recognition; detecting endpoints correctly and effectively can greatly improve the recognition rate of both kinds of system. In the high signal-to-noise ratio environment of a laboratory, traditional endpoint detection algorithms detect speech endpoints well. In low signal-to-noise ratio environments, however, the performance of most endpoint detection algorithms drops dramatically.

In recent years, many scholars have studied noise-robust endpoint detection. Ganapathiraju (A. Ganapathiraju, et al. Comparison of Energy-Based Endpoint Detectors for Speech Signal Processing. In Proc. IEEE Publications, 1996; 500-. Compared with the traditional energy method, this method gives more robust endpoint detection, but it does not work in environments with lower signal-to-noise ratios. Acta Acustica, 2005 (30(2): 171-. Furthermore, Zhang et al. (Xueying Zhang, et al. A Speech Endpoint Detection Method Based on Wavelet Coefficient Variance and Sub-Band Amplitude Variance. In Proc. IEEE ICICIC, 2006; 105) propose a wavelet coefficient (WC) method for endpoint detection. Because wavelet analysis can analyze a signal at multiple scales, the method can distinguish speech segments from noise segments to a certain extent. Wu et al. (Bing-Fei Wu, Kun-Ching Wang. Robust Endpoint Detection Algorithm Based on the Adaptive Band-Partitioning Spectral Entropy in Adverse Environments. IEEE Transactions on Speech and Audio Processing, 2005; 13(5): 762-775) use adaptive sub-band spectral entropy (ABSE) for endpoint detection. The method separates the sub-band signal of the speech from the noise well and achieves better endpoint detection accuracy in noisy environments. Li (Q. Li, et al. A Robust, Real-Time Endpoint Detector with Energy Normalization for ASR in Adverse Environments. International Conference on Acoustics, Speech and Signal Processing, 2001; 574) applies an optimal edge detection method from image processing to speech endpoint detection, using a filter and three-state decision logic so that the threshold does not need to be adjusted for different signal-to-noise ratios. By borrowing an image processing algorithm, the method provides useful support for endpoint detection. However, none of the above methods achieves high endpoint detection accuracy in low signal-to-noise ratio environments.

Disclosure of Invention

The technical problem to be solved: in low signal-to-noise ratio environments, the endpoint detection accuracy of conventional endpoint detection methods is very low.

The technical scheme: based on the different characteristics of the speech signal and the noise signal in the two-dimensional time-frequency domain at low signal-to-noise ratios, and in combination with a speech enhancement algorithm based on auditory perception characteristics, a perceptual speech spectrum structure boundary parameter, PSSB (Perceptual Spectrogram Structure Boundary), is proposed and used for endpoint detection. First, speech enhancement based on auditory masking is applied to the low signal-to-noise ratio speech. Compared with conventional speech enhancement algorithms, this method preserves the speech components perceptible to the human ear more effectively. On this basis, the continuous distribution of the clean speech spectrum along the time axis is considered at the two-dimensional level, and the noisy speech is enhanced in two dimensions, which further highlights the spectral structure of the speech and suppresses the spectral structure of the noise. Finally, the two-dimensional boundary of the continuously distributed spectral structure of the clean speech is found, and the PSSB parameter is derived from it for endpoint detection.

1. Speech enhancement based on auditory perception characteristics

In low signal-to-noise ratio environments, most endpoint detection algorithms cannot detect speech endpoints well, and some fail completely. Humans, however, can recognize speech segments in noisy environments, and the auditory perception characteristics of the human ear play an important role in this. By exploiting auditory masking, one of these characteristics, noise can be suppressed to a certain degree while more speech components are retained. The PSSB parameter proposed by the invention therefore uses speech enhancement based on auditory masking to suppress noise as far as possible while protecting the speech. The most important step in this kind of speech enhancement is the calculation of the masking threshold. The masking threshold calculation and the speech enhancement system are as follows:

(1) Bark-domain power spectrum

The speech signal $x(n)$ is transformed by the fast Fourier transform (FFT) into the frequency-domain signal $X(k)$. The signal power spectrum is:

$$P(k) = |X(k)|^2 \tag{1}$$

The Bark power spectrum is:

$$B_i = \sum_{k = bl_i}^{bh_i} P(k) \tag{2}$$

where $B_i$ represents the energy of the $i$-th Bark band, $bl_i$ indicates the lowest frequency of the $i$-th band, and $bh_i$ indicates the highest frequency of the $i$-th band.
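The Bark-band summation described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `bark_band_power` and the `band_edges` list of (lowest bin, highest bin) pairs are hypothetical, and the example edges are not real Bark band limits.

```python
import numpy as np

def bark_band_power(frame, band_edges):
    """Sum the FFT power spectrum of one frame over each band.

    band_edges is a hypothetical list of (lowest_bin, highest_bin) pairs,
    inclusive, one pair per Bark band.
    """
    spectrum = np.fft.rfft(frame)      # frequency-domain signal
    power = np.abs(spectrum) ** 2      # power spectrum
    return np.array([power[lo:hi + 1].sum() for lo, hi in band_edges])

# Illustrative edges only; real Bark band limits depend on the sampling rate.
edges = [(0, 3), (4, 7), (8, 15)]
frame = np.ones(32)                    # constant frame: all energy falls in bin 0
print(bark_band_power(frame, edges))
```

In practice the bin ranges would be derived from the Bark band limits in Hz, the sampling rate and the FFT length.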

(2) Spread Bark-domain power spectrum

A spreading function is introduced. It is a matrix that satisfies the condition of equation (3) and is defined by equation (4), whose argument is the difference between the band numbers of the two bands.

(3) Calculation of the masking-energy offset function and the masking threshold

The offset function of the masking energy is given by equation (6). Its coefficient takes a value between 0 and 1 that is determined by the speech content. The masking threshold of the $i$-th Bark band (also indexed by $b$, which has the same meaning as $i$) is compared with the absolute threshold of hearing in quiet, equation (8), and the larger of the two is taken as the final fitted masking threshold on the corresponding Bark masking curve.

(4) Spectral subtraction and adjustment of the subtraction parameters

The gain function employed by the spectral subtraction algorithm is given by equation (9).

First, the noise masking thresholds of the different Bark bands are calculated for each frame of speech, and the adaptive subtraction parameters $\alpha$ and $\beta$ are then obtained from them. If the masking threshold is high, the residual noise is naturally masked and inaudible to the human ear, so the subtraction parameters take their minimum values. If the masking threshold is low, the residual noise strongly affects the ear and must be reduced. For each frame $m$, the minimum of the masking threshold $T(m)$ therefore corresponds to the maximum of the per-frame subtraction parameters $\alpha_m$ and $\beta_m$. The subtraction parameters are set by the following relation:

$$\alpha_m = F_\alpha[\alpha_{\min}, \alpha_{\max}, T(m)], \qquad \beta_m = F_\beta[\beta_{\min}, \beta_{\max}, T(m)] \tag{10}$$

where $\alpha_{\min}$, $\alpha_{\max}$ and $\beta_{\min}$, $\beta_{\max}$ are the minimum and maximum values of the parameters $\alpha$ and $\beta$. When $T(m) = T_{\min}$, $F_\alpha = \alpha_{\max}$; when $T(m) = T_{\max}$, $F_\alpha = \alpha_{\min}$. Here $T_{\min}$ and $T_{\max}$ are the minimum and maximum values of the masking threshold obtained frame by frame. In the experiment, the values of each parameter are as follows:

(5) Real-time noise power spectrum estimation

Speech enhancement requires a noise spectrum estimation method with strong real-time performance. A noise power spectrum estimation method based on variance-constrained spectral smoothing and minimum tracking is adopted. The core of the algorithm is a variance-constrained smoothing filter that controls the variance of the short-time smoothed power spectrum, which makes the minimum tracking more accurate. The noise spectrum estimated in this way follows abrupt changes in the noise promptly, introduces no obvious delay, and is more accurate than the noise spectrum estimated by other methods.

(6) Speech enhancement system

The adaptive subtraction parameters $\alpha$ and $\beta$ are obtained from the masking threshold. The speech enhancement system is shown in Fig. 1.
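The mapping from masking threshold to subtraction parameter in step (4) can be sketched as follows. The text fixes only the two end points (lowest threshold gives the largest parameter, highest threshold gives the smallest); the linear interpolation in between, the function name `adaptive_parameter` and the numeric values in the usage line are assumptions made for illustration.

```python
import numpy as np

def adaptive_parameter(threshold, t_min, t_max, p_min, p_max):
    """Map a masking threshold to a subtraction parameter.

    Assumption: linear interpolation between the end points stated in the
    text (t_min -> p_max, t_max -> p_min).
    """
    if t_max == t_min:
        return p_max
    ratio = (np.clip(threshold, t_min, t_max) - t_min) / (t_max - t_min)
    return p_max - ratio * (p_max - p_min)

# Hypothetical values, not the patent's experimental settings.
print(adaptive_parameter(5.0, t_min=0.0, t_max=10.0, p_min=1.0, p_max=6.0))
```

The same function would be called once per frame for each of the two subtraction parameters, with the threshold range updated frame by frame.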

2. Two-dimensional enhancement of speech

After speech enhancement of low signal-to-noise ratio speech, both noise and speech are attenuated by the spectral subtraction. However, because voiced segments contain high-energy structures such as formants, the low-frequency region of the speech spectrum retains a relatively high signal-to-noise ratio in the two-dimensional time-frequency domain even under noise interference, and these high-energy structures are usually distributed continuously in time. Thus, once these continuously distributed high-energy regions are found in the two-dimensional speech spectrum, and the unvoiced segments connected to them are found in turn, the start and end points of the speech can be determined. In this method, boundary detection is the algorithm used to find such continuously distributed two-dimensional structures.

However, whether or not the low signal-to-noise ratio speech has been enhanced, noise (after enhancement, the residual musical noise) leaves boundaries of its own spectral structure in the boundary detection result. These interfere with the spectral structure of the clean speech and make it much harder to find, as shown in Figs. 2 and 3.

Fig. 2 is a spectrogram of speech containing white noise at -5 dB. The continuously distributed dark horizontal stripes are the speech signal (in the high-frequency band, the lower-energy speech is masked by noise, so the formant structure there is not visible in the spectrogram), and the dark speckled background is the white noise. Fig. 3 is the spectrogram after speech enhancement: the noise is greatly attenuated, but residual musical noise of varying strength remains. The invention divides this residual noise into stronger-energy residual noise and weaker-energy residual noise, as shown in Fig. 3. Both interfere significantly with finding the speech endpoints. Therefore, before the endpoints are determined, the invention applies two-dimensional enhancement to the speech, exploiting the difference between the spectral structure of the residual noise and that of the clean speech. The enhancement consists of a two-dimensional noise erosion algorithm and a two-dimensional speech dilation algorithm.

2.1 Two-dimensional noise erosion algorithm

In two-dimensional data enhancement, an erosion algorithm can weaken or eliminate certain two-dimensional structures. In the speech spectrum after speech enhancement, the weaker residual noise (the dark speckles in Fig. 3) is usually randomly distributed and small in both size and energy. Although these structures are not as strong as the white noise in Fig. 3, they still interfere with finding the spectral structure boundaries of the clean speech. To address this, the invention proposes a two-dimensional noise erosion algorithm that weakens such two-dimensional structures.

The two-dimensional noise erosion algorithm for the speech spectrum is defined as follows. First, a short-time Fourier transform is applied to the speech. The spectrum $X_m(k)$ of each frame is calculated from:

$$X_m(k) = \sum_{n=0}^{N-1} x_m(n)\, w(n)\, e^{-j 2\pi n k / N} \tag{11}$$

where $x_m(n)$ is the $m$-th frame of the speech signal, $X_m(k)$ is the spectrum of the $m$-th frame, $N$ is both the frame length and the number of short-time Fourier transform points, and $w(n)$ is a Hamming window. The power spectrum of each frame can be expressed as:

$$P(m, k) = |X_m(k)|^2 \tag{12}$$

$P(m, k)$ is defined as the speech spectrum of the speech signal.

The two-dimensional noise erosion of $P$ by a structuring element $b_1$ is defined as:

$$(P \ominus b_1)(s, t) = \min\{\, P(s + x,\, t + y) - b_1(x, y) \mid (s + x,\, t + y) \in D_P;\ (x, y) \in D_{b_1} \,\} \tag{13}$$

where $b_1$ is the structuring element, $D_P$ is the domain of $P$, and $D_{b_1}$ is the domain of $b_1$. The translated coordinates $(s + x)$ and $(t + y)$ must lie within the domain of $P$, and $x$ and $y$ must lie within the domain of $b_1$. Two-dimensional noise erosion has two effects on the signal: (1) if all elements of the structuring element are positive, the output signal tends to be weaker than the original signal; (2) a noise structure in the input speech spectrum that resembles the structuring element is weakened, and the degree of weakening depends on the shape of the noise structure and the shape of the structuring element.
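The erosion described above takes, at every spectrogram position, the minimum of the signal minus the structuring element over all offsets that stay inside the spectrogram. A minimal NumPy sketch under that reading follows; the function name `grey_erode`, the choice of the element's centre as its origin, and the flat 3x3 element in the usage lines are assumptions, not the patent's structuring element.

```python
import numpy as np

def grey_erode(spec, elem):
    """Greyscale erosion of a 2-D array by a structuring element.

    out[s, t] = min over (x, y) of spec[s + x, t + y] - elem[x, y],
    using only offsets that stay inside spec. The element origin is
    assumed to be its centre.
    """
    rows, cols = spec.shape
    er, ec = elem.shape
    cr, cc = er // 2, ec // 2
    out = np.full(spec.shape, np.inf)
    for x in range(er):
        for y in range(ec):
            dr, dc = x - cr, y - cc
            # output positions for which spec[s + dr, t + dc] exists
            r0, r1 = max(0, -dr), min(rows, rows - dr)
            c0, c1 = max(0, -dc), min(cols, cols - dc)
            shifted = spec[r0 + dr:r1 + dr, c0 + dc:c1 + dc] - elem[x, y]
            out[r0:r1, c0:c1] = np.minimum(out[r0:r1, c0:c1], shifted)
    return out

# An isolated one-bin dot (a stand-in for weak residual noise) is removed.
spec = np.zeros((5, 5))
spec[2, 2] = 1.0
print(grey_erode(spec, np.zeros((3, 3))).max())
```

A structure at least as large as the element survives erosion, which is why a small element targets small, isolated noise dots.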

Applied to the spectral structure of speech, erosion attenuates both noise and speech. The two-dimensional noise erosion algorithm of the invention aims to weaken the noise relatively more while better retaining the speech. To match the spectral morphology of the weaker-energy residual noise, the structuring element $b_1$ of the two-dimensional noise erosion algorithm is defined by equation (14). Such a structuring element is close to the spectral structure of the weaker-energy residual noise (small dots), so eroding the speech spectrum with $b_1$ attenuates that noise to some extent.

2.2 Two-dimensional speech dilation algorithm

The two-dimensional noise erosion algorithm suppresses the weaker-energy residual noise well. However, the stronger-energy residual noise (Fig. 3) is close in energy to the clean speech, so excessive erosion would also weaken the two-dimensional structure of the clean speech. A dilation algorithm enhances two-dimensional spectral structures that resemble the structuring element, so dissimilar structures are weakened in relative terms. The invention therefore proposes a two-dimensional speech dilation algorithm that exploits the difference between the stronger-energy residual noise and the structure of clean speech. The structuring element is defined to resemble the continuously distributed structure of clean speech, which relatively suppresses the noise structure.

Let $E$ denote the result of the two-dimensional noise erosion. The two-dimensional speech dilation of $E$ by a structuring element $b_2$ is defined as:

$$(E \oplus b_2)(s, t) = \max\{\, E(s - x,\, t - y) + b_2(x, y) \mid (s - x,\, t - y) \in D_E;\ (x, y) \in D_{b_2} \,\} \tag{15}$$

where $b_2$ is the structuring element, $D_E$ is the domain of $E$, and $D_{b_2}$ is the domain of $b_2$. Conceptually, the structuring element is translated to every position in the speech spectrum, its values are added to the values of the two-dimensional signal, and the maximum is taken. Two-dimensional speech dilation has two effects on the signal: (1) if all elements of the structuring element are positive, the output signal tends to be stronger than the original signal; (2) whether a given structure in the input speech spectrum is relatively enhanced depends on the values and shape of the structuring element used for dilation.

While dilation enhances the speech structure, it also enhances the corresponding noise structure. The two-dimensional speech dilation algorithm of the invention aims to enhance the speech structure as much as possible and relatively suppress the noise structure. The spectral structure of the voiced part of clean speech is usually a long strip extending along the time axis, whereas the spectral structure of the stronger-energy residual noise is usually a square or round blob of varying size, as shown in Fig. 3. The structuring element is therefore defined as a long strip extending along the time axis, so that all similar structures are enhanced and dissimilar noise structures are relatively weakened.

The structuring element $b_2$ of the two-dimensional speech dilation algorithm is accordingly defined with the shape given by equation (16).

Here $b_2$ is a horizontal structuring element extending along the time axis, and all structures similar to it are enhanced. Because the spectral structure of clean speech is usually continuously distributed in time, it resembles $b_2$ and is therefore enhanced. The spectral structure of the stronger-energy residual noise, which usually takes the form of large round or square dots, is relatively weakened.
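Dilation with a strip-shaped element along the time axis can be sketched in the same style. The function name `grey_dilate`, the centre-origin convention, the (frequency, time) array layout and the flat 1x5 element are assumptions for illustration; the patent's own structuring element values are not reproduced here.

```python
import numpy as np

def grey_dilate(spec, elem):
    """Greyscale dilation of a 2-D array by a structuring element.

    out[s, t] = max over (x, y) of spec[s - x, t - y] + elem[x, y],
    using only offsets that stay inside spec. The element origin is
    assumed to be its centre.
    """
    rows, cols = spec.shape
    er, ec = elem.shape
    cr, cc = er // 2, ec // 2
    out = np.full(spec.shape, -np.inf)
    for x in range(er):
        for y in range(ec):
            dr, dc = x - cr, y - cc
            # output positions for which spec[s - dr, t - dc] exists
            r0, r1 = max(0, dr), min(rows, rows + dr)
            c0, c1 = max(0, dc), min(cols, cols + dc)
            shifted = spec[r0 - dr:r1 - dr, c0 - dc:c1 - dc] + elem[x, y]
            out[r0:r1, c0:c1] = np.maximum(out[r0:r1, c0:c1], shifted)
    return out

# Rows are frequency bins, columns are frames (assumed layout).
spec = np.zeros((3, 7))
spec[1, 3] = 1.0
strip = np.zeros((1, 5))            # flat strip, five frames long
print(grey_dilate(spec, strip)[1])
```

With a one-row element the signal spreads only along time, so structures that are already long in time are reinforced while the frequency extent of blob-like noise is left unchanged.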

3. Perceptual speech spectrum structure boundary (PSSB) parameter and endpoint detection algorithm

3.1 Perceptual speech spectrum structure boundary (PSSB) parameter

The invention considers, at the two-dimensional level, the continuous distribution of the clean speech spectrum along the time axis and applies two-dimensional enhancement to the noisy speech, which further highlights the spectral structure of the speech while suppressing that of the noise. The invention then finds the boundary of the continuously distributed spectral structure of the clean speech and proposes the perceptual speech spectrum structure boundary parameter, PSSB, for endpoint detection.

To compute the PSSB parameter, the boundary information of the speech spectral structure is obtained first. Boundary detection is an important method for finding the boundary of a two-dimensional structure, and the boundary of a continuous two-dimensional signal can be determined from the gradient of its first-order derivative. The invention uses the neighborhood model of equation (17) to approximate the gradient of the two-dimensional speech enhancement result. The center point of this neighborhood model is the point being evaluated; the gradient at the center is given by equation (18), and its two components are determined by equations (19) and (20). The resulting gradient is the boundary information that describes the continuous distribution of the speech signal in the noisy speech spectrum.

Analysis of this boundary information together with the speech spectrum shows that, in low signal-to-noise ratio environments, the signal and spectral features of the high-frequency region of the speech are masked by noise, whereas in the low-frequency region the spectral structure of voiced segments still has high energy relative to the noise and has a boundary that can be found. The lower the frequency, the more pronounced this effect is, because the energy of voiced segments is concentrated mainly in the first few formants at low and middle frequencies. Therefore, after the speech spectrum boundary is obtained, its values along the frequency axis of each frame are combined in a weighted sum that gives higher weights to the low-frequency region, yielding the perceptual speech spectrum structure boundary parameter PSSB.

The PSSB parameter is defined by equation (21), in which $PSSB(m)$ is the PSSB parameter of the $m$-th frame and $M$ is the total number of frames.

The PSSB parameter reflects well the relative content of voiced speech in a frame and is robust to noise.
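The idea of a per-frame, low-frequency-weighted sum of boundary strength can be illustrated as below. This is only a sketch of the general shape of such a parameter, not the patent's formula: the use of `np.gradient` in place of the neighborhood model, the linearly decreasing weights, the normalisation by the maximum over all frames, and the name `pssb_like` are all assumptions.

```python
import numpy as np

def pssb_like(spec):
    """Per-frame boundary score for a 2-D spectrogram.

    spec has shape (n_freq, n_frames), low frequencies first (assumed layout).
    """
    d_freq, d_time = np.gradient(spec.astype(float))
    boundary = np.hypot(d_freq, d_time)              # gradient magnitude
    weights = np.linspace(1.0, 0.0, spec.shape[0])   # assumed: heavier at low frequency
    per_frame = weights @ boundary                   # weighted sum over frequency
    peak = per_frame.max()
    return per_frame / peak if peak > 0 else per_frame

# A short low-frequency strip spanning frames 3 to 5.
spec = np.zeros((8, 10))
spec[1, 3:6] = 1.0
print(np.round(pssb_like(spec), 2))
```

Frames that contain a low-frequency boundary score close to 1, while frames far from any structure score 0, which is the behaviour a fixed threshold such as 0.5 relies on.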

3.2 Voice endpoint detection

Voiced segments in speech typically last for a long, continuous stretch of time. Unvoiced segments occur in two positions: (1) in the middle of a speech segment; (2) at the beginning of a speech segment.

Experiments show that unvoiced sound in the middle of a speech segment is reliably recognized as speech (PSSB parameter greater than the threshold 0.5). This is because unvoiced sound in the middle of a word is usually short, and the invention uses frames with 50% overlap. As a result, the unvoiced sound in the middle of the word is analyzed together with the adjacent voiced sound, so the unvoiced frames carry information from the neighboring voiced frames.

However, as the signal-to-noise ratio decreases, particularly below 0 dB, the PSSB parameter discriminates unvoiced sound at the beginning of a speech segment less well (its values become smaller). If endpoints were determined with a single fixed threshold, detection performance for unvoiced sound would fall rapidly. Nevertheless, although the PSSB value of unvoiced sound is small relative to voiced sound, it is generally still distinguishable (small but not zero). The invention therefore adopts a detection method based on the continuous distribution of speech to identify the voiced segment and the unvoiced segments at its ends. The specific endpoint detection method is as follows:

(1) First, a run of at least m consecutive frames whose PSSB parameter is greater than the threshold a is detected. This run is a detected voiced segment.

(2) Starting from this segment, all frames that are connected to it and whose PSSB parameter stays greater than or equal to the threshold b are also defined as speech. The threshold b is small; in the experiment, values of b from 0.01 to 0.05 gave the better recognition results. This allows unvoiced segments with smaller PSSB values to be identified.

(3) The start point and end point of the resulting speech segment are the speech endpoints.

Experimental tests show that, for white noise, the system performs better with a = 0.5, b = 0.01 and m = 20.
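The three steps above can be sketched as a small function. The thresholds default to the values reported for white noise; the function name `detect_endpoints` and the reading of "m frames" as "at least m consecutive frames" are assumptions.

```python
import numpy as np

def detect_endpoints(pssb, a=0.5, b=0.01, m=20):
    """Return (start, end) frame indices of detected speech segments."""
    pssb = np.asarray(pssb, dtype=float)
    n = len(pssb)
    segments = []
    i = 0
    while i < n:
        if pssb[i] <= a:
            i += 1
            continue
        j = i
        while j < n and pssb[j] > a:       # step 1: run of frames above a
            j += 1
        if j - i < m:                      # run too short to count as voiced
            i = j
            continue
        start, end = i, j - 1
        while start > 0 and pssb[start - 1] >= b:    # step 2: extend backwards
            start -= 1
        while end < n - 1 and pssb[end + 1] >= b:    # step 2: extend forwards
            end += 1
        segments.append((start, end))      # step 3: segment ends are the endpoints
        i = end + 1
    return segments

# Synthetic PSSB track: weak unvoiced onset, voiced core, weak tail.
track = [0.0] * 10 + [0.02] * 5 + [0.8] * 25 + [0.03] * 3 + [0.0] * 10
print(detect_endpoints(track))
```

The weak onset and tail are kept because they touch the voiced core, while an isolated short burst above the upper threshold is ignored.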

A block diagram of the endpoint detection algorithm of the present invention is shown in fig. 4.

Advantageous effects:

The experiments were designed for different signal-to-noise ratio environments. The input low signal-to-noise ratio speech is sampled at 16 kHz with 16-bit quantization. A Hamming window is used, with a frame length of 256 and a frame shift of 128. The speech was selected from the TIMIT speech database, and the white noise comes from the NOISEX-92 noise database. Fig. 5 is the waveform of an example utterance ("artists") from the database, and Fig. 6 is the low signal-to-noise ratio waveform obtained by adding white noise to bring the signal-to-noise ratio to -10 dB.
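The experimental setup (noise mixed to a target signal-to-noise ratio, then 256-sample Hamming-windowed frames with a shift of 128) can be sketched as follows. The synthetic sine and random noise stand in for the TIMIT and NOISEX-92 recordings, and the function names are hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then add it."""
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

def frame_signal(signal, frame_len=256, hop=128):
    """Split a signal into Hamming-windowed frames with 50% overlap."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    index = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[index] * np.hamming(frame_len)

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)  # stand-in for a speech word
noise = rng.standard_normal(16000)                            # stand-in for white noise
noisy = add_noise_at_snr(speech, noise, snr_db=-10)
frames = frame_signal(noisy)
print(frames.shape)
```

Each row of `frames` is one windowed frame, ready for the short-time Fourier transform.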

In Fig. 5, the speech starts at frame 40 and ends at frame 87. When white noise is added to bring the signal-to-noise ratio to -10 dB, the speech signal is completely submerged in the noise. Conventional endpoint detection algorithms cannot effectively extract the speech endpoints from such a signal.

Fig. 7 is the spectrogram of the clean speech example ("artists"), Fig. 8 is the spectrogram of the low signal-to-noise ratio speech, and Fig. 9 is the spectrogram after speech enhancement based on auditory masking.

As Fig. 8 shows, for speech at a signal-to-noise ratio of -10 dB, most of the spectral structure is drowned out by noise, and only the formant structure in the low-frequency region can be distinguished from the noise. After speech enhancement, Fig. 9 shows that both the noise and the speech are attenuated and that randomly distributed musical noise remains. This is inherent in the spectral subtraction algorithm itself.

If the boundaries were found directly from the speech spectrum of Fig. 9, noise and speech would still be difficult to tell apart. Two-dimensional enhancement of the speech spectrum is therefore needed, as shown in Figs. 10 and 11.

Fig. 10 is the result of applying the two-dimensional noise erosion algorithm to Fig. 9. Compared with Fig. 9, the residual noise is suppressed to some extent, leaving mainly the stronger-energy residual noise and the formant structure of the speech at low frequencies. Fig. 11 is the result of applying the two-dimensional speech dilation algorithm to the speech spectrum in Fig. 10. The randomly distributed, stronger-energy noise structure is relatively weakened, and the spectral structure of the speech is relatively enhanced.

The boundary of Fig. 11 is then detected, giving Fig. 12. Between frames 40 and 85, the boundary structure of the speech spectrum in the low-frequency region is found well. However, because a small amount of two-dimensional noise structure remains, many boundary structures of mid- and high-frequency noise appear in the non-speech region, which is undesirable. For this reason, the PSSB parameter gives the boundary structure of the low-frequency region a higher weight, which separates speech and noise well, as shown in Fig. 13.

Fig. 13 shows the PSSB parameter derived from Fig. 12. Even at -10 dB, the PSSB parameter of the speech signal clearly retains strong discriminating power along the time axis. During endpoint detection, the PSSB parameter is examined frame by frame: if the PSSB value stays above 0 over a run of frames and, within that run, more than 20 consecutive frames exceed the threshold 0.5, the frames whose PSSB value stays above 0 are judged to be a speech segment.

In the experiment, the endpoint detection algorithm of the invention (PSSB) is compared with four other endpoint detection algorithms in terms of accuracy. The four methods are: 1, energy and short-time zero-crossing rate (EZCR); 2, sub-band amplitude method (SBA); 3, wavelet coefficient method (WC); 4, sub-band spectral entropy method (ABSE). Seventy words from the TIMIT speech database were selected as the objects of endpoint detection, and endpoint detection was performed three times on each word. White noise from the NOISEX-92 noise database was added with appropriate weights to obtain speech at different signal-to-noise ratios. An endpoint detection with an error of less than 4 frames is counted as a correct result. The endpoint detection accuracy is defined as the number of correct results divided by the total number of speech segments used for endpoint detection. Table 1 and Fig. 14 show the endpoint detection accuracy of the various algorithms at different signal-to-noise ratios.
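The accuracy measure defined above can be written out directly. The function name `endpoint_accuracy` is hypothetical, and requiring both the start and the end error to be below the tolerance is an assumption; the text states only that the error must be less than 4 frames.

```python
def endpoint_accuracy(detected, reference, tolerance=4):
    """Fraction of utterances whose detected endpoints fall within tolerance frames.

    detected and reference are lists of (start_frame, end_frame) pairs in the
    same order. Assumption: both endpoints must be within the tolerance.
    """
    correct = sum(
        abs(ds - rs) < tolerance and abs(de - re) < tolerance
        for (ds, de), (rs, re) in zip(detected, reference)
    )
    return correct / len(reference)

# Hypothetical results for two utterances.
print(endpoint_accuracy([(40, 87), (10, 50)], [(41, 85), (10, 60)]))
```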

TABLE 1 Endpoint detection accuracy (%) at different signal-to-noise ratios

The "+" in Table 1 indicates that the algorithm failed under that condition, in which case the accuracy is taken to be zero. Table 1 and Fig. 14 show that, for the three conventional methods EZCR, SBA and WC, the endpoint detection accuracy is already below 86% at 10 dB. When the signal-to-noise ratio falls below zero, these three methods fail completely, so they are not robust to noise. The ABSE method is relatively accurate because it also analyzes the high-energy components of the clean speech when detecting endpoints. The PSSB method of the invention achieves a higher endpoint recognition rate than ABSE and still reaches a correct recognition rate of 75.2% at -10 dB.

Description of the drawings:

FIG. 1 is a speech enhancement system based on auditory properties;

FIG. 2 is a spectrogram of speech containing white noise at -5 dB;

FIG. 3 is a spectrogram after speech enhancement;

FIG. 4 is an endpoint detection algorithm using PSSB parameters;

FIG. 5 is clean speech;

FIG. 6 is low signal-to-noise ratio speech at -10 dB;

FIG. 7 is a spectrogram of a clean speech signal;

FIG. 8 is a spectrogram of a low signal-to-noise ratio speech signal at -10 dB;

FIG. 9 is the result of speech enhancement;

FIG. 10 is a spectrogram after the two-dimensional noise erosion algorithm;

FIG. 11 is a spectrogram after the two-dimensional speech dilation algorithm;

FIG. 12 is the speech spectrum boundary;

FIG. 13 shows the PSSB parameter and endpoint detection;

FIG. 14 is a comparison of endpoint detection results.

Detailed Description

Example 1

The first step is as follows: speech enhancement based on auditory perception characteristics; the voice enhancement based on the auditory masking characteristic is adopted, and noise is suppressed as much as possible on the basis of protecting voice; the masking threshold calculation and speech enhancement system in the speech enhancement method is as follows:

bark threshold power spectrum

(1)

the Bark power spectrum is:

whereinRepresents the energy of the ith Bark band,indicates the lowest frequency of the i-th segment,represents the highest frequency of the ith segment;

ii diffusion Bark domain power spectrum

Introducing a diffusion functionIt is a matrix, satisfying the condition:

(3)

the definition formula is as follows:

(4)

represents the difference between the band numbers of the two bands;

masking energy iiiIs offset function ofAnd masking thresholdIs calculated by

(6)

The value is between 0 and 1 and is determined by the voice content;is the masking threshold of the ith Bark band, which is referred to asWherein b has the same meaning as i above;

and threshold of quiet hearing threshold:

(8)

comparing and taking the maximum valueAs the masking threshold of the final fit; whereinIs composed ofThe corresponding Bark masking curve;

iv. adjustment of spectral subtraction and subtraction parameters

The gain function employed by the spectral subtraction algorithm is as follows:

firstly, calculating noise masking threshold values of different Bark domains of each frame of voice, and then obtaining self-adaptive subtraction parameters according to the noise masking threshold values、: if the masking threshold is high, the residual noise will be naturally masked and inaudible to the human ear, in which case the subtraction parameters take their minimum value; when the masking threshold is low, the residual noise has a great influence on human ears, and it is necessary to reduce the residual noise; for each frame m, the masking thresholdMinimum of (2) and subtraction parameter per frameAndis related to the maximum value of; the application of the subtraction parameter has the following relation:

(10)

where the first pair of symbols are the minimum and maximum values of the masking threshold, and the two remaining pairs are the minimum and maximum values of the two subtraction parameters, respectively. When the masking threshold equals its minimum, each subtraction parameter takes its maximum value; when the masking threshold equals its maximum, each subtraction parameter takes its minimum value. The minimum and maximum of the masking threshold in the formula are obtained frame by frame. In the experiment, the values of each parameter are as follows:
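Independently of the specific experimental values, the adaptation rule described above (maximum subtraction at the lowest masking threshold, minimum subtraction at the highest) can be sketched as a linear interpolation. The linear form, the function name `adapt_parameter`, and the numbers in the usage note are assumptions made for illustration only.

```python
def adapt_parameter(threshold, t_min, t_max, p_min, p_max):
    """Map a masking threshold to a subtraction parameter (assumed linear rule).

    threshold == t_min -> p_max (residual noise is audible, subtract strongly)
    threshold == t_max -> p_min (residual noise is masked, subtract weakly)
    """
    if t_max == t_min:
        # Degenerate frame with a flat threshold: fall back to the strongest
        # subtraction (an arbitrary choice made for this sketch).
        return p_max
    ratio = (threshold - t_min) / (t_max - t_min)
    ratio = min(max(ratio, 0.0), 1.0)          # clamp to the valid range
    return p_max - ratio * (p_max - p_min)
```

For example, with illustrative limits of 1.0 and 6.0 for a parameter and thresholds ranging from 0.0 to 10.0, `adapt_parameter(5.0, 0.0, 10.0, 1.0, 6.0)` lies halfway between the two limits.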

v. Real-time noise power spectrum estimation: a noise power spectrum estimation method based on constrained variance spectrum smoothing and minimum tracking is adopted.

vi. Speech enhancement system: the two adaptive subtraction parameters are obtained from the masking threshold;

The second step: two-dimensional enhancement of speech;

2.1 Two-dimensional noise erosion algorithm

The two-dimensional noise erosion algorithm for the speech spectrum is determined by the following process. First, a short-time Fourier transform is performed on the speech, and the frequency spectrum of each frame is calculated from the following formula:

(11)

where the symbols denote, respectively, the m-th frame of the speech signal and the frequency spectrum of the m-th frame; N is both the frame length and the number of short-time Fourier transform points; and the window function is a Hamming window. The power spectrum of each frame of the speech signal can be expressed as:

(12)

that is, it is defined as the speech spectrum (spectrogram) of the speech signal;
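The framing, Hamming windowing, and per-frame power spectrum described above can be sketched as follows. This is a minimal sketch; the function name `power_spectrogram` is hypothetical, the defaults of 256 and 128 follow the frame length and frame shift stated for the experiments, and the input is assumed to contain at least one full frame.

```python
import numpy as np

def power_spectrogram(signal, frame_len=256, frame_shift=128):
    """Return the power spectrum of every frame, shape (frames, frame_len // 2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    # Cut the signal into overlapping frames and apply the Hamming window to each.
    frames = np.stack([
        signal[m * frame_shift: m * frame_shift + frame_len] * window
        for m in range(n_frames)
    ])
    # N-point transform of each frame; rfft keeps the non-redundant half for real input.
    spectrum = np.fft.rfft(frames, n=frame_len, axis=1)
    return np.abs(spectrum) ** 2
```

One second of 16 kHz audio yields 124 frames of 129 bins each; a 1 kHz tone peaks in bin 16, because the bin spacing is 16000 / 256 = 62.5 Hz.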

The two-dimensional erosion of the speech spectrum is defined as:

(13)

where the symbols denote, respectively, the structuring element, the domain of the speech spectrum, and the domain of the structuring element; the translated coordinates must lie within the domain of the speech spectrum, and the offsets themselves must lie within the domain of the structuring element;

In view of the structure of the low-energy residual noise spectrum, the structuring element of the two-dimensional noise erosion algorithm is defined as follows:

(14)

2.2 Two-dimensional speech dilation algorithm

(15)

where the symbols denote, respectively, the structuring element, the domain of the spectrum being dilated, and the domain of the structuring element;

(16)
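The erosion and dilation steps of this section can be sketched with the standard grayscale morphology definitions (minimum of shifted values minus the structuring element, maximum of shifted values plus the structuring element). The function names and the flat 3x3 structuring element in the usage note are assumptions; the structuring elements actually defined by formulas (14) and (16) are not reproduced here.

```python
import numpy as np

def _windows(image, element, fill):
    """Yield (row, col, window) with the image padded so borders stay in-domain."""
    eh, ew = element.shape                      # odd sizes are assumed
    ph, pw = eh // 2, ew // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), constant_values=fill)
    for s in range(image.shape[0]):
        for t in range(image.shape[1]):
            yield s, t, padded[s:s + eh, t:t + ew]

def erode2d(image, element):
    """Grayscale erosion: min over offsets of f(s + x, t + y) - b(x, y)."""
    image = np.asarray(image, dtype=float)
    element = np.asarray(element, dtype=float)
    out = np.empty_like(image)
    # +inf padding means out-of-domain positions never win the minimum.
    for s, t, window in _windows(image, element, np.inf):
        out[s, t] = np.min(window - element)
    return out

def dilate2d(image, element):
    """Grayscale dilation: max over offsets of f(s - x, t - y) + b(x, y)."""
    image = np.asarray(image, dtype=float)
    element = np.asarray(element, dtype=float)
    out = np.empty_like(image)
    # -inf padding means out-of-domain positions never win the maximum.
    for s, t, window in _windows(image, element, -np.inf):
        out[s, t] = np.max(window[::-1, ::-1] + element)
    return out
```

With a flat element such as `np.zeros((3, 3))`, erosion removes an isolated time-frequency spike entirely, while a 3x3 patch of energy survives erosion at its center and is restored by the subsequent dilation. This mirrors the intent of suppressing randomly distributed residual noise while keeping continuously distributed speech structure.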

The third step: perceptual speech spectrum structure boundary (PSSB) parameters and endpoint detection algorithm

3.1 Perceptual speech spectrum structure boundary (PSSB) parameters

The invention uses the neighborhood model in formula (17) to approximate the gradient of the result of the two-dimensional speech enhancement;

(17)

One element of the model is the center point of this neighborhood, and the gradient at the center of the neighborhood can be expressed as:

(18)

where the two gradient components are determined by equation (19) and equation (20):

(19)

(20)

The resulting gradient describes the boundary information of the continuously distributed speech signal in the noisy speech spectrum.

(21)

where the left-hand side is the PSSB parameter of the m-th frame, and M is the total number of frames;
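A per-frame boundary parameter of this kind can be sketched as below. Several choices here are assumptions rather than the content of formulas (17) to (21): a Sobel-style 3x3 neighborhood operator, the gradient magnitude approximated by the sum of absolute components, summation over frequency within each frame, and normalization by the largest value over all M frames. The function name `pssb` is hypothetical.

```python
import numpy as np

def pssb(enhanced):
    """Per-frame boundary strength of an enhanced spectrogram (frames x bins)."""
    # Replicate the border so that every point has a full 3x3 neighborhood.
    z = np.pad(np.asarray(enhanced, dtype=float), 1, mode="edge")
    # Difference between the next and the previous frame (time direction).
    gx = ((z[2:, :-2] + 2 * z[2:, 1:-1] + z[2:, 2:])
          - (z[:-2, :-2] + 2 * z[:-2, 1:-1] + z[:-2, 2:]))
    # Difference between the upper and the lower frequency bin (frequency direction).
    gy = ((z[:-2, 2:] + 2 * z[1:-1, 2:] + z[2:, 2:])
          - (z[:-2, :-2] + 2 * z[1:-1, :-2] + z[2:, :-2]))
    grad = np.abs(gx) + np.abs(gy)              # approximate gradient magnitude
    per_frame = grad.sum(axis=1)                # accumulate boundaries within each frame
    peak = per_frame.max()
    return per_frame / peak if peak > 0 else per_frame
```

Frames whose neighborhoods contain no spectral structure score zero, and frames that touch the edges of a speech structure score high, so the output can be compared directly against small thresholds.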

3.2 Voice endpoint detection

A detection method that exploits the continuous distribution of speech is adopted, so that voiced segments and the unvoiced segments at the endpoints are treated differently. The specific endpoint detection method is as follows:

(1) First, detect speech segments in which the PSSB parameter is larger than a threshold a for m consecutive frames; each such segment is a detected voiced speech segment;

(2) Starting from such a segment, all adjoining frames in which the PSSB parameter remains greater than or equal to a threshold b are also defined as speech. The threshold b is small; in the experiment, values of b from 0.01 to 0.05 gave the better recognition results. In this way, unvoiced segments with smaller PSSB values can also be identified;
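The two-threshold procedure in steps (1) and (2) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `detect_speech` is hypothetical, `min_frames` stands for the frame count m, the threshold a is assumed to be at least as large as b, and segments are returned as (start, end) frame indices with the end exclusive.

```python
def detect_speech(pssb_values, a, b, min_frames):
    """Return a list of (start, end) speech segments, end exclusive."""
    n = len(pssb_values)
    segments = []
    i = 0
    while i < n:
        if pssb_values[i] <= a:
            i += 1
            continue
        # Step (1): measure the run of consecutive frames above threshold a.
        j = i
        while j < n and pssb_values[j] > a:
            j += 1
        if j - i < min_frames:
            i = j                      # too short to count as a voiced segment
            continue
        # Step (2): extend in both directions while frames stay >= threshold b.
        start, end = i, j
        while start > 0 and pssb_values[start - 1] >= b:
            start -= 1
        while end < n and pssb_values[end] >= b:
            end += 1
        segments.append((start, end))
        i = end
    return segments
```

In the sequence `[0.0, 0.02, 0.03, 0.5, 0.6, 0.7, 0.04, 0.0, 0.9, 0.0]` with a = 0.3, b = 0.01 and m = 3, frames 3 to 5 form the voiced core, the low-valued frames 1, 2 and 6 are attached to it as unvoiced edges, and the single high frame 8 is rejected because it does not persist for m frames.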

The experiments were conducted under different signal-to-noise ratio conditions. The input low signal-to-noise ratio speech is sampled at 16 kHz with 16-bit quantization; a Hamming window is used, with a frame length of 256 and a frame shift of 128. The speech was selected from the TIMIT speech database, and the white noise was taken from the NoiseX-92 noise database.

Claims

1. A speech endpoint detection algorithm using perceptual speech spectrum structure boundary parameters is characterized in that the algorithm comprises the following steps:

i. Bark-domain power spectrum

(1)

the Bark power spectrum is:

(2)

ii. Spread Bark-domain power spectrum

A spreading function is introduced; it is a matrix that satisfies the condition:

(3)

It is defined as follows:

(4)

where the variable in the formula represents the difference between the band numbers of the two bands;

(5)

iii. Calculation of the offset function of the masking energy and of the masking threshold

(6)

(7)

Comparison with the absolute threshold of hearing (the threshold in quiet):

(8)

iv. Spectral subtraction and adjustment of the subtraction parameters

The gain function employed by the spectral subtraction algorithm is as follows:

(9)

(10)

where the first pair of symbols are the minimum and maximum values of the masking threshold, and the two remaining pairs are the minimum and maximum values of the two subtraction parameters, respectively. When the masking threshold equals its minimum, each subtraction parameter takes its maximum value; when the masking threshold equals its maximum, each subtraction parameter takes its minimum value. The minimum and maximum of the masking threshold in the formula are obtained frame by frame. In the experiment, the values of each parameter are as follows:

vi. Speech enhancement system: the two adaptive subtraction parameters are obtained from the masking threshold;

The second step: two-dimensional enhancement of speech;

2.1 Two-dimensional noise erosion algorithm

(11)

(12)

that is, it is defined as the speech spectrum (spectrogram) of the speech signal;

The two-dimensional erosion of the speech spectrum is defined as:

(13)

(14)

2.2 Two-dimensional speech dilation algorithm

(15)

(16)

3.1 Perceptual speech spectrum structure boundary (PSSB) parameters

(17)

(18)

where the two gradient components are determined by equation (19) and equation (20):

(19)

(20)

(21)

where the left-hand side is the PSSB parameter of the m-th frame, and M is the total number of frames;

3.2 Voice endpoint detection

(2) Starting from such a segment, all adjoining frames in which the PSSB parameter remains greater than or equal to a threshold b are also defined as speech; the threshold b is small, and in the experiment values of b from 0.01 to 0.05 gave good recognition results, so that unvoiced segments with smaller PSSB values can also be identified;

2. The speech endpoint detection algorithm using perceptual speech spectral structure boundary parameters according to claim 1, wherein: the experiments are conducted under different signal-to-noise ratio conditions; the input low signal-to-noise ratio speech is sampled at 16 kHz with 16-bit quantization.

3. The speech endpoint detection algorithm using perceptual speech spectral structure boundary parameters according to claim 1, wherein: a Hamming window is used, with a frame length of 256 and a frame shift of 128.

4. The speech endpoint detection algorithm using perceptual speech spectral structure boundary parameters according to claim 1, wherein: the speech is selected from the TIMIT speech database and the white noise is taken from the NoiseX-92 noise database.