CN110610724A - Voice endpoint detection method and device based on non-uniform sub-band separation variance - Google Patents


Info

Publication number
CN110610724A
Authority
CN
China
Prior art keywords
sub
band
variance
voice
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910913537.XA
Other languages
Chinese (zh)
Inventor
黄翔东
曹璐
刘子楠
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201910913537.XA
Publication of CN110610724A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal
    • G10L2025/783: Detection of presence or absence of voice signals based on threshold decision

Abstract

The invention discloses a voice endpoint detection method and device based on non-uniform sub-band separation variance. The method comprises: computing the amplitude spectrum of each frame of the framed voice signal; converting the effective band of the voice signal to the Mel domain, uniformly dividing it into q sub-bands there, and converting the centre frequency, lower-limit frequency and upper-limit frequency of each sub-band back to actual frequency in Hz; expanding the amplitude spectrum by interpolation, computing the average spectral magnitude within each sub-band from the converted actual frequencies, taking the mean over the sub-bands, and from this computing the sub-band variance of each frame; and computing the mean variance of the noise from the leading non-speech segment, setting upper and lower thresholds from it, and deciding with the double thresholds to obtain the final voice endpoint detection result. The device comprises an analog-to-digital converter and a DSP chip. The method is efficient to implement and strongly robust.

Description

Voice endpoint detection method and device based on non-uniform sub-band separation variance
Technical Field
The invention relates to the technical field of digital signal processing, in particular to a voice endpoint detection method and device based on non-uniform subband separation variance, i.e. to determining the start point and end point of speech both in a quiet environment and in the presence of noise.
Background
Voice endpoint detection (Endpoint Detection), also known as voice activity detection (Voice Activity Detection), is commonly used at the front end of a speech processing system. Its aim is to separate the effective speech signal from other unwanted interference in data sampled under various environmental noises, laying a foundation for subsequent improvements in speech processing performance. In general, noise-robust features must be extracted from the samples to distinguish speech from non-speech and to determine the start point and end point of each speech segment; for the speech recognition and speech enhancement systems widely used today, endpoint detection accuracy is one of the key parameters of overall system performance[1].
Since endpoint detection of speech signals was first proposed at Bell Laboratories, the technology has matured over nearly half a century of development, and many excellent methods continue to emerge. They can be roughly divided into two categories, threshold-based and model-based. A threshold-based method extracts a feature in which speech differs from noise, compares it with a set threshold, and makes the final decision accordingly[2]. The main parameter classes are time-domain, frequency-domain and cepstral-domain, for example: energy, zero-crossing rate, cepstral coefficients, spectral distance and spectral entropy[3]. Compared with model-based methods these are simple to operate and easy to implement, but their detection accuracy is lower. Model-based methods are more complex: they usually transform the speech signal to another domain (such as the discrete cosine transform domain) and extract multi-dimensional features from it (such as Mel cepstra). They depend heavily on the trained model and on a high feature dimension, need a long transient-to-steady-state transition time to adapt to changes in noise and interference, and have high computational complexity, so they are unsuitable for real-time implementation (for example, a hearing aid detecting speech endpoints online in real time).
For a clean speech signal, both categories of method can locate the boundary points of speech very accurately. In practice, however, most speech signals sit in a background of more than one type of complex noise, and effectively distinguishing speech segments from noise segments becomes the first problem of endpoint detection. Specifically, for the threshold-decision approach, a threshold criterion must be set first; when the decision parameter of the signal exceeds the criterion it is judged to be speech, otherwise noise. The choice of characteristic parameter of the voice signal is therefore crucial, and a good detection method should satisfy the following properties:
1) Accuracy: the boundary points of the speech segments must be determined accurately; 2) stability: the detection algorithm must be robust, with strong noise immunity; 3) adaptivity: the decision criterion should be adaptive rather than a fixed threshold; 4) low computational complexity: the algorithm should have low computational intensity and a small amount of calculation, for convenient hardware implementation.
Reference to the literature
[1] Zhao Li. Speech Signal Processing [M]. 3rd ed. Beijing: China Machine Press, 2016.
[2] Hu Hang. Speech Signal Processing [M]. Harbin: Harbin Institute of Technology Press, 2000: 163-17.
[3] Sumin. Study of Speech Enhancement Technology and Related Technologies under Low Signal-to-Noise Ratio [D]. Nanjing University of Posts and Telecommunications, 2018.
[4] Marzinzik M, et al. Speech Pause Detection for Noise Spectrum Estimation by Tracking Power Envelope Dynamics. IEEE Transactions on Speech and Audio Processing, 2002, 10(2): 109-111.
[5] Study of Adaptive Voice Endpoint Detection Technology [D]. Beijing University of Posts and Telecommunications, 2008.
[6] Lejia Anna. Study of a Voice Endpoint Detection Method in Noisy Environments [D]. South China University of Technology, 2015.
[7] Ishizuka J, et al. Study of Noise Robust Voice Activity Detection Based on Periodic Component to Aperiodic Component Ratio. Proc. of SAPA, 2006, 06(9): 65-70.
[8] Li Zuipeng, Yao Yiyang. A New Method for Detecting the Start and End Points of a Voice Segment [J]. Telecommunication Engineering, 2003, 3: 68-70.
[9] Tanyer S G, Ozer H. Voice Activity Detection in Non-stationary Noise [J]. IEEE Transactions on Speech and Audio Processing, 2000, 8(4): 478-482.
[10] Study of an Optimized Voice Endpoint Detection Algorithm Based on Energy Features [J]. Acta Acustica, 2005, 24(2): 171-.
[11] Zhanhui. Digital Speech Processing and MATLAB Simulation [M]. Publishing House of Electronics Industry, 2016.
[12] Application of MATLAB in Speech Signal Analysis and Synthesis [M]. Beihang University Press, 2013.
Disclosure of Invention
The invention provides a voice endpoint detection method and device based on non-uniform sub-band separation variance, which adopts Mel-domain sub-band division, computes the variance across the sub-bands, and uses double thresholds for the final decision, as described in detail below:
a method for detecting a voice endpoint based on non-uniform subband separation variance, the method comprising:
calculating the amplitude spectrum of each frame of voice signals after framing;
converting the effective frequency band of the voice signal into a Mel domain, uniformly dividing the effective frequency band of the voice signal into q sub-bands on the Mel domain, and converting the center frequency, the lower limit frequency and the upper limit frequency of each sub-band into actual frequency with Hz as a unit;
expanding the amplitude spectrum through interpolation, calculating the average amplitude of the frequency spectrum in each sub-band by combining the converted actual frequency, solving the mean value of the sub-bands, and further calculating the variance of each frame sub-band;
and calculating the average variance value of the noise based on the variance of each frame of sub-band, further setting an upper limit threshold and a lower limit threshold, and judging by using double thresholds to obtain a final voice endpoint detection result.
The setting of the upper and lower thresholds and the double-threshold decision that produce the final voice endpoint detection result are specifically:
1) making a rough decision with an upper threshold selected on the envelope of the Mel-domain sub-band separation variance of the speech, where frames whose variance exceeds this threshold are certainly speech and the speech starting point lies outside the time point corresponding to the intersection of this threshold with the sub-band variance curve;
2) determining a lower threshold and finding, on either side, the two points where the sub-band variance envelope intersects the lower threshold; the segment formed by these two points is the final speech segment.
A voice endpoint detection device based on non-uniform sub-band separation variance comprises an analog-to-digital converter and a DSP chip,
the audio signal is input into a DSP chip in a parallel digital input mode after passing through an analog-to-digital converter;
the DSP chip, when executing a program, implements the method steps of claim 1.
The technical scheme provided by the invention has the beneficial effects that:
1. the variance over the Mel sub-bands of the speech frequency domain is computed, the average variance of the noise segment is computed from the leading non-speech segment, and different thresholds are selected for the decision according to the signal-to-noise ratio;
2. the detection result can be used at the front end of a speech separation or enhancement system, allowing the silent and speech segments to be processed differently;
3. the implementation is efficient and strongly robust.
Drawings
FIG. 1 is a schematic diagram of speech signal framing;
FIG. 2 is a graph of an energy envelope of a speech signal and a noise signal;
FIG. 3 is a diagram illustrating Mel-domain sub-band division of 7 sub-bands;
FIG. 4 is a schematic diagram of a decision process of the double threshold method;
FIG. 5 is a diagram of a frequency domain subband variance envelope;
FIG. 6 is a comparison of the results of different detection methods;
FIG. 7 is a diagram of a hardware implementation of the present invention;
fig. 8 is a flow chart of the DSP internal.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
The invention exploits the different attenuation characteristics of the logarithmic energy envelopes of speech signals and background noise: it computes the Mel-domain sub-band separation variance, sets the thresholds from the mean sub-band separation variance of the leading non-speech segment, and makes the final decision with the double-threshold method.
Example 1
The invention performs voice signal endpoint detection by the following steps, described in detail below:
step 1: at a predetermined sampling rate fsSampling an input voice signal to obtain sampling data x (n);
step 2: framing the sampled data x (n) to obtain the ith frame signal yi(m) is:
yi(m)=w(m)·x((i-1)·Ls+m),1≤m≤L,1≤i≤Fn (1)
wherein w (m) is a window function, LsFor frame shifting, L is the frame length, FnIs the number of frames.
Step 3: calculate the amplitude spectrum Yi(k) of yi(m):
Yi(k)=|Σm=1..L yi(m)·e^(-j2π(k-1)(m-1)/L)|, 1≤k≤L (2)
Take the front (L/2+1) spectral lines of Yi(k) to obtain the positive-frequency amplitudes of each frame signal, Yi={Yi(1),Yi(2),…,Yi(L/2+1)}.
Step 4: according to the formula Fmel=2595·log(1+f/700), convert the effective band of the voice signal (0 to fs/2 Hz) to the Mel domain, uniformly divide it into q sub-bands in the Mel domain, and convert the centre frequency, lower-limit frequency and upper-limit frequency of each sub-band back to actual frequency in Hz, obtaining fp,c, fp,l and fp,h, where p=1, 2, …, q.
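Step 4 can be sketched in code. The following is a minimal illustration, not the patent's implementation; the function name, the use of base-10 logarithms, and starting the band at 0 Hz are assumptions read from the formula Fmel=2595·log(1+f/700) and the text of step 4:

```python
import numpy as np

def mel_subband_edges(fs=8000, q=7):
    """Divide the effective band (0 .. fs/2 Hz) into q subbands that are
    uniform on the Mel scale, then map the edges back to Hz."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # q+1 edges, equally spaced in the Mel domain
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), q + 1)
    f_l = mel_to_hz(mel_edges[:-1])                          # lower limits fp,l
    f_h = mel_to_hz(mel_edges[1:])                           # upper limits fp,h
    f_c = mel_to_hz((mel_edges[:-1] + mel_edges[1:]) / 2.0)  # centres fp,c
    return f_l, f_c, f_h

f_l, f_c, f_h = mel_subband_edges(8000, 7)
```

Because the Mel-to-Hz mapping is convex, the resulting subbands widen toward high frequencies, matching the band-pass filter bank of gradually increasing bandwidth described later.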
Step 5: expand the spectral lines Yi obtained in step 3 by interpolation so that the spectral-line resolution reaches 1 Hz.
Step 6: calculate the average magnitude of the spectrum within each sub-band,
Ei(p)=(1/(fp,h-fp,l+1))·Σf=fp,l..fp,h Yi(f), p=1,2,…,q (3)
substitute it into the following equation to obtain the sub-band mean,
Mi=(1/q)·Σp=1..q Ei(p) (4)
and calculate the sub-band variance Di of each frame:
Di=(1/q)·Σp=1..q (Ei(p)-Mi)² (5)
Step 7: calculate the mean variance of the noise using the leading silence segment (Fni being the number of leading silence frames),
D̄=(1/Fni)·Σi=1..Fni Di (6)
and set the upper and lower thresholds from D̄.
Step 8: make the decision with the double-threshold method from the results of steps 6 and 7 to obtain the final result.
Example 2
The scheme of Example 1 is further described below with reference to specific calculation formulas and examples:
2.1 basic principles of Speech Signal processing
Speech is produced through the cooperation of a large number of organs and muscles of the human body, so the corresponding speech signal is highly complex, and at present no single model can describe every characteristic of human speech comprehensively. Speech signals are therefore usually treated in experiments as time-varying, non-stationary random signals. Because of this non-stationarity and time-variability, the speech signal is framed, i.e. it is approximated as stationary and time-invariant over a short interval, generally 10-30 ms, which is taken as one frame. There are two framing methods: one is contiguous framing, i.e. consecutive frames with no overlap; the other is overlapping framing, i.e. adjacent frames overlap. Overlapping framing is the more common choice, since it better accounts for the time-varying nature of the speech signal and makes the transition between frames smoother; the overlap is typically 10%-50% of a frame. Fig. 1 illustrates the framing principle. In addition, to limit spectral leakage, each frame of the speech signal is usually windowed[7]. Let the speech signal be x(n) and the window function be w(n); windowing multiplies the speech signal x(n) by a finite-length window w(n) to form the windowed speech xw(n):
xw(n)=x(n)·w(n) (7)
In speech signal processing several window functions are in common use, including the rectangular, Hamming and Hanning windows.
Based on the above analysis, the framed and windowed signal yi(m) is obtained from the original signal x(n) as follows. Let the original signal x(n) have length N, the frame length be L and the frame shift be Ls; then the overlap Lo between two adjacent frames is:
Lo=L-Ls (8)
The framed and windowed ith frame signal yi(m) is:
yi(m)=w(m)·x((i-1)·Ls+m), 1≤m≤L, 1≤i≤Fn (9)
and the number of frames is:
Fn=(N-Lo)/Ls (10)
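The framing relations (8)-(10) can be sketched as follows. The helper name and the Hamming-window default are illustrative choices, not taken from the patent:

```python
import numpy as np

def frame_signal(x, L=200, Ls=80, window=None):
    """Split x into overlapping frames of length L with frame shift Ls,
    applying a window to each frame (overlapping framing, eqs. (8)-(10))."""
    if window is None:
        window = np.hamming(L)
    N = len(x)
    Lo = L - Ls                  # overlap between adjacent frames, eq. (8)
    Fn = (N - Lo) // Ls          # number of complete frames, eq. (10)
    # each row is one windowed frame yi(m) = w(m) * x((i-1)*Ls + m), eq. (9)
    frames = np.stack([window * x[i * Ls:i * Ls + L] for i in range(Fn)])
    return frames

x = np.arange(1000, dtype=float)
frames = frame_signal(x)        # (N - Lo) // Ls = (1000 - 120) // 80 = 11 frames
```

With a 25 ms frame and 10 ms shift at 8 kHz, L=200 and Ls=80 give the 60% overlap used in the patent's running example.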
2.2 logarithmic energy envelope fluctuation characteristics of Speech and noise signals
Analyzing a speech signal begins with extracting characteristic parameters that reflect its essential properties, and effective processing builds on these parameters, so feature extraction is the foundation of speech signal processing. The fluctuation of the logarithmic energy envelope is an important feature for distinguishing the speech signal from the noise signal. It is computed as follows: let yi(m) be the signal obtained from the original signal after framing and windowing; the energy envelope of the ith frame is:
Yi=Σm=1..L yi(m)² (11)
Taking the logarithm of the envelope gives the logarithmic energy envelope:
Zi=10·log10 Yi (12)
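As a rough illustration of eqs. (11)-(12), the sketch below computes the per-frame log energy from a matrix of windowed frames; the small epsilon guarding against log(0) is an added assumption:

```python
import numpy as np

def log_energy_envelope(frames):
    """Per-frame short-time energy (eq. (11)) and its dB value (eq. (12))."""
    energy = np.sum(frames ** 2, axis=1)       # Yi = sum_m yi(m)^2
    return 10.0 * np.log10(energy + 1e-12)     # Zi = 10*log10(Yi); eps avoids log(0)

# two synthetic frames: a quiet one and a loud one of the same shape
base = np.sin(np.linspace(0, 40, 200))[None, :]
frames = base * np.array([[0.01], [1.0]])
Z = log_energy_envelope(frames)
```

Scaling the amplitude by 100 raises the energy by a factor of 10^4, so the two envelope values differ by 40 dB, mirroring the large dynamic range of speech versus the near-flat envelope of stationary noise in fig. 2.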
Fig. 2 compares the logarithmic energy envelopes of speech signals and of several types of background noise (babble, F16, white and Leopard noise). The energy envelope of the speech signals clearly fluctuates over a large dynamic range, up to about 50 dB, whereas the envelope of the background noise is nearly stationary, with a maximum fluctuation range below 10 dB. The energy-envelope fluctuation of a signal can therefore serve as one basis for a voice endpoint detection algorithm.
2.3 endpoint detection method based on sub-band variance
From the above analysis, the logarithmic energy envelope of a speech signal fluctuates over a wider range than that of a noise signal; in real scenes, however, the variety of noise types and intensities means the accuracy of a method based directly on this envelope needs improvement. Alternatively, the severity of the fluctuation can be expressed by the magnitude of a variance[6][7]. Time-domain features of the speech signal are usually not very robust and are easily buried in strong background noise. The invention therefore converts time-domain analysis into frequency-domain analysis: the energy of a speech signal varies considerably across frequency bands, with large peaks at the formants[8][9] and relatively little energy elsewhere. The fluctuation property of the time domain is thus converted into a frequency-domain variance, yielding an improved endpoint detection method based on the Mel-domain sub-band variance.
According to the auditory mechanism of the human ear, the basilar membrane acts much like a spectrum analyzer, and its sensitivity differs for signals of different frequencies. Speech components between 200 Hz and 4 kHz contribute most to intelligibility; low tones easily mask high ones, the reverse is more difficult, and the critical bandwidth of masking is smaller at low frequencies than at the high-frequency end. The Mel domain models this nonlinear characteristic of the ear well and corresponds to a bank of band-pass filters of gradually increasing bandwidth. Fig. 3 shows the sub-bands obtained by Mel-domain division with 7 sub-bands; the conversion between the Mel domain and actual frequency in Hz is[10]:
Fmel=2595·log(1+f/700) (13)
Following this idea of Mel-domain division, uniform sub-band division in the Mel domain can be carried out as in step 4 of Example 1. The Mel scale reflects the physiological and auditory characteristics of the human ear, with energies superimposed within the same critical band; the method therefore computes the variance of each sub-band on the basis of this Mel-domain band division and performs endpoint detection from that variance.
The invention extends the spectral lines by interpolation; the purpose is to compute the sub-band variance more accurately. Take a sampling rate of 8000 Hz as an example: with a frame length of 200 there are 101 positive-frequency spectral lines and the frequency resolution is 40 Hz, so lines 1 to 4 correspond to 0 Hz, 40 Hz, 80 Hz and 120 Hz. The first sub-band of the 7-channel Mel division spans 20-100 Hz, so it receives only 2 to 3 spectral lines, and computing a variance from 2 lines introduces a large error. Spectral-line expansion is therefore needed to raise the frequency resolution to 1 Hz, after which the first sub-band contains the 81 lines from 20 to 100 Hz; computing the variance from these 81 lines is far more accurate than from only 2.
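The interpolation-plus-variance computation described here can be sketched for a single frame as below. The linear interpolation (`np.interp`) and all helper names are assumptions, since the patent does not specify the interpolation kernel:

```python
import numpy as np

def subband_variance(mag, fs, f_l, f_h):
    """Interpolate a one-frame positive-frequency magnitude spectrum to 1 Hz
    resolution, then compute the per-subband mean amplitudes, their mean,
    and the variance across subbands (cf. eqs. (3)-(5))."""
    L2 = len(mag) - 1                                   # mag has L/2+1 lines
    coarse_f = np.arange(len(mag)) * (fs / (2 * L2))    # original line frequencies
    fine_f = np.arange(0, fs // 2 + 1)                  # 1 Hz grid
    fine_mag = np.interp(fine_f, coarse_f, mag)         # spectral-line expansion
    band_avg = np.array([fine_mag[int(lo):int(hi) + 1].mean()
                         for lo, hi in zip(f_l, f_h)])  # Ei(p)
    return np.var(band_avg)                             # Di = mean((Ei(p) - Mi)^2)

# one frame of a 440 Hz tone at 8 kHz; illustrative subband edges in Hz
mag = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * np.arange(200) / 8000)))
D = subband_variance(mag, 8000, [20, 100, 300], [100, 300, 4000])
```

A tonal frame concentrates its energy in one subband, so D comes out large; a flat (noise-like) spectrum gives near-identical subband means and a variance near zero, which is exactly the contrast the decision stage exploits.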
Preprocessing, sub-band division and spectral line interpolation of the voice signals are completed through the steps, the spectral lines of the sub-bands for variance calculation are obtained, and variance calculation can be completed according to the formulas (3) to (5).
As can be seen from equation (5), the sub-band variance carries two pieces of information: first, it reflects the fluctuation level across the frequency bands of the frame; second, it indicates the short-time energy of the frame signal. The larger the energy and the stronger the fluctuation, the larger Di, which is characteristic of speech; conversely, for noise, the smaller the energy and the flatter the spectrum, the smaller Di.
2.4 decision threshold selection and detection Process
From the above analysis, the invention performs endpoint detection according to the degree of fluctuation of the signal, introducing the idea of sub-band separation into the band variance. Since the strength of the fluctuation has no sharp boundary in an actual decision, different thresholds yield different detection results, so proper threshold selection is crucial. On the other hand, because of the complexity of the human pronunciation process and individual differences in vocal tract and speaking rate, the endpoints obtained by any voice endpoint detection algorithm cannot be perfectly accurate[10]; humans, however, are insensitive to small timing errors, and endpoint errors within a few tens of milliseconds are acceptable. The invention adopts a double-threshold decision method: 1) a higher threshold T1 (the dashed horizontal line in fig. 4) is selected on the envelope of the Mel-domain sub-band separation variance for a rough decision; anything above this threshold must be speech (the CD segment is certainly speech), and the speech starting point lies outside the time point where this threshold intersects the sub-band variance curve (to the left of point C in the figure);
2) a lower threshold T2 (the solid horizontal line in fig. 4) is then determined; searching left from point C and right from point D finds the two points B and E where the sub-band variance envelope crosses the low threshold, so segment BE is the speech segment determined by the double-threshold method from the sub-band variance.
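The two-pass decision described in 1)-2) can be sketched as follows; the variable names and the handling of multiple speech segments are illustrative assumptions on top of the text:

```python
import numpy as np

def double_threshold(D, T1, T2):
    """Double-threshold decision on a subband-variance envelope D:
    frames above the high threshold T1 are certainly speech; each such run
    is extended outward to where D falls below the low threshold T2,
    yielding (start, end) frame indices (points B and E in fig. 4)."""
    segments = []
    above = np.flatnonzero(D > T1)          # frames certainly containing speech
    for c in above:
        if segments and c <= segments[-1][1]:
            continue                        # already inside a found segment
        b = c
        while b > 0 and D[b - 1] > T2:      # search left from C toward B
            b -= 1
        e = c
        while e < len(D) - 1 and D[e + 1] > T2:   # search right from D toward E
            e += 1
        segments.append((b, e))
    return segments

D = np.array([0.1, 0.2, 0.8, 2.0, 2.5, 1.5, 0.6, 0.2, 0.1])
segs = double_threshold(D, T1=1.0, T2=0.5)   # one segment, frames 2..6
```

Frames 3-5 exceed T1, and extending outward past the 0.8 and 0.6 values (still above T2) gives the single segment (2, 6), i.e. the BE segment of the figure.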
The invention computes the average variance of the noise segment from the leading non-speech segment and determines the thresholds according to formula (6); detection can then be completed by the double-threshold method.
Example 3
To evaluate the performance of the proposed algorithm, simulation experiments were performed[11][12][13]. The experimental speech data come from the TIMIT corpus, which contains the voices of 1680 speakers across 8 dialects; the signals are sampled at 16 kHz, single channel, with 16-bit samples. Typical noise was taken from the NOISEX-92 database: white Gaussian noise, pink noise and F16 noise. In the data segmentation, each sub-segment is 256 samples long (so each segment covers 256/16000 = 0.016 s = 16 ms), and adjacent sub-segments overlap by 50% (so the observation step is reduced to 8 ms). The experiment first obtains the Mel sub-band variance envelopes of the clean and noisy speech; fig. 5 shows the sub-band variance envelope obtained at SNR = 10 dB. Even under the added noise the frequency-domain sub-band variance tracks the change of the speech signal well: where the signal fluctuation range is large, i.e. in the speech segments, the sub-band variance fluctuates strongly.
For further evaluation, the proposed method, the conventional energy-and-zero-crossing-rate method, and the autocorrelation-function method were each run at an SNR of 20 dB. The detection results are shown in fig. 6, where (a), (b) and (c) are the results of the energy-and-zero-crossing-rate method, the autocorrelation-function method and the proposed method respectively, with the missed points of each method marked. Note that, because of individual differences, speaking rates differ and there are pauses between words, so errors within a few tens of milliseconds are allowed. The figure shows that the proposed method misses fewer points than the other two methods. Taking the first speech onset as an example (within 0.2-0.5 s), the conventional energy-and-zero-crossing and autocorrelation methods show obvious errors at the speech end point, i.e. the detected speech tail is lengthened, while the start and end points detected by the proposed method are clearly more accurate.
Example 4
The hardware implementation of the invention is shown in fig. 7: the audio signal x(t) is analog-to-digital converted at a 16 kHz sampling rate to obtain x(n), which enters the DSP chip as parallel digital input; the result produced by the DSP's internal algorithm is then passed to a downstream processing system.
The DSP (Digital Signal Processor) is the core device and performs the following main functions in the signal processing: 1) frame and window the voice signal, compute the frequency-domain sub-band variance of the signal with the core algorithm, determine the thresholds from the actual signal, and make the detection decision; 2) send the result to a downstream processor. The internal program flow of the DSP device is shown in fig. 8; the core detection algorithm of the voice endpoint detection method provided by the invention is embedded in the DSP device, which completes efficient and accurate voice signal endpoint detection on that basis.
The flow of fig. 8 is divided into the following steps:
1) frame and window the sampled voice signal to obtain yi(m);
2) apply the discrete Fourier transform to yi(m) to obtain Yi(k);
3) compute the frequency-domain Mel sub-band variance Di of each frame and the mean sub-band variance of the noise segment, and set the thresholds;
4) apply the double-threshold decision method to find the speech start and end points;
5) save the result and send it to the next processor.
Except where the type of a device is specifically stated, the embodiment of the invention does not limit the device types; any device capable of performing the above functions may be used.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A speech endpoint detection method based on non-uniform subband separation variance is characterized by comprising the following steps:
calculating the amplitude spectrum of each frame signal after framing;
converting the effective frequency band of each frame of the speech signal into the Mel domain, uniformly dividing it into q sub-bands on the Mel axis, and converting the center frequency, lower-limit frequency and upper-limit frequency of each sub-band back into actual frequencies in Hz;
expanding the amplitude spectrum by interpolation, calculating the average spectral amplitude within each sub-band using the converted actual frequencies, taking the mean over the sub-bands, and then calculating the sub-band variance of each frame;
and calculating the mean noise variance from the per-frame sub-band variances, setting upper-limit and lower-limit thresholds accordingly, and applying a double-threshold decision to obtain the final voice endpoint detection result.
2. The voice endpoint detection method based on non-uniform sub-band separation variance as claimed in claim 1, wherein setting the upper-limit and lower-limit thresholds and applying the double-threshold decision to obtain the final voice endpoint detection result specifically comprises:
1) selecting an upper-limit threshold on the envelope of the Mel-domain sub-band separation variance of the speech and performing a coarse decision: the portion above the upper-limit threshold is speech, and the true speech start point lies outside the time point corresponding to the intersection of this threshold with the sub-band variance curve;
2) determining a lower-limit threshold and finding the two points at which the sub-band variance envelope intersects it; the segment between these two points is the final speech segment.
3. A voice endpoint detection device based on non-uniform sub-band separation variance, characterized in that the device comprises an analog-to-digital converter and a DSP chip,
wherein the audio signal, after passing through the analog-to-digital converter, is fed into the DSP chip as parallel digital input;
the DSP chip, when executing a program, implements the method steps of claim 1.
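A minimal sketch of the double-threshold decision described in claims 1-2. The threshold scale factors `k_high` and `k_low` and the number of leading noise frames are hypothetical choices for illustration; the patent derives its thresholds from the noise-segment mean sub-band variance but does not publish specific factors:

```python
import numpy as np

def double_threshold_endpoints(d, noise_frames=10, k_high=3.0, k_low=1.5):
    """Double-threshold decision on the per-frame sub-band variance track d.

    Thresholds are scaled from the mean variance of the leading noise
    frames (the scale factors are illustrative assumptions)."""
    d = np.asarray(d, dtype=float)
    d_noise = d[:noise_frames].mean()                 # noise-segment mean variance
    th_high, th_low = k_high * d_noise, k_low * d_noise
    above = np.where(d > th_high)[0]                  # coarse pass: upper threshold
    if above.size == 0:
        return None                                   # no speech detected
    start, end = above[0], above[-1]
    # Refine outward: extend the segment while frames stay above the lower threshold
    while start > 0 and d[start - 1] > th_low:
        start -= 1
    while end < len(d) - 1 and d[end + 1] > th_low:
        end += 1
    return start, end
```

For a track of ten noise frames followed by three high-variance frames, the coarse pass finds the high-variance run and the lower threshold then extends it outward to the final endpoints.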
CN201910913537.XA 2019-09-25 2019-09-25 Voice endpoint detection method and device based on non-uniform sub-band separation variance Pending CN110610724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910913537.XA CN110610724A (en) 2019-09-25 2019-09-25 Voice endpoint detection method and device based on non-uniform sub-band separation variance

Publications (1)

Publication Number Publication Date
CN110610724A true CN110610724A (en) 2019-12-24

Family

ID=68893552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910913537.XA Pending CN110610724A (en) 2019-09-25 2019-09-25 Voice endpoint detection method and device based on non-uniform sub-band separation variance

Country Status (1)

Country Link
CN (1) CN110610724A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151055A (en) * 2020-09-25 2020-12-29 北京猿力未来科技有限公司 Audio processing method and device
CN112151055B (en) * 2020-09-25 2024-04-30 北京猿力未来科技有限公司 Audio processing method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991998A (en) * 2017-04-19 2017-07-28 重庆邮电大学 The detection method of sound end under noise circumstance

Similar Documents

Publication Publication Date Title
CN107610715B (en) Similarity calculation method based on multiple sound characteristics
Yegnanarayana et al. Enhancement of reverberant speech using LP residual signal
US5054085A (en) Preprocessing system for speech recognition
US6691090B1 (en) Speech recognition system including dimensionality reduction of baseband frequency signals
US9454976B2 (en) Efficient discrimination of voiced and unvoiced sounds
Gu et al. Perceptual harmonic cepstral coefficients for speech recognition in noisy environment
CN109256127B (en) Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
EP2083417B1 (en) Sound processing device and program
Kim et al. Nonlinear enhancement of onset for robust speech recognition.
Itoh et al. Environmental noise reduction based on speech/non-speech identification for hearing aids
CN109102823B (en) Speech enhancement method based on subband spectral entropy
Hsu et al. Voice activity detection based on frequency modulation of harmonics
EP0248593A1 (en) Preprocessing system for speech recognition
CN110197657B (en) Dynamic sound feature extraction method based on cosine similarity
Tüske et al. Robust voice activity detection based on the entropy of noise-suppressed spectrum
CN116312561A (en) Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system
CN112863517B (en) Speech recognition method based on perceptual spectrum convergence rate
CN110610724A (en) Voice endpoint detection method and device based on non-uniform sub-band separation variance
Tiwari et al. Speech enhancement using noise estimation with dynamic quantile tracking
CN114566179A (en) Time delay controllable voice noise reduction method
JPH0449952B2 (en)
CN113948088A (en) Voice recognition method and device based on waveform simulation
Gowda et al. AM-FM based filter bank analysis for estimation of spectro-temporal envelopes and its application for speaker recognition in noisy reverberant environments.
CN110634473A (en) Voice digital recognition method based on MFCC
Zhang et al. Fundamental frequency estimation combining air-conducted speech with bone-conducted speech in noisy environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191224