CN103646649B - An efficient speech detection method - Google Patents

An efficient speech detection method

Info

Publication number
CN103646649B
CN103646649B · CN201310743203.5A
Authority
CN
China
Prior art keywords
audio
speech
frame
subband
sound signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310743203.5A
Other languages
Chinese (zh)
Other versions
CN103646649A (en)
Inventor
Tao Jianhua (陶建华)
Liu Bin (刘斌)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Extreme Element Hangzhou Intelligent Technology Co Ltd
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201310743203.5A priority Critical patent/CN103646649B/en
Publication of CN103646649A publication Critical patent/CN103646649A/en
Application granted granted Critical
Publication of CN103646649B publication Critical patent/CN103646649B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a speech detection method comprising the following steps: in the time domain, analyzing the short-time energy and short-time zero-crossing rate of the original audio and rejecting part of the non-speech signal; in the frequency domain, analyzing the spectral-envelope characteristics and entropy characteristics of the subbands of the remaining audio signal and further rejecting part of the non-speech signal; grouping consecutive frames with similar features among the retained audio frames into audio segments; computing the mean of the mel-cepstral coefficients of the frames in each segment, inputting it into a speech Gaussian mixture model and various non-speech Gaussian mixture models, and making a segment-level decision on whether the segment contains speech according to the output probability of each model, thereby obtaining the final speech detection result. The invention can detect speech signals in an audio data stream under various complex environments and can locate the boundaries between speech and non-speech segments relatively accurately.

Description

An efficient speech detection method
Technical field
The present invention relates to the field of intelligent information processing, and in particular to an efficient speech detection method.
Background art
Speech is one of the main means by which humans exchange information, and speech detection has always occupied an important position in the field of speech signal processing. A speech detection system serves as a preprocessing module for speech recognition, speaker recognition, speech coding and the like, and its robustness directly affects the performance of those downstream speech processing modules. How to locate speech segments accurately and efficiently amid the random noise of various complex environments, and to distinguish speech from non-speech signals effectively, has become a research hotspot at home and abroad and is attracting wide attention. Speech detection has great practical value, and high-quality robust speech detection techniques are widely used in communication systems, multimedia systems, speech recognition systems and voiceprint recognition systems.
Mainstream speech detection methods fall into parameter-based methods and model-based methods. Parameter-based methods analyze the speech signal at the signal level: speech parameters are computed in the time domain, frequency domain or another transform domain, and reasonable thresholds are set to test whether the audio stream contains speech; commonly used parameters include short-time energy, short-time zero-crossing rate, the energy proportion of each frequency band, harmonic components and so on. Model-based methods train models on large-scale speech data and distinguish speech from various non-speech signals accurately with intelligent mathematical models; common examples are methods based on Gaussian mixture models, artificial neural networks and hidden Markov models. Model-based methods require labeled large-scale data to train a reliable detection model and are therefore supervised; parameter-based methods need no trained mathematical model and are unsupervised. Current mainstream methods can detect speech quickly and accurately in quiet environments, and achieve high accuracy under stationary noise and under non-stationary noise with a high signal-to-noise ratio; under the various non-stationary random noises of complex environments, however, their performance degrades severely.
Summary of the invention
To solve one or more of the above problems, the invention provides an efficient speech detection method that can detect speech signals in an audio stream quickly and accurately under various complex environments and can locate the boundaries between speech and non-speech segments relatively accurately.
A kind of speech detection method provided by the invention comprises the following steps:
Step S10: obtain the original audio, analyze its short-time energy and short-time zero-crossing rate in the time domain, and use these two parameters to reject part of the non-speech signal in the original audio;
Step S20: for the audio signal retained by step S10, analyze the spectral-envelope characteristics and entropy characteristics of its subbands in the frequency domain, and further reject part of the non-speech signal;
Step S30: for the retained frames awaiting screening, group consecutive frames with similar features into audio segments;
Step S40: for each audio segment awaiting screening, use Gaussian mixture models to make a segment-level decision on whether the segment contains speech, and obtain the final speech detection result.
As can be seen from the above technical scheme, the invention provides an efficient and robust speech detection method with the following beneficial effects:
(1) The method can serve as the front-end module of various speech recognition systems, accurately rejecting the non-speech data in the audio stream to be recognized and thereby improving the efficiency and robustness of the recognition system;
(2) The method can serve as the front-end module of various speech coding systems, accurately locating the boundaries between speech and non-speech segments so that the coder transmits only the speech segments, improving communication efficiency;
(3) The method can detect speech data quickly and accurately under various stationary and non-stationary random noise environments, and can effectively distinguish speech from various non-speech signals without being restricted by speaker, environment or language.
Brief description of the drawings
Fig. 1 is a flowchart of a speech detection method according to an embodiment of the invention;
Fig. 2 is a flowchart of the time-domain analysis part of the speech detection method according to an embodiment of the invention;
Fig. 3 is a flowchart of the frequency-domain analysis part of the speech detection method according to an embodiment of the invention;
Fig. 4 is a flowchart of the audio-frame clustering part of the speech detection method according to an embodiment of the invention;
Fig. 5 is a flowchart of the segment-level decision by Gaussian mixture models in the speech detection method according to an embodiment of the invention;
Fig. 6 is a flowchart of the offline training process of the Gaussian mixture models in the speech detection method according to an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that, in the drawings and the description, similar or identical parts use the same reference numerals. Implementations not shown or described in the drawings are of forms known to those of ordinary skill in the art. In addition, although examples of parameters with particular values may be provided herein, the parameters need not exactly equal those values; they may approximate them within acceptable error margins or design constraints.
The invention proposes an efficient speech detection mechanism that performs two-stage speech detection on an audio stream. First, the original audio is divided by time-domain and frequency-domain features into non-speech data and data awaiting screening; then the data awaiting screening are segmented by spectrogram features, and speech detection is carried out segment by segment with a Gaussian mixture model of speech data and Gaussian mixture models of non-speech data.
In general, the speech detection method comprises a time-domain analysis step, a frequency-domain analysis step, an audio clustering step and a segment-level decision step. Fig. 1 is a flowchart of the speech detection method according to an embodiment of the invention; as shown in Fig. 1, the method comprises the following steps:
Step S10: obtain the original audio, analyze its short-time energy and short-time zero-crossing rate in the time domain, and use these two parameters to reject part of the non-speech signal in the original audio;
Short-time energy is effective for detecting voiced sounds, and short-time zero-crossing rate is effective for detecting unvoiced sounds; combining the two parameters allows part of the non-speech signal to be rejected effectively.
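As an illustration of these two time-domain parameters, they can be computed per frame as below. This is a minimal NumPy sketch; the frame length, hop size and test signals are arbitrary choices for demonstration, not values prescribed by the patent:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into equally spaced frames (step S11)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    """Sum of squared samples per frame; high for voiced speech."""
    return np.sum(frames.astype(float) ** 2, axis=1)

def short_time_zcr(frames):
    """Fraction of sign changes per frame; high for unvoiced speech."""
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

t = np.arange(4096) / 8000.0
voiced = 0.8 * np.sin(2 * np.pi * 200 * t)   # loud low-frequency tone
silence = np.zeros(4096)

e_v = short_time_energy(frame_signal(voiced))
e_s = short_time_energy(frame_signal(silence))
```

Energy cleanly separates the loud tone from silence, while the zero-crossing rate would separate noise-like unvoiced sounds from low-frequency voiced ones.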
Fig. 2 is a flowchart of the time-domain analysis part of the speech detection method according to an embodiment of the invention; as shown in Fig. 2, step S10 further comprises the following steps:
Step S11: divide the original audio into frames at equal intervals and compute the short-time energy and short-time zero-crossing rate of each frame;
Step S12: compare the short-time energy and short-time zero-crossing rate of each frame against preset low and high thresholds, classify each frame as silence, transition or speech according to the comparison, remove the silence-segment and transition-segment signal from the original audio, and retain only the speech-segment signal.
The comparison of each frame's short-time energy and short-time zero-crossing rate against the preset low and high thresholds, and the resulting classification into silence, transition and speech segments, proceeds as follows: if the short-time energy or the short-time zero-crossing rate exceeds the low threshold, entry into a transition segment is marked; within a transition segment, if both parameters fall back below the low threshold, a silence segment is entered; within a transition segment, if either parameter exceeds the high threshold, a speech segment is considered to have begun; within a speech segment, if both parameters drop below the low threshold for longer than a predetermined threshold, the speech segment is considered to have ended.
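The double-threshold logic above amounts to a small state machine. The sketch below illustrates it in plain Python; the threshold values, the use of one shared low/high pair for both parameters, and the hangover length `min_end` are illustrative assumptions rather than values from the patent:

```python
def classify_frames(energy, zcr, lo=1.0, hi=5.0, min_end=3):
    """Label each frame 'silence', 'transition' or 'speech' with the
    double-threshold rules: the low threshold enters a transition
    segment, the high threshold confirms speech, and a speech segment
    ends only after both parameters stay below the low threshold for
    min_end consecutive frames."""
    labels, state, below = [], "silence", 0
    for e, z in zip(energy, zcr):
        if state == "silence":
            if e > lo or z > lo:
                state = "transition"
        elif state == "transition":
            if e > hi or z > hi:
                state = "speech"
            elif e < lo and z < lo:
                state = "silence"
        elif state == "speech":
            if e < lo and z < lo:
                below += 1
                if below >= min_end:
                    state, below = "silence", 0
            else:
                below = 0
        labels.append(state)
    return labels

labels = classify_frames([0, 2, 8, 8, 8, 0.5, 0.5, 0.5, 0], [0] * 9)
```

On this toy sequence the frames ramp through transition into speech, and the trailing low-energy frames end the speech segment only after the hangover expires.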
Step S20: for the audio signal retained by step S10, analyze the spectral-envelope characteristics and entropy characteristics of its subbands in the frequency domain, and further reject part of the non-speech signal;
Analyzing the subband spectral-envelope characteristics in the frequency domain comprises the following steps:
First, divide the audio signal into several subbands;
Then, apply band-pass filtering within the frequency range of each subband to obtain the audio signal of each subband;
Next, apply the Hilbert transform to each subband signal to obtain the spectral envelope of each subband;
Finally, analyze the statistical properties of the envelope signals of the subband containing obvious formant characteristics and of the subband containing more noise components.
The statistical properties of the envelope signal comprise the mean and variance of the envelope; the specific features computed are: (1) the envelope variance of the subband containing obvious formant characteristics; (2) the difference between the envelope means of the subband containing obvious formant characteristics and the subband containing more noise components.
Analyzing the subband entropy characteristics in the frequency domain comprises the following steps:
First, in long-span mode, compute the entropy at each frequency of the current frame using the current frame and several adjacent frames;
Then, compute the mean and variance of the entropy within a particular subband range to determine the complexity of the current speech frame.
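A minimal sketch of the long-span entropy, assuming the common formulation in which the magnitudes of one frequency bin over 2K+1 neighbouring frames are normalized into a probability distribution whose entropy is then taken; the function names and window size are ours:

```python
import numpy as np

def bin_entropy(mags):
    """Entropy of one frequency bin across neighbouring frames.
    mags: magnitudes of the bin in up to 2K+1 consecutive frames."""
    p = mags / (mags.sum() + 1e-12)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def longspan_entropy(frames_mag, k=2):
    """frames_mag: (n_frames, n_bins) magnitude spectra.
    Returns (n_frames, n_bins) entropies over a 2k+1-frame window."""
    n, b = frames_mag.shape
    out = np.zeros((n, b))
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        for j in range(b):
            out[i, j] = bin_entropy(frames_mag[lo:hi, j])
    return out

# A stationary bin (constant magnitude) has a near-uniform distribution
# across frames, hence entropy close to log(2k+1); a bursty bin whose
# energy is concentrated in one frame has entropy near zero.
mags = np.ones((9, 4))
H = longspan_entropy(mags, k=2)
```

The mean and variance of `H` within a chosen subband then serve as the frame-complexity measures described above.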
Fusing the subband spectral-envelope characteristics obtained in short-span mode with the subband entropy characteristics obtained in long-span mode rejects further non-speech signal, specifically:
For each frame of the speech signal, the subband spectral-envelope characteristics and the subband entropy characteristics are used to perform frequency-domain analysis under various complex background noises, classifying the frame as speech or non-speech and thereby rejecting further non-speech signal.
Fig. 3 is a flowchart of the frequency-domain analysis part of the speech detection method according to an embodiment of the invention; as shown in Fig. 3, the step of further rejecting part of the non-speech signal according to the subband spectral-envelope characteristics and subband entropy characteristics comprises the following steps:
Step S21: for each frame of the speech signal, first apply high-pass filtering to remove power-line interference (in an embodiment of the invention, a 4th-order Chebyshev high-pass filter is used), then apply windowing to the filtered signal (in an embodiment of the invention, a Hamming window is used);
Step S22: divide the windowed signal into N frequency bands; in an embodiment of the invention, the signal is divided into five bands, 0-500 Hz, 500-1000 Hz, 1000-2000 Hz, 2000-3000 Hz and 3000-4000 Hz; apply band-pass filtering within these ranges to obtain the signals of the N subbands (in an embodiment of the invention, 6th-order Butterworth band-pass filters are used);
Step S23: apply the Hilbert transform to the signal of each subband to obtain the corresponding spectral-envelope signal;
For voiced signals, the envelope of the 500-1000 Hz band contains obvious formant characteristics, while in noisy environments the envelope of the 3000-4000 Hz band contains more noise components; in an embodiment of the invention, the Hilbert transform is therefore applied only to the 500-1000 Hz and 3000-4000 Hz bands.
Step S24: analyze the statistical characteristics of the envelope signals obtained in step S23, computing their mean and variance within the respective subband ranges, and derive the envelope decision output;
Let μ1 denote the mean of the 500-1000 Hz subband envelope, μ2 the mean of the 3000-4000 Hz subband envelope, and σ1 and σ2 the variances of the two subband envelopes respectively. The envelope decision output VAD_envelope is formed from the features above, namely the formant-subband envelope variance and the difference of the two subband envelope means:
VAD_envelope = σ1 + (μ1 - μ2)
In this way the analysis of the subband spectral envelopes yields the decision output VAD_envelope.
Step S25: compute the Fourier magnitude spectrum of the current frame and several adjacent frames to obtain the Fourier amplitude of each frequency bin in each frame; for each frequency bin, use the adjacent frames to compute the entropy of the current frame at that bin; within the subband containing obvious formant characteristics (in an embodiment of the invention, the 500-1000 Hz band), compute the variance of the entropies over the frequency bins and output it as the long-span decision VAD_entropy;
Step S26: fuse the two decision outputs obtained in steps S24 and S25 into a combined decision to obtain the final frequency-domain decision VAD_freq, expressed with fusion weights λ1 and λ2 as:
VAD_freq = λ1·VAD_envelope + λ2·VAD_entropy
If the frequency-domain decision VAD_freq is higher than a threshold, the frame is labeled as a speech frame; if VAD_freq is lower than the threshold, the frame is labeled as a non-speech frame. In addition, the data labeled as speech frames are extended: the starting frame of each speech segment is expanded forward by 3 frames, and its ending frame is expanded backward by 3 frames.
The audio signal processed in this way has a further portion of the non-speech signal removed.
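The fusion, thresholding and three-frame extension can be sketched as below; the fusion weights and the threshold are illustrative values, not values given by the patent:

```python
import numpy as np

def frequency_domain_vad(vad_env, vad_ent, lam1=0.5, lam2=0.5, thresh=0.5):
    """Fuse the envelope and entropy decisions per frame and threshold."""
    vad_freq = lam1 * np.asarray(vad_env) + lam2 * np.asarray(vad_ent)
    return vad_freq > thresh

def expand_speech(mask, pad=3):
    """Expand each run of speech frames by `pad` frames on both sides."""
    mask = np.asarray(mask, dtype=bool)
    out = mask.copy()
    for shift in range(1, pad + 1):
        out[:-shift] |= mask[shift:]   # extend each start forward
        out[shift:] |= mask[:-shift]   # extend each end backward
    return out

mask = frequency_domain_vad([0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
                            [0, 0, 1, 1, 0, 0, 0, 0, 0, 0])
expanded = expand_speech(mask, pad=3)
```

A two-frame speech run at indices 2-3 becomes a seven-frame run covering indices 0-6 after the three-frame expansion on each side.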
Step S30: for the retained frames awaiting screening, group consecutive frames with similar features into audio segments; subsequent speech detection is carried out in units of audio segments;
Fig. 4 is a flowchart of the audio-frame clustering part of the speech detection method according to an embodiment of the invention; as shown in Fig. 4, step S30 further comprises the following steps:
Step S31: for the frames awaiting screening, divide the audio signal into several subbands in the mel domain in consideration of the perceptual characteristics of the human auditory system, i.e., obtain the signal of each subband through mel filters;
Step S32: for each frame, compute the entropy of each subband to measure the proportion of energy in that subband, and set the weight of each subband according to auditory perception properties: the low-frequency subbands, which reflect formant characteristics, receive relatively large weights, while the high-frequency subbands receive relatively small weights;
Step S33: with the subband entropies as feature parameters, compute the similarity of adjacent speech frames, taking the subband weights into account; then, according to a metric function conventional in the prior art, group consecutive frames with similar features into one audio segment, such that the distance between any two frames within a segment is less than a threshold T.
In this way, based on the subband entropies of the speech frames, the audio signal is divided into audio segments each containing similar frames, and subsequent speech detection is carried out in units of audio segments.
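Step S33 can be sketched as follows. Here the similarity measure is a weighted Euclidean distance between subband-entropy vectors, and a new segment is started when a frame's distance to the first frame of the current segment exceeds T; the weights, the particular metric and T are illustrative assumptions:

```python
import numpy as np

def segment_frames(features, weights, T=1.0):
    """Group consecutive frames into segments: a new segment starts when
    the weighted distance of a frame to the first frame of the current
    segment exceeds T. features: (n_frames, n_subbands)."""
    w = np.asarray(weights, dtype=float)
    segments, start = [], 0
    for i in range(1, len(features)):
        d = np.sqrt(np.sum(w * (features[i] - features[start]) ** 2))
        if d > T:
            segments.append((start, i))
            start = i
    segments.append((start, len(features)))
    return segments

feats = np.array([[0.0, 0.0]] * 4 + [[5.0, 5.0]] * 3)
w = [0.7, 0.3]   # larger weight on the low-frequency subband
print(segment_frames(feats, w))   # [(0, 4), (4, 7)]
```

Four similar frames followed by three different ones yield two segments, each of which is then classified as a whole in step S40.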
Step S40: for each audio segment awaiting screening, compute the mean of the mel-cepstral coefficients of the frames in the segment, input the resulting mean parameters into the speech Gaussian mixture model and the various non-speech Gaussian mixture models, make a segment-level decision on whether the segment contains speech according to the output probability of each model, and obtain the final speech detection result.
Fig. 5 is a flowchart of the segment-level decision by Gaussian mixture models in the speech detection method according to an embodiment of the invention; as shown in Fig. 5, step S40 is specifically as follows: first extract the M-order (for example 13-order) static mel-cepstral coefficients of each frame in the segment awaiting screening, then compute their first-order and second-order differences, obtaining 3*M mel-cepstral coefficients in total; compute the mean of these coefficients over the frames of the segment and use the 3*M-dimensional mean for speech detection: input it into the Gaussian mixture model of the speech signal and the Gaussian mixture models of the various non-speech signals; if the speech-signal model outputs the maximum probability, the segment is judged to be speech, and otherwise non-speech.
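The segment-level decision can be sketched with diagonal-covariance GMMs scored in NumPy. The models below are toy 2-D stand-ins for the trained 3*M-dimensional mixtures, and the MFCC extraction itself is assumed to have been done elsewhere:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    logs = []
    for w, m, v in zip(weights, means, variances):
        ll = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        logs.append(np.log(w) + ll)
    logs = np.array(logs)
    mx = logs.max()
    return mx + np.log(np.exp(logs - mx).sum())   # log-sum-exp

def segment_decision(mean_mfcc, models):
    """models: {'speech': (w, mu, var), 'noise': ..., ...}.
    The segment is speech iff the speech model scores highest."""
    scores = {name: gmm_loglik(mean_mfcc, *p) for name, p in models.items()}
    return max(scores, key=scores.get) == "speech"

# Toy single-component models standing in for the trained mixtures.
models = {
    "speech": ([1.0], [np.zeros(2)], [np.ones(2)]),
    "noise":  ([1.0], [np.full(2, 5.0)], [np.ones(2)]),
}
print(segment_decision([0.2, -0.1], models))   # True: nearer the speech mean
```

With one Gaussian per model the decision reduces to a nearest-mean test; the trained models of the embodiment use 32 components each, but the scoring logic is the same.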
In step S40, the Gaussian mixture models of the speech signal and of the various non-speech signals must also be trained, each with audio of the corresponding type; this ensures the robustness of the models and improves the accuracy of speech detection. The class of each audio file must be labeled for training.
Fig. 6 is a flowchart of the offline training process of the Gaussian mixture models in the speech detection method according to an embodiment of the invention; as shown in Fig. 6, the training of the Gaussian mixture models further comprises the following steps:
Step S41: filter all audio in the training corpus: apply the time-domain and frequency-domain analysis of steps S10 and S20 to the audio signal and reject part of the non-speech signal; the subsequent steps train only on the remaining signal awaiting screening;
Step S42: classify the filtered audio signal according to the audio-class labels, i.e., divide it into speech and non-speech; the non-speech signal is further classified by its characteristics (in an embodiment of the invention, non-speech is divided into background music, animal sounds, stationary noise and non-stationary noise, and a Gaussian mixture model is trained for each non-speech type);
Step S43: extract mel-cepstral coefficients from the classified audio signal frame by frame: first extract the M-order static parameters, then compute their first-order and second-order differences, obtaining 3*M-dimensional parameters; group consecutive frames with similar features into audio segments by the method of step S30, and compute the mean of the coefficients over the frames of each segment as the feature parameter for training the Gaussian mixture models;
Step S44: train Gaussian mixture models for the speech signal and for each class of non-speech signal with the 3*M-order mel-cepstral coefficients, i.e., determine the weight, mean and variance of each Gaussian component of each model by iterative EM training; in an embodiment of the invention, each Gaussian mixture model contains 32 Gaussian components.
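The EM training of step S44 can be sketched for a diagonal-covariance mixture as below; the crude deterministic initialization, the component count k=2 and the synthetic 2-D data are illustrative simplifications of the embodiment's 32-component models:

```python
import numpy as np

def train_diag_gmm(X, k=2, iters=50):
    """Minimal EM for a diagonal-covariance GMM: alternately compute
    component responsibilities (E-step) and re-estimate each component's
    weight, mean and variance (M-step)."""
    n, d = X.shape
    w = np.full(k, 1.0 / k)
    # crude deterministic init: k points spread along the data ordering
    order = np.argsort(X.sum(axis=1))
    mu = X[order[np.linspace(0, n - 1, k).astype(int)]].astype(float)
    var = np.tile(X.var(axis=0) + 1e-3, (k, 1))
    for _ in range(iters):
        # E-step: per-point log responsibilities of each component
        log_r = np.stack([
            np.log(w[j]) - 0.5 * np.sum(
                np.log(2 * np.pi * var[j]) + (X - mu[j]) ** 2 / var[j],
                axis=1)
            for j in range(k)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])
w, mu, var = train_diag_gmm(X, k=2)
```

On two well-separated synthetic clusters the fitted means land near the true cluster centres with roughly equal weights, which is the behaviour the trained speech and non-speech mixtures rely on.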
In summary, the invention proposes an efficient speech detection method that performs two-stage detection on an audio stream. First the speech signal is analyzed in the time and frequency domains and divided, by reasonable parameter thresholds, into non-speech data and data awaiting screening. Then the data awaiting screening are examined with robust parametric models to judge whether they contain speech. The method can detect speech data quickly and accurately under various stationary and non-stationary random noise environments, and can effectively distinguish speech from various non-speech signals without being restricted by speaker, environment or language.
It should be noted that the implementation of each component is not limited to the implementations mentioned in the embodiments; those of ordinary skill in the art may replace them in obvious ways, for example:
(1) When the speech signal is analyzed in the frequency domain, the band is divided according to the auditory characteristics of the human ear into five subbands, 0-500 Hz, 500-1000 Hz, 1000-2000 Hz, 2000-3000 Hz and 3000-4000 Hz. Other subband division methods may be substituted, for example dividing the subbands with mel filters.
(2) In building the Gaussian mixture models, the prescribed number of Gaussian components may be adjusted; for example, the speech model may contain 32 Gaussian distributions and the non-speech models 64 Gaussian distributions.
The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the invention. It should be understood that they are merely specific embodiments of the invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included within its scope of protection.

Claims (9)

1. A speech detection method, characterized in that the method comprises the following steps:
Step S10: obtain the original audio, analyze its short-time energy and short-time zero-crossing rate in the time domain, and use these two parameters to reject part of the non-speech signal in the original audio;
Step S20: for the audio signal retained by step S10, analyze the spectral-envelope characteristics and entropy characteristics of its subbands in the frequency domain, and further reject part of the non-speech signal;
Step S30: for the retained frames awaiting screening, group consecutive frames with similar features into audio segments;
Step S40: for each audio segment awaiting screening, use Gaussian mixture models to make a segment-level decision on whether the segment contains speech, and obtain the final speech detection result;
wherein step S30 further comprises the following steps:
Step S31: for the frames awaiting screening, divide the audio signal into several subbands in the mel domain in consideration of the perceptual characteristics of the human auditory system;
Step S32: for each frame, compute the entropy of each subband to measure the proportion of energy in that subband, and set the weight of each subband according to auditory perception properties;
Step S33: with the subband entropies as feature parameters, compute the similarity of adjacent speech frames, taking the subband weights into account, and then, according to a metric function, group consecutive frames with similar features into one audio segment.
2. The method according to claim 1, characterized in that step S10 further comprises the following steps:
Step S11: divide the original audio into frames at equal intervals and compute the short-time energy and short-time zero-crossing rate of each frame;
Step S12: compare the short-time energy and short-time zero-crossing rate of each frame against preset low and high thresholds, classify each frame as silence, transition or speech according to the comparison, remove the silence-segment and transition-segment signal from the original audio, and retain only the speech-segment signal.
3. The method according to claim 2, characterized in that: if the short-time energy or the short-time zero-crossing rate exceeds the low threshold, entry into a transition segment is marked; within a transition segment, if both parameters fall back below the low threshold, a silence segment is entered; within a transition segment, if either parameter exceeds the high threshold, a speech segment is considered to have begun; within a speech segment, if both parameters drop below the low threshold for longer than a predetermined threshold, the speech segment is considered to have ended.
4. The method according to claim 1, characterized in that, in step S20, analyzing the statistical properties of the spectral envelope of each subband in the frequency domain comprises the following steps:
First, divide the audio signal into several subbands;
Then, apply band-pass filtering within the frequency range of each subband to obtain the audio signal of each subband;
Next, apply the Hilbert transform to each subband signal to obtain the spectral envelope of each subband;
Finally, analyze the statistical properties of the envelope signals of the subband containing obvious formant characteristics and of the subband containing more noise components.
5. The method according to claim 4, characterized in that the statistical properties of the spectral-envelope signal comprise the mean and variance of the envelope; the specific features computed are: the envelope variance of the subband containing obvious formant characteristics; and the difference between the envelope means of the subband containing obvious formant characteristics and the subband containing more noise components.
6. The method according to claim 1, characterized in that, in step S20, analyzing the subband entropy characteristics in the frequency domain comprises the following steps:
First, in long-span mode, compute the entropy at each frequency of the current frame using the current frame and several adjacent frames;
Then, compute the mean and variance of the entropy within a particular subband range to determine the complexity of the current speech frame.
7. The method according to claim 1, wherein in step S20, the step of further rejecting part of the non-speech signals in the audio signal according to the spectral envelope statistics and the entropy characteristics of each subband comprises the following steps:
Step S21: for each frame of the audio signal, first applying high-pass filtering to remove power-frequency interference, and then windowing the high-pass-filtered audio signal;
Step S22: dividing the windowed audio signal into N frequency bands, and performing band-pass filtering on the audio signal within each of these bands to obtain the audio signals of N subbands;
Step S23: applying a Hilbert transform to the audio signal of each subband to obtain the corresponding spectral envelope signal;
Step S24: performing statistical analysis on the spectral envelope signals obtained in step S23 to produce a spectral-envelope decision output;
Step S25: computing the Fourier magnitude spectrum of the current audio frame and several adjacent frames to obtain the Fourier magnitude of each frequency bin in the different frames; for each frequency bin, computing the entropy of the current frame at that bin using the adjacent frames; and, within the subband range containing prominent formant characteristics, computing the variance of the entropy over the frequency bins as the long-span decision output;
Step S26: fusing the two decision outputs obtained in steps S24 and S25 into a combined decision to obtain the final frequency-domain decision result; if the frequency-domain decision result is above a threshold, the frame is labeled as a speech frame; if below the threshold, the frame is labeled as a non-speech frame.
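The preprocessing of step S21 might look like the following sketch. The one-pole high-pass filter and the Hamming window are common choices I am assuming for illustration; the patent does not specify the filter design or window type.

```python
import math

def high_pass(x, alpha=0.95):
    """First-order high-pass y[n] = x[n] - x[n-1] + alpha*y[n-1],
    attenuating DC and low-frequency power-line interference (step S21)."""
    y, prev_x, prev_y = [], 0.0, 0.0
    for s in x:
        cur = s - prev_x + alpha * prev_y
        y.append(cur)
        prev_x, prev_y = s, cur
    return y

def hamming(n):
    """Hamming window of length n, applied to each frame after filtering."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def preprocess_frame(frame):
    w = hamming(len(frame))
    return [a * b for a, b in zip(high_pass(frame), w)]

# A constant (DC) frame is driven toward zero by the high-pass stage.
out = preprocess_frame([1.0] * 400)
```

After this stage the frame is ready for the subband filtering of step S22; the decision fusion of step S26 then reduces to comparing a weighted sum of the two decision outputs against the threshold.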
8. The method according to claim 1, wherein step S40 specifically comprises:
for each audio segment to be screened, computing the mean of the mel-frequency cepstral coefficients of the frames in the segment, feeding the resulting mean parameters respectively into a speech Gaussian mixture model and into various non-speech Gaussian mixture models, and making a segment-level decision on whether the audio segment contains speech data according to the output probability of each model, thereby obtaining the final speech detection result.
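The segment-level decision of claim 8 amounts to comparing the likelihood of the segment's MFCC-mean vector under each model and picking the most probable one. A toy diagonal-covariance sketch follows; the model parameters and the 2-D "MFCC mean" are invented for illustration, not trained values.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance Gaussian
    mixture model with the given component weights, means, and variances."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        log_comp = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_comp += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        total += math.exp(log_comp)
    return math.log(total)

def classify_segment(mfcc_mean, models):
    """Pick the model (speech, or one of the non-speech classes) with the
    highest output probability, as in step S40 / claim 8."""
    return max(models, key=lambda name: gmm_log_likelihood(mfcc_mean, *models[name]))

# Hypothetical single-component models over a 2-D "MFCC mean" vector.
models = {
    "speech":     ([1.0], [[0.0, 0.0]], [[1.0, 1.0]]),
    "non_speech": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}
label = classify_segment([0.3, -0.2], models)
```

In the patent's setting the input vector would be the 3*M-dimensional MFCC mean of claim 9, and there would be one model per non-speech category rather than a single "non_speech" entry.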
9. The method according to claim 1, wherein the training of the Gaussian mixture models in step S40 specifically comprises:
Step S41: pre-filtering the audio of the entire training audio corpus, performing time-domain and frequency-domain analysis on the audio signals using the methods of step S10 and step S20 respectively, and rejecting part of the non-speech signals therein, so that the subsequent steps train only on the remaining audio signals to be screened;
Step S42: classifying the filtered audio signals according to their audio-category labels, i.e., dividing the filtered audio signals into speech signals and non-speech signals;
Step S43: extracting mel-frequency cepstral coefficients from the classified audio signals frame by frame, first extracting M static parameters and then computing their first-order and second-order differences respectively, so that 3*M-dimensional parameters are finally obtained; grouping consecutive frames with similar features into audio segments using the method of step S30; and computing the mean of the mel-frequency cepstral coefficients of the frames in each segment as the feature parameters for training the Gaussian mixture models;
Step S44: training Gaussian mixture models separately for the speech signals and for the different categories of non-speech signals, i.e., determining the weight, mean, and variance of each Gaussian component in the different Gaussian mixture models by iterative EM training.
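The EM training of step S44 can be sketched for the 1-D, two-component case. This is a toy rendering of the standard EM updates, not the patent's implementation: real training would operate on the 3*M-dimensional MFCC means, with one mixture per audio category.

```python
import math

def em_gmm_1d(data, means, variances=None, weights=None, iters=50):
    """EM for a 1-D Gaussian mixture: alternately compute responsibilities
    (E-step) and re-estimate each component's weight, mean, and variance
    (M-step), the three quantities named in step S44."""
    k = len(means)
    means = list(means)
    variances = list(variances or [1.0] * k)
    weights = list(weights or [1.0 / k] * k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        resp = []
        for x in data:
            dens = [w / math.sqrt(2 * math.pi * v) * math.exp(-(x - m) ** 2 / (2 * v))
                    for w, m, v in zip(weights, means, variances)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate parameters from the responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-4)  # variance floor to keep components from collapsing
    return weights, means, variances

# Two well-separated deterministic clusters, around 0 and around 10.
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2]
w, m, v = em_gmm_1d(data, means=[1.0, 9.0])
```

With reasonable initial means, EM converges to one component per cluster; the variance floor is a practical safeguard I have added, since a component locked onto a single point would otherwise drive its variance to zero.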
CN201310743203.5A 2013-12-30 2013-12-30 A kind of speech detection method efficiently Active CN103646649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310743203.5A CN103646649B (en) 2013-12-30 2013-12-30 A kind of speech detection method efficiently

Publications (2)

Publication Number Publication Date
CN103646649A CN103646649A (en) 2014-03-19
CN103646649B true CN103646649B (en) 2016-04-13

Family

ID=50251851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310743203.5A Active CN103646649B (en) 2013-12-30 2013-12-30 A kind of speech detection method efficiently

Country Status (1)

Country Link
CN (1) CN103646649B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102214888B1 (en) * 2016-10-12 2021-02-15 Advanced New Technologies Co., Ltd. Method and device for detecting an audio signal

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104318927A (en) * 2014-11-04 2015-01-28 东莞市北斗时空通信科技有限公司 Anti-noise low-bitrate speech coding method and decoding method
CN104464722B (en) * 2014-11-13 2018-05-25 北京云知声信息技术有限公司 Voice activity detection method and apparatus based on time domain and frequency domain
CN104934043A (en) * 2015-06-17 2015-09-23 广东欧珀移动通信有限公司 Audio processing method and device
CN105118522B (en) * 2015-08-27 2021-02-12 广州市百果园网络科技有限公司 Noise detection method and device
CN105788592A (en) * 2016-04-28 2016-07-20 乐视控股(北京)有限公司 Audio classification method and apparatus thereof
CN105843400A (en) * 2016-05-05 2016-08-10 广东小天才科技有限公司 Somatosensory interaction method and device and wearable device
CN106020445A (en) * 2016-05-05 2016-10-12 广东小天才科技有限公司 Method for automatically identifying wearing by left hand and right hand and wearing equipment
CN107919116B (en) * 2016-10-11 2019-09-13 芋头科技(杭州)有限公司 A kind of voice-activation detecting method and device
KR102179511B1 (en) 2016-10-14 2020-11-16 코우리츠 다이가꾸 호우진 오사카 Swallowing diagnostic device and program
CN107957918B (en) * 2016-10-14 2019-05-10 腾讯科技(深圳)有限公司 Data reconstruction method and device
CN106548782A (en) * 2016-10-31 2017-03-29 维沃移动通信有限公司 The processing method and mobile terminal of acoustical signal
CN106653047A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Automatic gain control method and device for audio data
CN106782508A (en) * 2016-12-20 2017-05-31 美的集团股份有限公司 The cutting method of speech audio and the cutting device of speech audio
CN107039035A (en) * 2017-01-10 2017-08-11 上海优同科技有限公司 A kind of detection method of voice starting point and ending point
CN107045870B (en) * 2017-05-23 2020-06-26 南京理工大学 Speech signal endpoint detection method based on characteristic value coding
CN107910017A (en) * 2017-12-19 2018-04-13 河海大学 A kind of method that threshold value is set in noisy speech end-point detection
CN108269566B (en) * 2018-01-17 2020-08-25 南京理工大学 Rifling wave identification method based on multi-scale sub-band energy set characteristics
CN109036470B (en) * 2018-06-04 2023-04-21 平安科技(深圳)有限公司 Voice distinguishing method, device, computer equipment and storage medium
CN108831508A (en) * 2018-06-13 2018-11-16 百度在线网络技术(北京)有限公司 Voice activity detection method, device and equipment
CN109147795B (en) * 2018-08-06 2021-05-14 珠海全志科技股份有限公司 Voiceprint data transmission and identification method, identification device and storage medium
CN109347580B (en) * 2018-11-19 2021-01-19 湖南猎航电子科技有限公司 Self-adaptive threshold signal detection method with known duty ratio
CN111261143B (en) * 2018-12-03 2024-03-22 嘉楠明芯(北京)科技有限公司 Voice wakeup method and device and computer readable storage medium
CN109448750B (en) * 2018-12-20 2023-06-23 西京学院 Speech enhancement method for improving speech quality of biological radar
CN109801646B (en) * 2019-01-31 2021-11-16 嘉楠明芯(北京)科技有限公司 Voice endpoint detection method and device based on fusion features
CN111916068B (en) * 2019-05-07 2024-07-23 北京地平线机器人技术研发有限公司 Audio detection method and device
CN110097895B (en) * 2019-05-14 2021-03-16 腾讯音乐娱乐科技(深圳)有限公司 Pure music detection method, pure music detection device and storage medium
CN110349597B (en) * 2019-07-03 2021-06-25 山东师范大学 Voice detection method and device
CN110600010B (en) * 2019-09-20 2022-05-17 度小满科技(北京)有限公司 Corpus extraction method and apparatus
CN110636176B (en) * 2019-10-09 2022-05-17 科大讯飞股份有限公司 Call fault detection method, device, equipment and storage medium
CN111415685A (en) * 2020-03-26 2020-07-14 腾讯科技(深圳)有限公司 Audio signal detection method, device, equipment and computer readable storage medium
CN111398944B (en) * 2020-04-09 2022-05-17 浙江大学 Radar signal processing method for identity recognition
CN111883182B (en) * 2020-07-24 2024-03-19 平安科技(深圳)有限公司 Human voice detection method, device, equipment and storage medium
CN112466331A (en) * 2020-11-11 2021-03-09 昆明理工大学 Voice music classification model based on beat spectrum characteristics
CN112562735B (en) * 2020-11-27 2023-03-24 锐迪科微电子(上海)有限公司 Voice detection method, device, equipment and storage medium
CN112528920A (en) * 2020-12-21 2021-03-19 杭州格像科技有限公司 Pet image emotion recognition method based on depth residual error network
CN112767920A (en) * 2020-12-31 2021-05-07 深圳市珍爱捷云信息技术有限公司 Method, device, equipment and storage medium for recognizing call voice
CN113160853A (en) * 2021-03-31 2021-07-23 深圳鱼亮科技有限公司 Voice endpoint detection method based on real-time face assistance
CN113192488B (en) * 2021-04-06 2022-05-06 青岛信芯微电子科技股份有限公司 Voice processing method and device
CN113541867A (en) * 2021-06-30 2021-10-22 南京奥通智能科技有限公司 Remote communication module for converged terminal
CN113593599A (en) * 2021-09-02 2021-11-02 北京云蝶智学科技有限公司 Method for removing noise signal in voice signal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197130A (en) * 2006-12-07 2008-06-11 华为技术有限公司 Sound activity detecting method and detector thereof
CN102473412A (en) * 2009-07-21 2012-05-23 日本电信电话株式会社 Audio signal section estimateing apparatus, audio signal section estimateing method, program therefor and recording medium
CN103165127A (en) * 2011-12-15 2013-06-19 佳能株式会社 Sound segmentation equipment, sound segmentation method and sound detecting system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100513175B1 (en) * 2002-12-24 2005-09-07 한국전자통신연구원 A Voice Activity Detector Employing Complex Laplacian Model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech Activation Detection Algorithm Combining Model and Energy for Speaker Recognition; Zhang Zhao, Guo Wu; Journal of Chinese Computer Systems; 2010-09-30; Vol. 31, No. 9; pp. 1914-1917 *
Research on Voice Activity Detection Algorithms and Their Application in Speech Coders; Shen Hongli; Wanfang Data; 2012-04-26; Sections 2.2.3 and 3.3 *

Also Published As

Publication number Publication date
CN103646649A (en) 2014-03-19

Similar Documents

Publication Publication Date Title
CN103646649B (en) A kind of speech detection method efficiently
CN103854662B (en) Adaptive voice detection method based on multiple domain Combined estimator
Evangelopoulos et al. Multiband modulation energy tracking for noisy speech detection
CN103489446B (en) Based on the twitter identification method that adaptive energy detects under complex environment
CN101197130B (en) Sound activity detecting method and detector thereof
Meyer et al. Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition
US20090076814A1 (en) Apparatus and method for determining speech signal
CN104318927A (en) Anti-noise low-bitrate speech coding method and decoding method
CN104157290A (en) Speaker recognition method based on depth learning
CN104008751A (en) Speaker recognition method based on BP neural network
CN103489454A (en) Voice endpoint detection method based on waveform morphological characteristic clustering
Ghaemmaghami et al. Noise robust voice activity detection using features extracted from the time-domain autocorrelation function
CN110136709A (en) Audio recognition method and video conferencing system based on speech recognition
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
Zhang et al. Fault diagnosis method based on MFCC fusion and SVM
Chu et al. A noise-robust FFT-based auditory spectrum with application in audio classification
Thomas et al. Acoustic and data-driven features for robust speech activity detection
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
Papadopoulos et al. Global SNR Estimation of Speech Signals for Unknown Noise Conditions Using Noise Adapted Non-Linear Regression.
CN110265049A (en) A kind of audio recognition method and speech recognition system
TWI749547B (en) Speech enhancement system based on deep learning
Park et al. Frequency of Interest-based Noise Attenuation Method to Improve Anomaly Detection Performance
Pasad et al. Voice activity detection for children's read speech recognition in noisy conditions
CN115662464B (en) Method and system for intelligently identifying environmental noise

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20170508

Address after: 100094, No. 4, building A, No. 1, building 2, wing Cheng North Road, No. 405-346, Beijing, Haidian District

Patentee after: Beijing Rui Heng Heng Xun Technology Co., Ltd.

Address before: 100190 Zhongguancun East Road, Beijing, No. 95, No.

Patentee before: Institute of Automation, Chinese Academy of Sciences

TR01 Transfer of patent right

Effective date of registration: 20181218

Address after: 100190 Zhongguancun East Road, Haidian District, Haidian District, Beijing

Patentee after: Institute of Automation, Chinese Academy of Sciences

Address before: 100094 No. 405-346, 4th floor, Building A, No. 1, Courtyard 2, Yongcheng North Road, Haidian District, Beijing

Patentee before: Beijing Rui Heng Heng Xun Technology Co., Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190528

Address after: 310019 1105, 11 / F, 4 building, 9 Ring Road, Jianggan District nine, Hangzhou, Zhejiang.

Patentee after: Limit element (Hangzhou) intelligent Polytron Technologies Inc

Address before: 100190 Zhongguancun East Road, Haidian District, Haidian District, Beijing

Patentee before: Institute of Automation, Chinese Academy of Sciences

TR01 Transfer of patent right
CP01 Change in the name or title of a patent holder

Address after: 310019 1105, 11 / F, 4 building, 9 Ring Road, Jianggan District nine, Hangzhou, Zhejiang.

Patentee after: Zhongke extreme element (Hangzhou) Intelligent Technology Co., Ltd

Address before: 310019 1105, 11 / F, 4 building, 9 Ring Road, Jianggan District nine, Hangzhou, Zhejiang.

Patentee before: Limit element (Hangzhou) intelligent Polytron Technologies Inc.

CP01 Change in the name or title of a patent holder