CN103646649B - An efficient speech detection method - Google Patents

An efficient speech detection method

Info

Publication number
CN103646649B
CN103646649B · CN201310743203.5A
Authority
CN
China
Prior art keywords
audio
speech
frame
subband
sound signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310743203.5A
Other languages
Chinese (zh)
Other versions
CN103646649A (en)
Inventor
Tao Jianhua (陶建华)
Liu Bin (刘斌)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Extreme Element Hangzhou Intelligent Technology Co Ltd
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201310743203.5A priority Critical patent/CN103646649B/en
Publication of CN103646649A publication Critical patent/CN103646649A/en
Application granted granted Critical
Publication of CN103646649B publication Critical patent/CN103646649B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a speech detection method comprising the following steps: in the time domain, analyzing the short-time energy and short-time zero-crossing rate of the original audio and rejecting part of the non-speech signal; in the frequency domain, analyzing the spectral-envelope characteristics and entropy characteristics of the subbands of the remaining audio signal and further rejecting part of the non-speech signal; grouping consecutive frames with similar features among the retained audio frames into audio segments; computing the mean of the mel-cepstral coefficients of the frames in each segment, inputting it into a speech Gaussian mixture model and various non-speech Gaussian mixture models, and making a segment-level decision on whether the segment contains speech according to the output probability of each model, thereby obtaining the final speech detection result. The invention can detect speech signals in an audio data stream under various complex environments and can locate the boundaries between speech and non-speech segments relatively accurately.

Description

An efficient speech detection method
Technical field
The present invention relates to the field of intelligent information processing, and in particular to an efficient speech detection method.
Background art
Speech is one of the main means by which humans exchange information, and speech detection has always occupied an important position in the field of speech signal processing. A speech detection system serves as a preprocessing module for speech recognition, speaker recognition, speech coding and the like, and its robustness directly affects the performance of those downstream speech processing modules. How to locate speech segments accurately and efficiently amid the random noise of various complex environments, and to distinguish speech from non-speech signals effectively, has become a research hotspot at home and abroad and is attracting wide attention. Speech detection has great practical value, and high-quality robust speech detection techniques are widely used in communication systems, multimedia systems, speech recognition systems and voiceprint recognition systems.
Mainstream speech detection methods fall into parameter-based methods and model-based methods. Parameter-based methods analyze the speech signal at the signal level: speech parameters are computed in the time domain, frequency domain or another transform domain, and reasonable thresholds are set to test whether the audio stream contains speech; commonly used parameters include short-time energy, short-time zero-crossing rate, the energy proportion of each frequency band, harmonic components and so on. Model-based methods train models on large-scale speech data and distinguish speech from various non-speech signals accurately with intelligent mathematical models; common examples are methods based on Gaussian mixture models, artificial neural networks and hidden Markov models. Model-based methods require labeled large-scale data to train a reliable detection model and are therefore supervised; parameter-based methods need no trained mathematical model and are unsupervised. Current mainstream methods can detect speech quickly and accurately in quiet environments, and achieve high accuracy under stationary noise and under non-stationary noise with a high signal-to-noise ratio; under the various non-stationary random noises of complex environments, however, their performance degrades severely.
Summary of the invention
To solve one or more of the above problems, the invention provides an efficient speech detection method that can detect speech signals in an audio stream quickly and accurately under various complex environments and can locate the boundaries between speech and non-speech segments relatively accurately.
A kind of speech detection method provided by the invention comprises the following steps:
Step S10: obtain the original audio, analyze its short-time energy and short-time zero-crossing rate in the time domain, and use these two parameters to reject part of the non-speech signal in the original audio;
Step S20: for the audio signal retained by step S10, analyze the spectral-envelope characteristics and entropy characteristics of its subbands in the frequency domain, and further reject part of the non-speech signal;
Step S30: for the retained frames awaiting screening, group consecutive frames with similar features into audio segments;
Step S40: for each audio segment awaiting screening, use Gaussian mixture models to make a segment-level decision on whether the segment contains speech, and obtain the final speech detection result.
As can be seen from the above technical scheme, the invention provides an efficient and robust speech detection method with the following beneficial effects:
(1) The method can serve as the front-end module of various speech recognition systems, accurately rejecting the non-speech data in the audio stream to be recognized and thereby improving the efficiency and robustness of the recognition system;
(2) The method can serve as the front-end module of various speech coding systems, accurately locating the boundaries between speech and non-speech segments so that the coder transmits only the speech segments, improving communication efficiency;
(3) The method can detect speech data quickly and accurately under various stationary and non-stationary random noise environments, and can effectively distinguish speech from various non-speech signals without being restricted by speaker, environment or language.
Brief description of the drawings
Fig. 1 is a flowchart of a speech detection method according to an embodiment of the invention;
Fig. 2 is a flowchart of the time-domain analysis part of the speech detection method according to an embodiment of the invention;
Fig. 3 is a flowchart of the frequency-domain analysis part of the speech detection method according to an embodiment of the invention;
Fig. 4 is a flowchart of the audio-frame clustering part of the speech detection method according to an embodiment of the invention;
Fig. 5 is a flowchart of the segment-level decision by Gaussian mixture models in the speech detection method according to an embodiment of the invention;
Fig. 6 is a flowchart of the offline training process of the Gaussian mixture models in the speech detection method according to an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that, in the drawings and the description, similar or identical parts use the same reference numerals. Implementations not shown or described in the drawings are of forms known to those of ordinary skill in the art. In addition, although examples of parameters with particular values may be provided herein, the parameters need not exactly equal those values; they may approximate them within acceptable error margins or design constraints.
The invention proposes an efficient speech detection mechanism that performs two-stage speech detection on an audio stream. First, the original audio is divided by time-domain and frequency-domain features into non-speech data and data awaiting screening; then the data awaiting screening are segmented by spectrogram features, and speech detection is carried out segment by segment with a Gaussian mixture model of speech data and Gaussian mixture models of non-speech data.
In general, the speech detection method comprises a time-domain analysis step, a frequency-domain analysis step, an audio clustering step and a segment-level decision step. Fig. 1 is a flowchart of the speech detection method according to an embodiment of the invention; as shown in Fig. 1, the method comprises the following steps:
Step S10: obtain the original audio, analyze its short-time energy and short-time zero-crossing rate in the time domain, and use these two parameters to reject part of the non-speech signal in the original audio;
Short-time energy is effective for detecting voiced sounds, and short-time zero-crossing rate is effective for detecting unvoiced sounds; combining the two parameters allows part of the non-speech signal to be rejected effectively.
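As an illustration of these two time-domain parameters, they can be computed per frame as below. This is a minimal NumPy sketch; the frame length, hop size and test signals are arbitrary choices for demonstration, not values prescribed by the patent:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into equally spaced frames (step S11)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    """Sum of squared samples per frame; high for voiced speech."""
    return np.sum(frames.astype(float) ** 2, axis=1)

def short_time_zcr(frames):
    """Fraction of sign changes per frame; high for unvoiced speech."""
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

t = np.arange(4096) / 8000.0
voiced = 0.8 * np.sin(2 * np.pi * 200 * t)   # loud low-frequency tone
silence = np.zeros(4096)

e_v = short_time_energy(frame_signal(voiced))
e_s = short_time_energy(frame_signal(silence))
```

Energy cleanly separates the loud tone from silence, while the zero-crossing rate would separate noise-like unvoiced sounds from low-frequency voiced ones.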
Fig. 2 is a flowchart of the time-domain analysis part of the speech detection method according to an embodiment of the invention; as shown in Fig. 2, step S10 further comprises the following steps:
Step S11: divide the original audio into frames at equal intervals and compute the short-time energy and short-time zero-crossing rate of each frame;
Step S12: compare the short-time energy and short-time zero-crossing rate of each frame against preset low and high thresholds, classify each frame as silence, transition or speech according to the comparison, remove the silence-segment and transition-segment signal from the original audio, and retain only the speech-segment signal.
The comparison of each frame's short-time energy and short-time zero-crossing rate against the preset low and high thresholds, and the resulting classification into silence, transition and speech segments, proceeds as follows: if the short-time energy or the short-time zero-crossing rate exceeds the low threshold, entry into a transition segment is marked; within a transition segment, if both parameters fall back below the low threshold, a silence segment is entered; within a transition segment, if either parameter exceeds the high threshold, a speech segment is considered to have begun; within a speech segment, if both parameters drop below the low threshold for longer than a predetermined threshold, the speech segment is considered to have ended.
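The double-threshold logic above amounts to a small state machine. The sketch below illustrates it in plain Python; the threshold values, the use of one shared low/high pair for both parameters, and the hangover length `min_end` are illustrative assumptions rather than values from the patent:

```python
def classify_frames(energy, zcr, lo=1.0, hi=5.0, min_end=3):
    """Label each frame 'silence', 'transition' or 'speech' with the
    double-threshold rules: the low threshold enters a transition
    segment, the high threshold confirms speech, and a speech segment
    ends only after both parameters stay below the low threshold for
    min_end consecutive frames."""
    labels, state, below = [], "silence", 0
    for e, z in zip(energy, zcr):
        if state == "silence":
            if e > lo or z > lo:
                state = "transition"
        elif state == "transition":
            if e > hi or z > hi:
                state = "speech"
            elif e < lo and z < lo:
                state = "silence"
        elif state == "speech":
            if e < lo and z < lo:
                below += 1
                if below >= min_end:
                    state, below = "silence", 0
            else:
                below = 0
        labels.append(state)
    return labels

labels = classify_frames([0, 2, 8, 8, 8, 0.5, 0.5, 0.5, 0], [0] * 9)
```

On this toy sequence the frames ramp through transition into speech, and the trailing low-energy frames end the speech segment only after the hangover expires.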
Step S20: for the audio signal retained by step S10, analyze the spectral-envelope characteristics and entropy characteristics of its subbands in the frequency domain, and further reject part of the non-speech signal;
Analyzing the subband spectral-envelope characteristics in the frequency domain comprises the following steps:
First, divide the audio signal into several subbands;
Then, apply band-pass filtering within the frequency range of each subband to obtain the audio signal of each subband;
Next, apply the Hilbert transform to each subband signal to obtain the spectral envelope of each subband;
Finally, analyze the statistical properties of the envelope signals of the subband containing obvious formant characteristics and of the subband containing more noise components.
The statistical properties of the envelope signal comprise the mean and variance of the envelope; the specific features computed are: (1) the envelope variance of the subband containing obvious formant characteristics; (2) the difference between the envelope means of the subband containing obvious formant characteristics and the subband containing more noise components.
Analyzing the subband entropy characteristics in the frequency domain comprises the following steps:
First, in long-span mode, compute the entropy at each frequency of the current frame using the current frame and several adjacent frames;
Then, compute the mean and variance of the entropy within a particular subband range to determine the complexity of the current speech frame.
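A minimal sketch of the long-span entropy, assuming the common formulation in which the magnitudes of one frequency bin over 2K+1 neighbouring frames are normalized into a probability distribution whose entropy is then taken; the function names and window size are ours:

```python
import numpy as np

def bin_entropy(mags):
    """Entropy of one frequency bin across neighbouring frames.
    mags: magnitudes of the bin in up to 2K+1 consecutive frames."""
    p = mags / (mags.sum() + 1e-12)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def longspan_entropy(frames_mag, k=2):
    """frames_mag: (n_frames, n_bins) magnitude spectra.
    Returns (n_frames, n_bins) entropies over a 2k+1-frame window."""
    n, b = frames_mag.shape
    out = np.zeros((n, b))
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        for j in range(b):
            out[i, j] = bin_entropy(frames_mag[lo:hi, j])
    return out

# A stationary bin (constant magnitude) has a near-uniform distribution
# across frames, hence entropy close to log(2k+1); a bursty bin whose
# energy is concentrated in one frame has entropy near zero.
mags = np.ones((9, 4))
H = longspan_entropy(mags, k=2)
```

The mean and variance of `H` within a chosen subband then serve as the frame-complexity measures described above.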
Fusing the subband spectral-envelope characteristics obtained in short-span mode with the subband entropy characteristics obtained in long-span mode rejects further non-speech signal, specifically:
For each frame of the speech signal, the subband spectral-envelope characteristics and the subband entropy characteristics are used to perform frequency-domain analysis under various complex background noises, classifying the frame as speech or non-speech and thereby rejecting further non-speech signal.
Fig. 3 is a flowchart of the frequency-domain analysis part of the speech detection method according to an embodiment of the invention; as shown in Fig. 3, the step of further rejecting part of the non-speech signal according to the subband spectral-envelope characteristics and subband entropy characteristics comprises the following steps:
Step S21: for each frame of the speech signal, first apply high-pass filtering to remove power-line interference (in an embodiment of the invention, a 4th-order Chebyshev high-pass filter is used), then apply windowing to the filtered signal (in an embodiment of the invention, a Hamming window is used);
Step S22: divide the windowed signal into N frequency bands; in an embodiment of the invention, the signal is divided into five bands, 0-500 Hz, 500-1000 Hz, 1000-2000 Hz, 2000-3000 Hz and 3000-4000 Hz; apply band-pass filtering within these ranges to obtain the signals of the N subbands (in an embodiment of the invention, 6th-order Butterworth band-pass filters are used);
Step S23: apply the Hilbert transform to the signal of each subband to obtain the corresponding spectral-envelope signal;
For voiced signals, the envelope of the 500-1000 Hz band contains obvious formant characteristics, while in noisy environments the envelope of the 3000-4000 Hz band contains more noise components; in an embodiment of the invention, the Hilbert transform is therefore applied only to the 500-1000 Hz and 3000-4000 Hz bands.
Step S24: analyze the statistical characteristics of the envelope signals obtained in step S23, computing their mean and variance within the respective subband ranges, and derive the envelope decision output;
Let μ1 denote the mean of the 500-1000 Hz subband envelope, μ2 the mean of the 3000-4000 Hz subband envelope, and σ1 and σ2 the variances of the two subband envelopes respectively. The envelope decision output VAD_envelope is formed from the features above, namely the formant-subband envelope variance and the difference of the two subband envelope means:
VAD_envelope = σ1 + (μ1 - μ2)
In this way the analysis of the subband spectral envelopes yields the decision output VAD_envelope.
Step S25: compute the Fourier magnitude spectrum of the current frame and several adjacent frames to obtain the Fourier amplitude of each frequency bin in each frame; for each frequency bin, use the adjacent frames to compute the entropy of the current frame at that bin; within the subband containing obvious formant characteristics (in an embodiment of the invention, the 500-1000 Hz band), compute the variance of the entropies over the frequency bins and output it as the long-span decision VAD_entropy;
Step S26: fuse the two decision outputs obtained in steps S24 and S25 into a combined decision to obtain the final frequency-domain decision VAD_freq, expressed with fusion weights λ1 and λ2 as:
VAD_freq = λ1·VAD_envelope + λ2·VAD_entropy
If the frequency-domain decision VAD_freq is higher than a threshold, the frame is labeled as a speech frame; if VAD_freq is lower than the threshold, the frame is labeled as a non-speech frame. In addition, the data labeled as speech frames are extended: the starting frame of each speech segment is expanded forward by 3 frames, and its ending frame is expanded backward by 3 frames.
The audio signal processed in this way has a further portion of the non-speech signal removed.
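The fusion, thresholding and three-frame extension can be sketched as below; the fusion weights and the threshold are illustrative values, not values given by the patent:

```python
import numpy as np

def frequency_domain_vad(vad_env, vad_ent, lam1=0.5, lam2=0.5, thresh=0.5):
    """Fuse the envelope and entropy decisions per frame and threshold."""
    vad_freq = lam1 * np.asarray(vad_env) + lam2 * np.asarray(vad_ent)
    return vad_freq > thresh

def expand_speech(mask, pad=3):
    """Expand each run of speech frames by `pad` frames on both sides."""
    mask = np.asarray(mask, dtype=bool)
    out = mask.copy()
    for shift in range(1, pad + 1):
        out[:-shift] |= mask[shift:]   # extend each start forward
        out[shift:] |= mask[:-shift]   # extend each end backward
    return out

mask = frequency_domain_vad([0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
                            [0, 0, 1, 1, 0, 0, 0, 0, 0, 0])
expanded = expand_speech(mask, pad=3)
```

A two-frame speech run at indices 2-3 becomes a seven-frame run covering indices 0-6 after the three-frame expansion on each side.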
Step S30: for the retained frames awaiting screening, group consecutive frames with similar features into audio segments; subsequent speech detection is carried out in units of audio segments;
Fig. 4 is a flowchart of the audio-frame clustering part of the speech detection method according to an embodiment of the invention; as shown in Fig. 4, step S30 further comprises the following steps:
Step S31: for the frames awaiting screening, divide the audio signal into several subbands in the mel domain in consideration of the perceptual characteristics of the human auditory system, i.e., obtain the signal of each subband through mel filters;
Step S32: for each frame, compute the entropy of each subband to measure the proportion of energy in that subband, and set the weight of each subband according to auditory perception properties: the low-frequency subbands, which reflect formant characteristics, receive relatively large weights, while the high-frequency subbands receive relatively small weights;
Step S33: with the subband entropies as feature parameters, compute the similarity of adjacent speech frames, taking the subband weights into account; then, according to a metric function conventional in the prior art, group consecutive frames with similar features into one audio segment, such that the distance between any two frames within a segment is less than a threshold T.
In this way, based on the subband entropies of the speech frames, the audio signal is divided into audio segments each containing similar frames, and subsequent speech detection is carried out in units of audio segments.
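Step S33 can be sketched as follows. Here the similarity measure is a weighted Euclidean distance between subband-entropy vectors, and a new segment is started when a frame's distance to the first frame of the current segment exceeds T; the weights, the particular metric and T are illustrative assumptions:

```python
import numpy as np

def segment_frames(features, weights, T=1.0):
    """Group consecutive frames into segments: a new segment starts when
    the weighted distance of a frame to the first frame of the current
    segment exceeds T. features: (n_frames, n_subbands)."""
    w = np.asarray(weights, dtype=float)
    segments, start = [], 0
    for i in range(1, len(features)):
        d = np.sqrt(np.sum(w * (features[i] - features[start]) ** 2))
        if d > T:
            segments.append((start, i))
            start = i
    segments.append((start, len(features)))
    return segments

feats = np.array([[0.0, 0.0]] * 4 + [[5.0, 5.0]] * 3)
w = [0.7, 0.3]   # larger weight on the low-frequency subband
print(segment_frames(feats, w))   # [(0, 4), (4, 7)]
```

Four similar frames followed by three different ones yield two segments, each of which is then classified as a whole in step S40.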
Step S40: for each audio segment awaiting screening, compute the mean of the mel-cepstral coefficients of the frames in the segment, input the resulting mean parameters into the speech Gaussian mixture model and the various non-speech Gaussian mixture models, make a segment-level decision on whether the segment contains speech according to the output probability of each model, and obtain the final speech detection result.
Fig. 5 is a flowchart of the segment-level decision by Gaussian mixture models in the speech detection method according to an embodiment of the invention; as shown in Fig. 5, step S40 is specifically as follows: first extract the M-order (for example 13-order) static mel-cepstral coefficients of each frame in the segment awaiting screening, then compute their first-order and second-order differences, obtaining 3*M mel-cepstral coefficients in total; compute the mean of these coefficients over the frames of the segment and use the 3*M-dimensional mean for speech detection: input it into the Gaussian mixture model of the speech signal and the Gaussian mixture models of the various non-speech signals; if the speech-signal model outputs the maximum probability, the segment is judged to be speech, and otherwise non-speech.
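The segment-level decision can be sketched with diagonal-covariance GMMs scored in NumPy. The models below are toy 2-D stand-ins for the trained 3*M-dimensional mixtures, and the MFCC extraction itself is assumed to have been done elsewhere:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    logs = []
    for w, m, v in zip(weights, means, variances):
        ll = -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
        logs.append(np.log(w) + ll)
    logs = np.array(logs)
    mx = logs.max()
    return mx + np.log(np.exp(logs - mx).sum())   # log-sum-exp

def segment_decision(mean_mfcc, models):
    """models: {'speech': (w, mu, var), 'noise': ..., ...}.
    The segment is speech iff the speech model scores highest."""
    scores = {name: gmm_loglik(mean_mfcc, *p) for name, p in models.items()}
    return max(scores, key=scores.get) == "speech"

# Toy single-component models standing in for the trained mixtures.
models = {
    "speech": ([1.0], [np.zeros(2)], [np.ones(2)]),
    "noise":  ([1.0], [np.full(2, 5.0)], [np.ones(2)]),
}
print(segment_decision([0.2, -0.1], models))   # True: nearer the speech mean
```

With one Gaussian per model the decision reduces to a nearest-mean test; the trained models of the embodiment use 32 components each, but the scoring logic is the same.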
In step S40, the Gaussian mixture models of the speech signal and of the various non-speech signals must also be trained, each with audio of the corresponding type; this ensures the robustness of the models and improves the accuracy of speech detection. The class of each audio file must be labeled for training.
Fig. 6 is a flowchart of the offline training process of the Gaussian mixture models in the speech detection method according to an embodiment of the invention; as shown in Fig. 6, the training of the Gaussian mixture models further comprises the following steps:
Step S41: filter all audio in the training corpus: apply the time-domain and frequency-domain analysis of steps S10 and S20 to the audio signal and reject part of the non-speech signal; the subsequent steps train only on the remaining signal awaiting screening;
Step S42: classify the filtered audio signal according to the audio-class labels, i.e., divide it into speech and non-speech; the non-speech signal is further classified by its characteristics (in an embodiment of the invention, non-speech is divided into background music, animal sounds, stationary noise and non-stationary noise, and a Gaussian mixture model is trained for each non-speech type);
Step S43: extract mel-cepstral coefficients from the classified audio signal frame by frame: first extract the M-order static parameters, then compute their first-order and second-order differences, obtaining 3*M-dimensional parameters; group consecutive frames with similar features into audio segments by the method of step S30, and compute the mean of the coefficients over the frames of each segment as the feature parameter for training the Gaussian mixture models;
Step S44: train Gaussian mixture models for the speech signal and for each class of non-speech signal with the 3*M-order mel-cepstral coefficients, i.e., determine the weight, mean and variance of each Gaussian component of each model by iterative EM training; in an embodiment of the invention, each Gaussian mixture model contains 32 Gaussian components.
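The EM training of step S44 can be sketched for a diagonal-covariance mixture as below; the crude deterministic initialization, the component count k=2 and the synthetic 2-D data are illustrative simplifications of the embodiment's 32-component models:

```python
import numpy as np

def train_diag_gmm(X, k=2, iters=50):
    """Minimal EM for a diagonal-covariance GMM: alternately compute
    component responsibilities (E-step) and re-estimate each component's
    weight, mean and variance (M-step)."""
    n, d = X.shape
    w = np.full(k, 1.0 / k)
    # crude deterministic init: k points spread along the data ordering
    order = np.argsort(X.sum(axis=1))
    mu = X[order[np.linspace(0, n - 1, k).astype(int)]].astype(float)
    var = np.tile(X.var(axis=0) + 1e-3, (k, 1))
    for _ in range(iters):
        # E-step: per-point log responsibilities of each component
        log_r = np.stack([
            np.log(w[j]) - 0.5 * np.sum(
                np.log(2 * np.pi * var[j]) + (X - mu[j]) ** 2 / var[j],
                axis=1)
            for j in range(k)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])
w, mu, var = train_diag_gmm(X, k=2)
```

On two well-separated synthetic clusters the fitted means land near the true cluster centres with roughly equal weights, which is the behaviour the trained speech and non-speech mixtures rely on.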
In summary, the invention proposes an efficient speech detection method that performs two-stage detection on an audio stream. First the speech signal is analyzed in the time and frequency domains and divided, by reasonable parameter thresholds, into non-speech data and data awaiting screening. Then the data awaiting screening are examined with robust parametric models to judge whether they contain speech. The method can detect speech data quickly and accurately under various stationary and non-stationary random noise environments, and can effectively distinguish speech from various non-speech signals without being restricted by speaker, environment or language.
It should be noted that the implementation of each component is not limited to the implementations mentioned in the embodiments; those of ordinary skill in the art may replace them in obvious ways, for example:
(1) When the speech signal is analyzed in the frequency domain, the band is divided according to the auditory characteristics of the human ear into five subbands, 0-500 Hz, 500-1000 Hz, 1000-2000 Hz, 2000-3000 Hz and 3000-4000 Hz. Other subband division methods may be substituted, for example dividing the subbands with mel filters.
(2) In building the Gaussian mixture models, the prescribed number of Gaussian components may be adjusted; for example, the speech model may contain 32 Gaussian distributions and the non-speech models 64 Gaussian distributions.
The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the invention. It should be understood that they are merely specific embodiments of the invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included within its scope of protection.

Claims (9)

1. A speech detection method, characterized in that the method comprises the following steps:
Step S10: obtain the original audio, analyze its short-time energy and short-time zero-crossing rate in the time domain, and use these two parameters to reject part of the non-speech signal in the original audio;
Step S20: for the audio signal retained by step S10, analyze the spectral-envelope characteristics and entropy characteristics of its subbands in the frequency domain, and further reject part of the non-speech signal;
Step S30: for the retained frames awaiting screening, group consecutive frames with similar features into audio segments;
Step S40: for each audio segment awaiting screening, use Gaussian mixture models to make a segment-level decision on whether the segment contains speech, and obtain the final speech detection result;
wherein step S30 further comprises the following steps:
Step S31: for the frames awaiting screening, divide the audio signal into several subbands in the mel domain in consideration of the perceptual characteristics of the human auditory system;
Step S32: for each frame, compute the entropy of each subband to measure the proportion of energy in that subband, and set the weight of each subband according to auditory perception properties;
Step S33: with the subband entropies as feature parameters, compute the similarity of adjacent speech frames, taking the subband weights into account, and then, according to a metric function, group consecutive frames with similar features into one audio segment.
2. The method according to claim 1, characterized in that step S10 further comprises the following steps:
Step S11: divide the original audio into frames at equal intervals and compute the short-time energy and short-time zero-crossing rate of each frame;
Step S12: compare the short-time energy and short-time zero-crossing rate of each frame against preset low and high thresholds, classify each frame as silence, transition or speech according to the comparison, remove the silence-segment and transition-segment signal from the original audio, and retain only the speech-segment signal.
3. The method according to claim 2, characterized in that: if the short-time energy or the short-time zero-crossing rate exceeds the low threshold, entry into a transition segment is marked; within a transition segment, if both parameters fall back below the low threshold, a silence segment is entered; within a transition segment, if either parameter exceeds the high threshold, a speech segment is considered to have begun; within a speech segment, if both parameters drop below the low threshold for longer than a predetermined threshold, the speech segment is considered to have ended.
4. The method according to claim 1, characterized in that, in step S20, analyzing the statistical properties of the spectral envelope of each subband in the frequency domain comprises the following steps:
First, divide the audio signal into several subbands;
Then, apply band-pass filtering within the frequency range of each subband to obtain the audio signal of each subband;
Next, apply the Hilbert transform to each subband signal to obtain the spectral envelope of each subband;
Finally, analyze the statistical properties of the envelope signals of the subband containing obvious formant characteristics and of the subband containing more noise components.
5. The method according to claim 4, characterized in that the statistical properties of the spectral-envelope signal comprise the mean and variance of the envelope; the specific features computed are: the envelope variance of the subband containing obvious formant characteristics; and the difference between the envelope means of the subband containing obvious formant characteristics and the subband containing more noise components.
6. The method according to claim 1, characterized in that, in step S20, analyzing the subband entropy characteristics in the frequency domain comprises the following steps:
First, in long-span mode, compute the entropy at each frequency of the current frame using the current frame and several adjacent frames;
Then, compute the mean and variance of the entropy within a particular subband range to determine the complexity of the current speech frame.
7. The method according to claim 1, wherein in step S20, the step of further rejecting part of the non-speech signals in the audio signal according to the spectral envelope statistics and the entropy characteristics of each subband comprises the following steps:
Step S21: for each frame of the audio signal, first applying high-pass filtering to remove power-frequency interference, and then windowing the high-pass-filtered audio signal;
Step S22: dividing the windowed audio signal into N frequency bands, and performing band-pass filtering on the audio signal within each of these bands to obtain the audio signals of N subbands;
Step S23: applying a Hilbert transform to the audio signal of each subband to obtain the corresponding spectral envelope signal;
Step S24: performing statistical analysis on the spectral envelope signals obtained in step S23 to produce a spectral-envelope decision output;
Step S25: computing the Fourier magnitude spectrum of the current audio frame and several adjacent frames to obtain the Fourier magnitude of each frequency bin in the different frames; for each frequency bin, computing the entropy of the current frame at that bin using the adjacent frames; and, within the subband range containing prominent formant characteristics, computing the variance of the entropy over the frequency bins as the long-span decision output;
Step S26: fusing the two decision outputs obtained in steps S24 and S25 into a combined decision to obtain the final frequency-domain decision result; if the frequency-domain decision result is above a threshold, the frame is labeled as a speech frame; if below the threshold, the frame is labeled as a non-speech frame.
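The preprocessing of step S21 might look like the following sketch. The one-pole high-pass filter and the Hamming window are common choices I am assuming for illustration; the patent does not specify the filter design or window type.

```python
import math

def high_pass(x, alpha=0.95):
    """First-order high-pass y[n] = x[n] - x[n-1] + alpha*y[n-1],
    attenuating DC and low-frequency power-line interference (step S21)."""
    y, prev_x, prev_y = [], 0.0, 0.0
    for s in x:
        cur = s - prev_x + alpha * prev_y
        y.append(cur)
        prev_x, prev_y = s, cur
    return y

def hamming(n):
    """Hamming window of length n, applied to each frame after filtering."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def preprocess_frame(frame):
    w = hamming(len(frame))
    return [a * b for a, b in zip(high_pass(frame), w)]

# A constant (DC) frame is driven toward zero by the high-pass stage.
out = preprocess_frame([1.0] * 400)
```

After this stage the frame is ready for the subband filtering of step S22; the decision fusion of step S26 then reduces to comparing a weighted sum of the two decision outputs against the threshold.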
8. The method according to claim 1, wherein step S40 specifically comprises:
for each audio segment to be screened, computing the mean of the mel-frequency cepstral coefficients of the frames in the segment, feeding the resulting mean parameters respectively into a speech Gaussian mixture model and into various non-speech Gaussian mixture models, and making a segment-level decision on whether the audio segment contains speech data according to the output probability of each model, thereby obtaining the final speech detection result.
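The segment-level decision of claim 8 amounts to comparing the likelihood of the segment's MFCC-mean vector under each model and picking the most probable one. A toy diagonal-covariance sketch follows; the model parameters and the 2-D "MFCC mean" are invented for illustration, not trained values.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of vector x under a diagonal-covariance Gaussian
    mixture model with the given component weights, means, and variances."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        log_comp = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_comp += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        total += math.exp(log_comp)
    return math.log(total)

def classify_segment(mfcc_mean, models):
    """Pick the model (speech, or one of the non-speech classes) with the
    highest output probability, as in step S40 / claim 8."""
    return max(models, key=lambda name: gmm_log_likelihood(mfcc_mean, *models[name]))

# Hypothetical single-component models over a 2-D "MFCC mean" vector.
models = {
    "speech":     ([1.0], [[0.0, 0.0]], [[1.0, 1.0]]),
    "non_speech": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}
label = classify_segment([0.3, -0.2], models)
```

In the patent's setting the input vector would be the 3*M-dimensional MFCC mean of claim 9, and there would be one model per non-speech category rather than a single "non_speech" entry.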
9. The method according to claim 1, wherein the training of the Gaussian mixture models in step S40 specifically comprises:
Step S41: pre-filtering the audio of the entire training audio corpus, performing time-domain and frequency-domain analysis on the audio signals using the methods of step S10 and step S20 respectively, and rejecting part of the non-speech signals therein, so that the subsequent steps train only on the remaining audio signals to be screened;
Step S42: classifying the filtered audio signals according to their audio-category labels, i.e., dividing the filtered audio signals into speech signals and non-speech signals;
Step S43: extracting mel-frequency cepstral coefficients from the classified audio signals frame by frame, first extracting M static parameters and then computing their first-order and second-order differences respectively, so that 3*M-dimensional parameters are finally obtained; grouping consecutive frames with similar features into audio segments using the method of step S30; and computing the mean of the mel-frequency cepstral coefficients of the frames in each segment as the feature parameters for training the Gaussian mixture models;
Step S44: training Gaussian mixture models separately for the speech signals and for the different categories of non-speech signals, i.e., determining the weight, mean, and variance of each Gaussian component in the different Gaussian mixture models by iterative EM training.
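The EM training of step S44 can be sketched for the 1-D, two-component case. This is a toy rendering of the standard EM updates, not the patent's implementation: real training would operate on the 3*M-dimensional MFCC means, with one mixture per audio category.

```python
import math

def em_gmm_1d(data, means, variances=None, weights=None, iters=50):
    """EM for a 1-D Gaussian mixture: alternately compute responsibilities
    (E-step) and re-estimate each component's weight, mean, and variance
    (M-step), the three quantities named in step S44."""
    k = len(means)
    means = list(means)
    variances = list(variances or [1.0] * k)
    weights = list(weights or [1.0 / k] * k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        resp = []
        for x in data:
            dens = [w / math.sqrt(2 * math.pi * v) * math.exp(-(x - m) ** 2 / (2 * v))
                    for w, m, v in zip(weights, means, variances)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate parameters from the responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-4)  # variance floor to keep components from collapsing
    return weights, means, variances

# Two well-separated deterministic clusters, around 0 and around 10.
data = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2]
w, m, v = em_gmm_1d(data, means=[1.0, 9.0])
```

With reasonable initial means, EM converges to one component per cluster; the variance floor is a practical safeguard I have added, since a component locked onto a single point would otherwise drive its variance to zero.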
CN201310743203.5A 2013-12-30 2013-12-30 A kind of speech detection method efficiently Active CN103646649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310743203.5A CN103646649B (en) 2013-12-30 2013-12-30 A kind of speech detection method efficiently

Publications (2)

Publication Number Publication Date
CN103646649A CN103646649A (en) 2014-03-19
CN103646649B true CN103646649B (en) 2016-04-13

Family

ID=50251851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310743203.5A Active CN103646649B (en) 2013-12-30 2013-12-30 A kind of speech detection method efficiently

Country Status (1)

Country Link
CN (1) CN103646649B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102214888B1 (en) * 2016-10-12 2021-02-15 Advanced New Technologies Co., Ltd. Method and device for detecting an audio signal

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104318927A (en) * 2014-11-04 2015-01-28 东莞市北斗时空通信科技有限公司 Anti-noise low-bitrate speech coding method and decoding method
CN104464722B (en) * 2014-11-13 2018-05-25 北京云知声信息技术有限公司 Voice activity detection method and apparatus based on time domain and frequency domain
CN104934043A (en) * 2015-06-17 2015-09-23 广东欧珀移动通信有限公司 Audio processing method and device
CN105118522B (en) * 2015-08-27 2021-02-12 广州市百果园网络科技有限公司 Noise detection method and device
CN105788592A (en) * 2016-04-28 2016-07-20 乐视控股(北京)有限公司 Audio classification method and apparatus thereof
CN105843400A (en) * 2016-05-05 2016-08-10 广东小天才科技有限公司 Somatosensory interaction method and device and wearable device
CN106020445A (en) * 2016-05-05 2016-10-12 广东小天才科技有限公司 Method for automatically identifying wearing by left hand and right hand and wearing equipment
CN107919116B (en) * 2016-10-11 2019-09-13 芋头科技(杭州)有限公司 A kind of voice-activation detecting method and device
KR102179511B1 (en) 2016-10-14 2020-11-16 코우리츠 다이가꾸 호우진 오사카 Swallowing diagnostic device and program
CN107957918B (en) * 2016-10-14 2019-05-10 腾讯科技(深圳)有限公司 Data reconstruction method and device
CN106548782A (en) * 2016-10-31 2017-03-29 维沃移动通信有限公司 The processing method and mobile terminal of acoustical signal
CN106653047A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Automatic gain control method and device for audio data
CN106782508A (en) * 2016-12-20 2017-05-31 美的集团股份有限公司 The cutting method of speech audio and the cutting device of speech audio
CN107039035A (en) * 2017-01-10 2017-08-11 上海优同科技有限公司 A kind of detection method of voice starting point and ending point
CN107045870B (en) * 2017-05-23 2020-06-26 南京理工大学 Speech signal endpoint detection method based on characteristic value coding
CN107910017A (en) * 2017-12-19 2018-04-13 河海大学 A kind of method that threshold value is set in noisy speech end-point detection
CN108269566B (en) * 2018-01-17 2020-08-25 南京理工大学 Rifling wave identification method based on multi-scale sub-band energy set characteristics
CN109036470B (en) * 2018-06-04 2023-04-21 平安科技(深圳)有限公司 Voice distinguishing method, device, computer equipment and storage medium
CN108831508A (en) * 2018-06-13 2018-11-16 百度在线网络技术(北京)有限公司 Voice activity detection method, device and equipment
CN109147795B (en) * 2018-08-06 2021-05-14 珠海全志科技股份有限公司 Voiceprint data transmission and identification method, identification device and storage medium
CN109347580B (en) * 2018-11-19 2021-01-19 湖南猎航电子科技有限公司 Self-adaptive threshold signal detection method with known duty ratio
CN111261143B (en) * 2018-12-03 2024-03-22 嘉楠明芯(北京)科技有限公司 Voice wakeup method and device and computer readable storage medium
CN109448750B (en) * 2018-12-20 2023-06-23 西京学院 Speech enhancement method for improving speech quality of biological radar
CN109801646B (en) * 2019-01-31 2021-11-16 嘉楠明芯(北京)科技有限公司 Voice endpoint detection method and device based on fusion features
CN111916068B (en) * 2019-05-07 2024-07-23 北京地平线机器人技术研发有限公司 Audio detection method and device
CN110097895B (en) * 2019-05-14 2021-03-16 腾讯音乐娱乐科技(深圳)有限公司 Pure music detection method, pure music detection device and storage medium
CN110349597B (en) * 2019-07-03 2021-06-25 山东师范大学 Voice detection method and device
CN110600010B (en) * 2019-09-20 2022-05-17 度小满科技(北京)有限公司 Corpus extraction method and apparatus
CN110636176B (en) * 2019-10-09 2022-05-17 科大讯飞股份有限公司 Call fault detection method, device, equipment and storage medium
CN111415685A (en) * 2020-03-26 2020-07-14 腾讯科技(深圳)有限公司 Audio signal detection method, device, equipment and computer readable storage medium
CN111398944B (en) * 2020-04-09 2022-05-17 浙江大学 Radar signal processing method for identity recognition
CN111883182B (en) * 2020-07-24 2024-03-19 平安科技(深圳)有限公司 Human voice detection method, device, equipment and storage medium
CN112466331A (en) * 2020-11-11 2021-03-09 昆明理工大学 Voice music classification model based on beat spectrum characteristics
CN112562735B (en) * 2020-11-27 2023-03-24 锐迪科微电子(上海)有限公司 Voice detection method, device, equipment and storage medium
CN112528920A (en) * 2020-12-21 2021-03-19 杭州格像科技有限公司 Pet image emotion recognition method based on depth residual error network
CN112767920A (en) * 2020-12-31 2021-05-07 深圳市珍爱捷云信息技术有限公司 Method, device, equipment and storage medium for recognizing call voice
CN113160853A (en) * 2021-03-31 2021-07-23 深圳鱼亮科技有限公司 Voice endpoint detection method based on real-time face assistance
CN113192488B (en) * 2021-04-06 2022-05-06 青岛信芯微电子科技股份有限公司 Voice processing method and device
CN113541867A (en) * 2021-06-30 2021-10-22 南京奥通智能科技有限公司 Remote communication module for converged terminal
CN113593599A (en) * 2021-09-02 2021-11-02 北京云蝶智学科技有限公司 Method for removing noise signal in voice signal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197130A (en) * 2006-12-07 2008-06-11 华为技术有限公司 Sound activity detecting method and detector thereof
CN102473412A (en) * 2009-07-21 2012-05-23 日本电信电话株式会社 Audio signal section estimateing apparatus, audio signal section estimateing method, program therefor and recording medium
CN103165127A (en) * 2011-12-15 2013-06-19 佳能株式会社 Sound segmentation equipment, sound segmentation method and sound detecting system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100513175B1 (en) * 2002-12-24 2005-09-07 한국전자통신연구원 A Voice Activity Detector Employing Complex Laplacian Model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech Activation Detection Algorithm Combining Model and Energy for Speaker Recognition; Zhang Zhao, Guo Wu; Journal of Chinese Computer Systems; 2010-09-30; Vol. 31, No. 9; pp. 1914-1917 *
Research on Voice Activity Detection Algorithms and Their Application in Speech Coders; Shen Hongli; Wanfang Data; 2012-04-26; Sections 2.2.3 and 3.3 *

Also Published As

Publication number Publication date
CN103646649A (en) 2014-03-19

Similar Documents

Publication Publication Date Title
CN103646649B (en) A kind of speech detection method efficiently
CN103854662B (en) Adaptive voice detection method based on multiple domain Combined estimator
Evangelopoulos et al. Multiband modulation energy tracking for noisy speech detection
CN103489446B (en) Based on the twitter identification method that adaptive energy detects under complex environment
CN101197130B (en) Sound activity detecting method and detector thereof
Meyer et al. Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition
US20090076814A1 (en) Apparatus and method for determining speech signal
CN104318927A (en) Anti-noise low-bitrate speech coding method and decoding method
CN104157290A (en) Speaker recognition method based on depth learning
CN104008751A (en) Speaker recognition method based on BP neural network
CN103489454A (en) Voice endpoint detection method based on waveform morphological characteristic clustering
Ghaemmaghami et al. Noise robust voice activity detection using features extracted from the time-domain autocorrelation function
CN110136709A (en) Audio recognition method and video conferencing system based on speech recognition
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
Zhang et al. Fault diagnosis method based on MFCC fusion and SVM
Chu et al. A noise-robust FFT-based auditory spectrum with application in audio classification
Thomas et al. Acoustic and data-driven features for robust speech activity detection
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
Papadopoulos et al. Global SNR Estimation of Speech Signals for Unknown Noise Conditions Using Noise Adapted Non-Linear Regression.
CN110265049A (en) A kind of audio recognition method and speech recognition system
TWI749547B (en) Speech enhancement system based on deep learning
Park et al. Frequency of Interest-based Noise Attenuation Method to Improve Anomaly Detection Performance
Pasad et al. Voice activity detection for children's read speech recognition in noisy conditions
CN115662464B (en) Method and system for intelligently identifying environmental noise

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20170508

Address after: 100094, No. 4, building A, No. 1, building 2, wing Cheng North Road, No. 405-346, Beijing, Haidian District

Patentee after: Beijing Rui Heng Heng Xun Technology Co., Ltd.

Address before: 100190 Zhongguancun East Road, Beijing, No. 95, No.

Patentee before: Institute of Automation, Chinese Academy of Sciences

TR01 Transfer of patent right

Effective date of registration: 20181218

Address after: 100190 Zhongguancun East Road, Haidian District, Haidian District, Beijing

Patentee after: Institute of Automation, Chinese Academy of Sciences

Address before: 100094 No. 405-346, 4th floor, Building A, No. 1, Courtyard 2, Yongcheng North Road, Haidian District, Beijing

Patentee before: Beijing Rui Heng Heng Xun Technology Co., Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190528

Address after: 310019 1105, 11 / F, 4 building, 9 Ring Road, Jianggan District nine, Hangzhou, Zhejiang.

Patentee after: Limit element (Hangzhou) intelligent Polytron Technologies Inc

Address before: 100190 Zhongguancun East Road, Haidian District, Haidian District, Beijing

Patentee before: Institute of Automation, Chinese Academy of Sciences

TR01 Transfer of patent right
CP01 Change in the name or title of a patent holder

Address after: 310019 1105, 11 / F, 4 building, 9 Ring Road, Jianggan District nine, Hangzhou, Zhejiang.

Patentee after: Zhongke extreme element (Hangzhou) Intelligent Technology Co., Ltd

Address before: 310019 1105, 11 / F, 4 building, 9 Ring Road, Jianggan District nine, Hangzhou, Zhejiang.

Patentee before: Limit element (Hangzhou) intelligent Polytron Technologies Inc.

CP01 Change in the name or title of a patent holder