CN106024017A - Voice detection method and device - Google Patents
- Publication number: CN106024017A (application CN201510119374.XA)
- Authority
- CN
- China
- Prior art keywords
- cepstrum
- speech detection
- frame
- voiced frame
- voiced
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a speech detection method and device. The method comprises the steps of: performing overlapping framing on a collected sound signal to obtain a plurality of corresponding sound frames; windowing the obtained sound frames; converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame; converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum; calculating the cepstrum distance between the cepstra of two adjacent sound frames; and performing speech detection on the collected signal when the calculated cepstrum distance is greater than a preset distance threshold. This scheme can save the time spent on speech detection.
Description
Technical field
The present invention relates to the technical field of speech detection, and in particular to a speech detection method and device.
Background technology
A mobile terminal is a computing device that can be used while moving; in a broad sense it includes mobile phones, notebooks, tablet computers, POS terminals, vehicle-mounted computers, and the like. With the rapid development of integrated-circuit technology, mobile terminals have acquired powerful processing capability and have evolved from simple calling tools into integrated information-processing platforms, which opens up broader development space for them.
Using a mobile terminal usually requires a certain amount of the user's attention. Today's mobile terminals are equipped with touch screens, and the user must touch the screen to perform the corresponding operation. However, when the user cannot touch the device, operating the mobile terminal becomes highly inconvenient, for example when the user is driving a vehicle or is carrying articles in both hands.
Speech detection methods and always-listening systems (Always Listening System) make it possible to activate and operate a mobile terminal without using the hands. When the always-listening system detects a sound signal, the speech detection system is activated and the detected sound signal is recognized; the mobile terminal then performs the corresponding operation according to the recognized sound signal. For example, when the user speaks the command "dial the mobile phone of XX", the mobile terminal recognizes the voice message "dial the mobile phone of XX" entered by the user, and after correct recognition retrieves the information of XX's phone number from the mobile terminal and dials it.
However, when a speech detection method of the prior art is applied in an always-listening system, the system must remain switched on at all times in order to detect the user's voice activity, which makes it very time-consuming.
Summary of the invention
The problem solved by the embodiments of the present invention is that speech detection is excessively time-consuming.
To solve the above problem, an embodiment of the present invention provides a speech detection method. The speech detection method includes:
performing overlapping framing on a collected sound signal to obtain a plurality of corresponding sound frames;
windowing the obtained sound frames;
converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame;
converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;
calculating the cepstrum distance between the cepstra of two adjacent sound frames; and
performing speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.
Optionally, converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame comprises: applying a fast Fourier transform to the windowed sound frames to obtain the spectrum corresponding to each sound frame.
Optionally, converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum comprises computing the cepstrum coefficients by a formula in which c denotes the cepstrum coefficient, S(w) denotes the spectrum of the sound frame, and α is a preset correction term.
Optionally, calculating the cepstrum distance between the cepstra of two adjacent sound frames comprises computing a distance D in which j denotes the index of a frequency bin in the sound frame, a_j and b_j denote the cepstra of the two adjacent sound frames, and k denotes the number of frequency bins.
Optionally, the number of frequency bins per sound frame is 32.
Optionally, the duration of the collected sound signal is 200 ms to 1 s.
Optionally, the distance threshold is obtained by pre-emphasizing a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.
An embodiment of the present invention further provides a speech detection device, the device comprising:
a framing unit, adapted to perform overlapping framing on a collected sound signal to obtain a plurality of corresponding sound frames;
a windowing unit, adapted to window the obtained sound frames;
a frequency-domain conversion unit, adapted to convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame;
a cepstral-domain conversion unit, adapted to convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;
a calculation unit, adapted to calculate the cepstrum distance between the cepstra of two adjacent sound frames; and
a speech detection unit, adapted to perform speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.
Optionally, the frequency-domain conversion unit is adapted to apply a fast Fourier transform to the windowed sound frames to obtain the spectrum corresponding to each sound frame.
Optionally, the number of frequency bins per sound frame is 32.
Optionally, the duration of the collected sound signal is 200 ms to 1 s.
Optionally, the distance threshold is obtained by pre-emphasizing a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.
Compared with the prior art, the technical scheme of the present invention has the following advantages:
Whether to perform speech detection on the input sound signal is determined by calculating the cepstrum distance between the cepstra of adjacent sound frames. Because the operation of calculating the cepstrum distance between different sound frames is relatively simple, computing resources and speech detection time can be saved.
Further, because the number of frequency bins in each sound frame is 32, good speech detection performance can be obtained while the computational cost is kept low.
Brief description of the drawings
Fig. 1 is a flowchart of a speech detection method in an embodiment of the present invention;
Fig. 2 is a flowchart of another speech detection method in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the simulation results for the speech recognition accuracy of the speech detection method in an embodiment of the present invention under different clean-speech conditions;
Fig. 4 is a schematic diagram of the simulation results for the speech recognition accuracy of the speech detection method of the ITU-T G.729B standard under different clean-speech conditions;
Fig. 5 is a schematic diagram of the simulation results for the speech recognition accuracy of the statistical-model-based VAD under different clean-speech conditions;
Fig. 6 is a schematic diagram of the simulation results for the speech recognition accuracy of the VAD based on long-term speech information under different clean-speech conditions;
Fig. 7 is a schematic diagram of the simulation results for the speech recognition accuracy of the speech detection method in an embodiment of the present invention under white-noise conditions;
Fig. 8 is a schematic diagram of the simulation results for the speech recognition accuracy of the speech detection method of the ITU-T G.729B standard under white-noise conditions;
Fig. 9 is a schematic diagram of the simulation results for the speech recognition accuracy of the statistical-model-based VAD under white-noise conditions;
Fig. 10 is a schematic diagram of the simulation results for the speech recognition accuracy of the VAD based on long-term speech information under white-noise conditions;
Fig. 11 is a schematic structural diagram of a speech detection device in an embodiment of the present invention.
Detailed description of the invention
Always-listening systems in the prior art use voice activity detection (Voice Activity Detection, VAD) technology to detect sound.
The voice activity detection method most commonly used in the GSM standard updates the background noise estimate during noise intervals. This frequency-domain-based method generally uses a feature vector that includes the linear prediction spectrum, full-band energy, low-band (0-1 kHz) energy, and the zero-crossing rate. Specifically, after the input sound signal has been passed through a filter bank, the sound level in each band is calculated, and a predictive model submodule determines a probability, or determines whether the energy level of the current frame exceeds the stored noise level. Such voice activity detection methods usually require a reliable submodule to update and store the noise model.
To address this problem, there are methods that further improve the above voice activity detection by dynamically tracking the power envelope to estimate the noise spectrum. One of them uses receiver operating characteristic curves to compare, under some representative noise conditions, whether the non-speech false-alarm rate decreases and the speech hit rate increases relative to the original voice activity detection method. Another prior-art speech detection method constructs a cumbersome voice activity detection scheme with six elaborate rules.
The above voice activity detection methods can show excellent performance under specific conditions and on specific platforms. However, when they are applied in an always-listening system, the system must be kept on at all times to detect the input sound signal, which consumes computing resources and computing time.
To solve the above problems in the prior art, the embodiments of the present invention adopt the technical scheme described below.
To make the above objects, features and advantages of the present invention more comprehensible, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a speech detection method in an embodiment of the present invention. As shown in Fig. 1, the speech detection method may include:
Step S101: perform overlapping framing on the collected sound signal to obtain a plurality of corresponding sound frames.
In a specific implementation, in order to process the collected sound signal, the signal may first be divided into overlapping frames to obtain a plurality of sound frames. Framing the collected sound signal is, in essence, short-time analysis: the sound signal is divided into short segments of fixed duration, each of which is a relatively stationary sound clip.
In a specific implementation, two adjacent sound frames partially overlap, and the overlap ratio can be chosen according to the actual situation.
Step S102: window the obtained sound frames.
In a specific implementation, a window function commonly used in speech signal processing, such as a Hamming window, a Hanning window, or a rectangular window, may be chosen; the frame length may be 10-40 ms, with 20 ms being a typical value.
In a specific implementation, framing a speech signal damages the naturalness of the sound signal; this problem can be alleviated by applying windowing and related processing to the sound frames.
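Steps S101 and S102 can be sketched in a few lines of NumPy. This is an illustrative reading of the embodiment, not the patent's reference implementation; the frame length, hop size, and choice of Hamming window are the typical values mentioned above.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Overlapping framing: split x into frames of frame_len samples,
    with consecutive frames starting hop samples apart (hop < frame_len
    gives partially overlapping frames)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def window_frames(frames):
    """Windowing: multiply every frame by a Hamming window to soften
    the frame edges introduced by framing."""
    return frames * np.hamming(frames.shape[1])

# 1 s of a toy 440 Hz tone at 8 kHz; 20 ms frames (160 samples), 50% overlap
fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
frames = frame_signal(x, frame_len=160, hop=80)
windowed = window_frames(frames)
```

With these parameters, one second of signal yields 99 partially overlapping frames of 160 samples each.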
Step S103: convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame.
In a specific implementation, the collected sound signal is, in theory, time-varying and therefore a non-stationary process, so it cannot be converted to the frequency domain directly. However, because the collected sound signal has been framed (short-time analysis), the sound signal of each frame can be regarded as quasi-stationary, and the frequency-domain conversion can therefore be applied. The spectrum obtained for each sound frame describes the relationship between frequency and energy.
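The frequency-domain conversion of step S103 can be sketched as a per-frame magnitude spectrum via the FFT. This is a sketch assuming NumPy and the 256-point transform used later in the embodiment, not the patent's own code.

```python
import numpy as np

def frame_spectra(windowed_frames, n_fft=256):
    """Frequency-domain conversion: real FFT of each windowed frame.
    Frames shorter than n_fft are zero-padded at the end by np.fft.rfft.
    Returns the magnitude spectrum of every frame."""
    return np.abs(np.fft.rfft(windowed_frames, n=n_fft, axis=1))

# toy windowed frames: 99 frames of 160 samples each
frames = np.random.default_rng(0).standard_normal((99, 160))
spec = frame_spectra(frames)   # one row of n_fft // 2 + 1 = 129 bins per frame
```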
Step S104: convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum.
In a specific implementation, the cepstral-domain conversion may be applied to each sound frame signal obtained after the framing, windowing and frequency-domain conversion.
In a specific implementation, the cepstrum is the inverse Fourier transform (Inverse Fourier Transform, IFT) of the logarithm of the power spectrum. It turns a complicated convolution relationship into a simple linear superposition, so that the frequency-group components of each sound frame signal can be identified relatively easily in the cepstrum.
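As a sketch of step S104, the cepstrum described above (the inverse Fourier transform of the log power spectrum) can be computed as below. The patent's exact formula, including its correction term α, is not reproduced in the source, so a small floor `eps` stands in for it here; this is the textbook real cepstrum, not necessarily the precise formula of the embodiment.

```python
import numpy as np

def real_cepstrum(mag_spec, n_coeffs=32, eps=1e-10):
    """Cepstral-domain conversion: inverse FFT of the log power spectrum.
    eps is a hypothetical stand-in for the patent's preset correction
    term alpha. Only the first n_coeffs quefrency coefficients are kept
    (the embodiment uses 32 per frame)."""
    log_power = np.log(mag_spec ** 2 + eps)
    return np.fft.irfft(log_power, axis=1)[:, :n_coeffs]

# toy magnitude spectra for 10 frames (256-point FFT -> 129 bins)
spec = np.abs(np.fft.rfft(np.random.default_rng(1).standard_normal((10, 160)),
                          n=256, axis=1))
ceps = real_cepstrum(spec)
```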
Step S105: calculate the cepstrum distance between the cepstra of adjacent sound frames.
In a specific implementation, the cepstrum distance between the cepstra of adjacent sound frames is calculated in order to determine whether to perform speech detection on the collected sound signal. Compared with the prior-art approach of computing the spectral energy of adjacent sound frames (the current sound frame and a delayed sound frame with a preset time delay), calculating the cepstrum distance between the cepstra of adjacent sound frames reduces the computational complexity and therefore saves computing resources and computing time.
Step S106: when the calculated cepstrum distance is greater than the preset distance threshold, perform speech detection on the collected sound signal.
In a specific implementation, a calculated cepstrum distance greater than the preset distance threshold indicates that the input sound signal contains a speech signal; at this point, speech detection can be performed on the collected sound signal to recognize the speech signal in it.
Fig. 2 shows a flowchart of another speech detection method in an embodiment of the present invention. As shown in Fig. 2, the speech detection method may include:
Step S201: frame and window the sound signal of a preset duration.
In a specific implementation, the input sound signal may first be divided into overlapping frames to obtain a frame-by-frame signal. The frame length may be chosen as 20 ms, with adjacent sound frames overlapping by half. A 256-point Hamming window may then be applied to the framed sound frames: with a sampling rate of 8 kHz, a frame length of 20 ms, and an inter-frame overlap of 50%, one frame of the sound signal has 160 samples, and zero-padding the end of the signal yields 256 samples.
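The numbers in step S201 fit together as follows. This is a worked check of the stated parameters; the reading that the 256-point Hamming window is applied after zero-padding is one plausible interpretation of the text.

```python
import numpy as np

fs = 8000                       # sampling rate: 8 kHz
frame_len = int(fs * 0.020)     # 20 ms frame -> 160 samples
hop = frame_len // 2            # 50% inter-frame overlap -> 80 samples
n_fft = 256                     # FFT size after zero-padding

frame = np.ones(frame_len)                       # stand-in for one frame
padded = np.pad(frame, (0, n_fft - frame_len))   # append 96 zeros: 160 -> 256
windowed = padded * np.hamming(n_fft)            # 256-point Hamming window
```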
In a specific implementation, the time delay between adjacent sound frames plays an important role in the calculation of the cepstrum distance. If the delay is set too short, longer sound signals with a continuous spectrum may be misclassified; if it is set too long, speech detection requires a longer start-up time, and the spectrum vectors of more sound frames must be stored. In an embodiment of the present invention, the duration of the collected sound signal can be set to 200 ms to 1 s to improve the performance of speech detection.
In a specific implementation, when determining the time delay between different spectrum vectors, the following formula can be used to apply a simple z-transform to the delay and convert it to the frequency domain:

f(n) = x(n - m)  =>  F(z) = z^(-m) X(z)   (2)

where f(n) denotes the delayed sequence in the time domain, n denotes the index of the current sample, m denotes the index offset of a sample before the current sample, F(z) denotes the z-transform of f, and X(z) denotes the z-transform of x.
Step S202: apply FFT processing to the framed and windowed sound frames to obtain the spectrum corresponding to each sound frame.
In an embodiment of the present invention, a fast Fourier transform (Fast Fourier Transform, FFT) is applied to the framed and windowed sound frames to obtain the spectrum corresponding to each sound frame. The spectrogram of each sound frame describes the correspondence between frequency and amplitude.
Step S203: convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum.
In a specific implementation, the spectrum corresponding to each obtained sound frame is converted to the cepstral domain; the resulting cepstrum plot describes the correspondence between the quefrency (q) and the cepstrum coefficient (c). In an embodiment of the present invention, the cepstrum coefficients corresponding to the input sound frame signal are calculated by a formula in which c denotes the cepstrum coefficient, S(w) denotes the spectrum of the input sound frame, and α is a preset correction term.
Step S204: calculate the cepstrum distance between the cepstra of adjacent sound frames.
In a specific implementation, any of the distance formulas known in the prior art may be used to calculate the cepstrum distance between the cepstra of adjacent sound frames. In an embodiment of the present invention, the Manhattan distance (city-block distance) is used, where D denotes the cepstrum distance, j denotes the index of a frequency bin in the sound frame, a_j and b_j denote the cepstra of the two adjacent sound frames, and k denotes the number of frequency bins.
In an embodiment of the present invention, k is 32, so only 32 subtractions and 31 additions are needed to calculate the cepstrum distance between the cepstra of differently delayed frequency bands. This greatly reduces the computational complexity and saves computing resources.
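A minimal sketch of the city-block cepstrum distance of step S204, assuming the formula D = Σ_{j=1..k} |a_j - b_j| implied by the stated operation count (32 subtractions and 31 additions for k = 32):

```python
import numpy as np

def cepstrum_distance(a, b):
    """Manhattan (city-block) distance between the cepstra of two
    adjacent frames: k subtractions, k absolute values, k-1 additions."""
    return float(np.sum(np.abs(a - b)))

a = np.zeros(32)             # cepstrum of frame n (toy values)
b = np.full(32, 0.5)         # cepstrum of frame n+1 (toy values)
d = cepstrum_distance(a, b)  # 32 * |0 - 0.5| = 16.0
```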
It should be pointed out that as the number of adjacent-frame cepstrum distances used in the speech detection of a sound signal increases, the speech detection performance also improves. Practical analysis shows, however, that once more than 4 adjacent-frame cepstrum distances are used, the further improvement in speech detection performance is very small.
Step S205: judge whether the calculated cepstrum distance is greater than the preset distance threshold.
In a specific implementation, the distance threshold in the embodiment of the present invention is calculated adaptively and independently of the other parts of the embodiment. Practice has shown, however, that when the distance threshold is kept fixed, the speech detection performance of the method in the embodiment of the present invention is similar across a range of speakers and background noises.
To save resources and storage space, in an embodiment of the present invention, the distance threshold can be calculated under the conditions of pre-emphasizing a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.
It should be pointed out that the distance threshold can be set according to the actual needs of the end user.
In a specific implementation, when the judgment result is yes, step S206 is performed; when the judgment result is no, no action is performed.
Step S206: perform speech detection on the collected sound signal.
In a specific implementation, when the calculated cepstrum distance between the cepstra of adjacent sound frames is greater than the preset distance threshold, the collected input sound signal is deemed to include a speech signal, and speech detection can therefore be performed on the collected input sound signal.
In a specific implementation, when the voice message in the input sound signal has been recognized, the mobile terminal can perform the corresponding operation according to the recognized sound signal. For example, when the user speaks the command "dial the mobile phone of XX", the mobile terminal recognizes the voice message "dial the mobile phone of XX" entered by the user, and after correct recognition retrieves the information of XX's phone number from the mobile terminal and dials it.
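The decision of steps S205 and S206 can be sketched as below. The threshold value and the way the at-most-4 adjacent-frame distances are combined (here: any distance exceeding the threshold triggers detection) are assumptions for illustration, not details given in the source.

```python
import numpy as np

def should_run_speech_detection(frame_cepstra, threshold, max_pairs=4):
    """Return True when full speech detection should run, i.e. when the
    cepstrum distance between some pair of adjacent frames exceeds the
    preset threshold. At most max_pairs adjacent-frame distances are
    considered, reflecting the observation that using more than 4
    yields little further gain."""
    n_pairs = min(max_pairs, len(frame_cepstra) - 1)
    for i in range(n_pairs):
        d = np.sum(np.abs(frame_cepstra[i + 1] - frame_cepstra[i]))
        if d > threshold:
            return True
    return False

silence = np.zeros((5, 32))   # identical cepstra: no spectral change
speech = silence.copy()
speech[2] += 1.0              # abrupt cepstral change at frame 2
```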
The speech detection method in the embodiments of the present invention is compared below with the VAD techniques of the prior art and with the ITU-T G.729 standard.
Table 1:
Although the speech detection time may be affected by the coding technique, the comparison in Table 1 above shows that the computing time used by the speech detection method in the embodiment of the present invention is short. Compared with a typical prior-art speech detection method using frequency-domain processing, the method in the embodiment of the present invention saves more than 60% of the time; compared with the ITU-T standard, it saves more than 40% of the time.
Figs. 3-6 show schematic diagrams of the simulated speech recognition accuracy, under different clean-speech conditions, of the speech detection method in the embodiment of the present invention, the speech detection method of the ITU-T G.729B standard, the statistical-model-based VAD, and the VAD based on long-term speech information.
The comparison of Figs. 3-6 shows that the speech recognition accuracy of the method in the embodiment of the present invention can reach 90% and is not affected by whether or not the speaker is a native speaker.
Figs. 7-10 show schematic diagrams of the simulated speech recognition accuracy, under white-noise conditions, of the speech detection method in the embodiment of the present invention, the speech detection method of the ITU-T G.729B standard, the statistical-model-based VAD, and the VAD based on long-term speech information.
The comparison of Figs. 7-10 shows that, in noisy environments and at different signal-to-noise ratios, the speech recognition performance of the method in the embodiment of the present invention is higher than that of the prior-art method based on the ITU-T G.729B standard, especially in low-SNR noise environments. Compared with the other VADs, however, the performance of the method in the embodiment of the present invention is somewhat lower, because the method is comparatively simple. At the same time, the method saves 90% of the computing time while its performance remains at 85%-90%, which demonstrates the effectiveness and usability of the method and its suitability for performing speech detection in an always-listening system.
Fig. 11 shows a schematic structural diagram of a speech detection device in an embodiment of the present invention. As shown in Fig. 11, the speech detection device 1100 may include a framing unit 1101, a windowing unit 1102, a frequency-domain conversion unit 1103, a cepstral-domain conversion unit 1104, a calculation unit 1105 and a speech detection unit 1106, wherein:
the framing unit 1101 is adapted to perform overlapping framing on the collected sound signal to obtain a plurality of corresponding sound frames;
the windowing unit 1102 is adapted to window the obtained sound frames;
the frequency-domain conversion unit 1103 is adapted to convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame.
In a specific implementation, the frequency-domain conversion unit 1103 is adapted to apply a fast Fourier transform to the windowed sound frames to obtain the spectrum corresponding to each sound frame.
The cepstral-domain conversion unit 1104 is adapted to convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum.
The calculation unit 1105 is adapted to calculate the cepstrum distance between the cepstra of two adjacent sound frames.
The speech detection unit 1106 is adapted to perform speech detection on the collected sound signal when the calculated cepstrum distance is greater than the preset distance threshold.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the above embodiments may be completed by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include a ROM, a RAM, a magnetic disk, an optical disc, and the like.
The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention; the protection scope of the present invention shall therefore be subject to the scope defined by the claims.
Claims (12)
1. A speech detection method, characterized in that it comprises:
performing overlapping framing on a collected sound signal to obtain a plurality of corresponding sound frames;
windowing the obtained sound frames;
converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame;
converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;
calculating the cepstrum distance between the cepstra of two adjacent sound frames; and
performing speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.
2. The speech detection method according to claim 1, characterized in that converting the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame comprises: applying a fast Fourier transform to the windowed sound frames to obtain the spectrum corresponding to each sound frame.
3. The speech detection method according to claim 2, characterized in that converting the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum comprises computing the cepstrum coefficients by a formula in which c denotes the cepstrum coefficient, S(w) denotes the spectrum of the sound frame, and α is a preset correction term.
4. The speech detection method according to claim 1, characterized in that calculating the cepstrum distance between the cepstra of two adjacent sound frames comprises computing a distance D in which j denotes the index of a frequency bin in the sound frame, a_j and b_j denote the cepstra of the two adjacent sound frames, and k denotes the number of frequency bins.
5. The speech detection method according to claim 1, characterized in that the number of frequency bins per sound frame is 32.
6. The speech detection method according to claim 1, characterized in that the duration of the collected sound signal is 200 ms to 1 s.
7. The speech detection method according to claim 1, characterized in that the distance threshold is obtained by pre-emphasizing a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to sound frames with a frame length of 20 ms.
8. A speech detection device, characterized in that it comprises:
a framing unit, adapted to perform overlapping framing on a collected sound signal to obtain a plurality of corresponding sound frames;
a windowing unit, adapted to window the obtained sound frames;
a frequency-domain conversion unit, adapted to convert the windowed sound frames to the frequency domain to obtain the spectrum corresponding to each sound frame;
a cepstral-domain conversion unit, adapted to convert the spectrum corresponding to each obtained sound frame to the cepstral domain to obtain the corresponding cepstrum;
a calculation unit, adapted to calculate the cepstrum distance between the cepstra of two adjacent sound frames; and
a speech detection unit, adapted to perform speech detection on the collected sound signal when the calculated cepstrum distance is greater than a preset distance threshold.
9. The speech detection device according to claim 8, characterised in that the frequency-domain conversion unit is adapted to perform a fast Fourier transform on the windowed voiced frames to obtain the spectrum corresponding to each voiced frame.
10. The speech detection device according to claim 8, characterised in that the number of sampling frequency points of the voiced frame is 32.
11. The speech detection device according to claim 8, characterised in that the duration of the collected acoustic signal is 200 ms to 1 s.
12. The speech detection device according to claim 8, characterised in that the distance threshold is obtained by performing pre-emphasis on a sampled signal with a sampling frequency of 8 kHz and applying a 256-point Hamming window to voiced frames with a frame length of 20 ms.
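The units enumerated in device claim 8 map onto a straightforward processing chain: overlapping framing, windowing, FFT, cepstral-domain conversion, adjacent-frame cepstrum distance, and thresholding. A compact sketch follows; the 50% overlap, Hamming window, correction term, and threshold value are illustrative assumptions, not values from the claims:

```python
import numpy as np

def detect_speech_frames(signal, frame_len=160, hop=80, threshold=1.0, alpha=1e-8):
    """Return indices i of frame pairs (i-1, i) whose cepstrum distance
    exceeds the preset threshold, mirroring the units of claim 8."""
    # Framing unit: overlapping frames (50% overlap assumed via hop = frame_len/2)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Windowing unit, frequency-domain unit (FFT), cepstral-domain unit
    win = np.hamming(frame_len)
    cepstra = [np.fft.ifft(np.log(np.abs(np.fft.fft(f * win)) + alpha)).real
               for f in frames]
    # Computing unit + speech detection unit: distance vs. preset threshold
    hits = []
    for i in range(1, len(cepstra)):
        if np.linalg.norm(cepstra[i] - cepstra[i - 1]) > threshold:
            hits.append(i)
    return hits
```

On a signal that starts silent and then carries a tone, the silent-to-silent frame pairs yield zero distance while the silence-to-tone transition produces a large jump, which is the event the speech detection unit reacts to.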
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510119374.XA CN106024017A (en) | 2015-03-18 | 2015-03-18 | Voice detection method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510119374.XA CN106024017A (en) | 2015-03-18 | 2015-03-18 | Voice detection method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106024017A true CN106024017A (en) | 2016-10-12 |
Family
ID=57082366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510119374.XA Pending CN106024017A (en) | 2015-03-18 | 2015-03-18 | Voice detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106024017A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107393559A (en) * | 2017-07-14 | 2017-11-24 | 深圳永顺智信息科技有限公司 | The method and device of calibration voice detection results |
CN108470571A (en) * | 2018-03-08 | 2018-08-31 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of audio-frequency detection, device and storage medium |
CN109410971A (en) * | 2018-11-13 | 2019-03-01 | 无锡冰河计算机科技发展有限公司 | A kind of method and apparatus for beautifying sound |
CN111192600A (en) * | 2019-12-27 | 2020-05-22 | 北京网众共创科技有限公司 | Sound data processing method and device, storage medium and electronic device |
CN111433737A (en) * | 2017-12-04 | 2020-07-17 | 三星电子株式会社 | Electronic device and control method thereof |
CN113409827A (en) * | 2021-06-17 | 2021-09-17 | 山东省计算中心(国家超级计算济南中心) | Voice endpoint detection method and system based on local convolution block attention network |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101599269A (en) * | 2009-07-02 | 2009-12-09 | 中国农业大学 | Sound end detecting method and device |
CN103021405A (en) * | 2012-12-05 | 2013-04-03 | 渤海大学 | Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter |
CN103996399A (en) * | 2014-04-21 | 2014-08-20 | 深圳市北科瑞声科技有限公司 | Voice detection method and system |
- 2015-03-18: CN application CN201510119374.XA filed; published as CN106024017A/en; status: Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101599269A (en) * | 2009-07-02 | 2009-12-09 | 中国农业大学 | Sound end detecting method and device |
CN103021405A (en) * | 2012-12-05 | 2013-04-03 | 渤海大学 | Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter |
CN103996399A (en) * | 2014-04-21 | 2014-08-20 | 深圳市北科瑞声科技有限公司 | Voice detection method and system |
Non-Patent Citations (1)
Title |
---|
王帛 et al.: "A smoothed endpoint detection method for speech signals based on the short-time cepstral rate of change" (in Chinese), 《现代电子技术》 (Modern Electronics Technique) *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107393559A (en) * | 2017-07-14 | 2017-11-24 | 深圳永顺智信息科技有限公司 | The method and device of calibration voice detection results |
CN107393559B (en) * | 2017-07-14 | 2021-05-18 | 深圳永顺智信息科技有限公司 | Method and device for checking voice detection result |
CN111433737A (en) * | 2017-12-04 | 2020-07-17 | 三星电子株式会社 | Electronic device and control method thereof |
CN108470571A (en) * | 2018-03-08 | 2018-08-31 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of audio-frequency detection, device and storage medium |
CN108470571B (en) * | 2018-03-08 | 2020-09-08 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio detection method and device and storage medium |
CN109410971A (en) * | 2018-11-13 | 2019-03-01 | 无锡冰河计算机科技发展有限公司 | A kind of method and apparatus for beautifying sound |
CN109410971B (en) * | 2018-11-13 | 2021-08-31 | 无锡冰河计算机科技发展有限公司 | Method and device for beautifying sound |
CN111192600A (en) * | 2019-12-27 | 2020-05-22 | 北京网众共创科技有限公司 | Sound data processing method and device, storage medium and electronic device |
CN113409827A (en) * | 2021-06-17 | 2021-09-17 | 山东省计算中心(国家超级计算济南中心) | Voice endpoint detection method and system based on local convolution block attention network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106024017A (en) | Voice detection method and device | |
US8972255B2 (en) | Method and device for classifying background noise contained in an audio signal | |
US11475907B2 (en) | Method and device of denoising voice signal | |
Tsoukalas et al. | Speech enhancement based on audible noise suppression | |
CN101149928B (en) | Sound signal processing method, sound signal processing apparatus and computer program | |
CN103440872B (en) | The denoising method of transient state noise | |
EP2546831A1 (en) | Noise suppression device | |
CN107833581B (en) | Method, device and readable storage medium for extracting fundamental tone frequency of sound | |
WO2011091068A1 (en) | Distortion measurement for noise suppression system | |
CN109616098B (en) | Voice endpoint detection method and device based on frequency domain energy | |
CN101271686A (en) | Method and apparatus for estimating noise by using harmonics of voice signal | |
Kumar | Real-time performance evaluation of modified cascaded median-based noise estimation for speech enhancement system | |
Schwerin et al. | An improved speech transmission index for intelligibility prediction | |
KR100735343B1 (en) | Apparatus and method for extracting pitch information of a speech signal | |
CN101176149A (en) | Signal processing system for tonal noise robustness | |
CN106033669A (en) | Voice identification method and apparatus thereof | |
CN106920543B (en) | Audio recognition method and device | |
CN107564512B (en) | Voice activity detection method and device | |
CN106816157A (en) | Audio recognition method and device | |
Kumar | Mean-median based noise estimation method using spectral subtraction for speech enhancement technique | |
JP2010230814A (en) | Speech signal evaluation program, speech signal evaluation apparatus, and speech signal evaluation method | |
CN106297795A (en) | Audio recognition method and device | |
CN106340310B (en) | Speech detection method and device | |
CN114420165A (en) | Audio circuit testing method, device, equipment and storage medium | |
Flynn et al. | Combined speech enhancement and auditory modelling for robust distributed speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20161012 |
RJ01 | Rejection of invention patent application after publication |