CN107665711A - Voice activity detection method and device - Google Patents
Voice activity detection method and device
- Publication number
- CN107665711A (application number CN201610607277.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
A voice activity detection method and device. The method includes: dividing acquired sound data to be recognized into multiple overlapping frames and performing a fast Fourier transform (FFT) on each frame to obtain the corresponding spectrum; traversing the spectra of the overlapping frames and calculating the Shannon entropy energy of the spectral energy domain of the frame currently traversed; and, when the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than a preset threshold, determining that the current frame contains speech information. The scheme can improve the speed and accuracy of speech recognition.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a voice activity detection method and device.
Background
A mobile terminal is a computing device that can be used while on the move; in a broad sense this includes mobile phones, notebook computers, tablet computers, POS machines, in-vehicle computers and the like. With the rapid development of integrated circuit technology, mobile terminals have acquired powerful processing capabilities and have evolved from simple calling tools into integrated information processing platforms, which has opened up a broader space for their development. However, using a mobile terminal usually requires a certain amount of the user's attention. Today's mobile terminal devices are equipped with touch screens, which the user must touch to perform the corresponding operations. When the user cannot touch the device, for example while driving a vehicle or carrying articles in hand, operating the mobile terminal becomes highly inconvenient.
Speech recognition methods and always-listening systems (Always Listening System) make it possible to activate and operate a mobile terminal hands-free. When the always-listening system detects a sound signal, the speech recognition system is activated and recognizes the detected sound signal; the mobile terminal then performs the corresponding operation according to the recognized signal. For example, when the user speaks "dial XX's mobile phone", the mobile terminal recognizes the speech information "dial XX's mobile phone" and, after correct recognition, obtains XX's mobile phone number from the mobile terminal and dials it.
However, voice activity detection methods in the prior art generally use a preset mathematical model to perform speech recognition on the input sound data, and suffer from slow recognition speed and low accuracy.
Summary of the invention
The problem solved by embodiments of the present invention is how to improve the speed and accuracy of speech recognition.
To solve the above problem, an embodiment of the present invention provides a voice activity detection method, including: dividing acquired sound data to be recognized into multiple overlapping frames and performing a fast Fourier transform on each frame to obtain the corresponding spectrum; traversing the spectra of the overlapping frames and calculating the Shannon entropy energy of the spectral energy domain of the frame currently traversed; and, when the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than a preset threshold, determining that the current frame contains speech information.
Optionally, calculating the Shannon entropy energy of the spectral energy domain of the frame currently traversed includes:
$$H\left(|Y(w,t)|^{2}\right)=\left(-\sum_{w=1}^{\varepsilon}P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1}$$
where $H(|Y(w,t)|^{2})$ denotes the Shannon entropy energy of the spectral energy domain of the current frame t, $P(|Y(w,t)|^{2})$ denotes the probability of the amplitude spectrum of the current frame t in the corresponding frequency band w, $Y(w,t)$ denotes the noise type of frequency band w corresponding to the current frame t, and $\varepsilon$ denotes the number of frequency bands obtained by division.
Optionally, the preset threshold is associated with the noise spectrum characteristics of the sound data to be recognized.
Optionally, the preset threshold is calculated as follows: based on the Shannon entropy of the spectral energy domain of the overlapping frames, two corresponding Gaussian distribution functions are determined, where the two determined Gaussian distribution functions are used to model the Shannon entropy of the spectral energy domain of the overlapping frames; the threshold is then calculated using the determined Gaussian distribution functions.
Optionally, determining the two corresponding Gaussian distribution functions includes: determining the two corresponding Gaussian distribution functions using the expectation-maximization (EM) method.
An embodiment of the present invention further provides a voice activity detection device, including: a Fourier transform unit, adapted to divide acquired sound data to be recognized into multiple overlapping frames and perform a fast Fourier transform on each frame to obtain the corresponding spectrum; a first calculation unit, adapted to traverse the spectra of the overlapping frames and calculate the Shannon entropy energy of the spectral energy domain of the frame currently traversed; a judging unit, adapted to judge whether the Shannon entropy energy of the spectral energy domain of the current frame is greater than a preset threshold; and a determining unit, adapted to determine that the current frame contains speech information when the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than the threshold.
Optionally, the first calculation unit is adapted to calculate the Shannon entropy energy of the spectral energy domain of the frame currently traversed using the following formula:
$$H\left(|Y(w,t)|^{2}\right)=\left(-\sum_{w=1}^{\varepsilon}P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1}$$
where $H(|Y(w,t)|^{2})$ denotes the Shannon entropy energy of the spectral energy domain of the current frame t, $P(|Y(w,t)|^{2})$ denotes the probability of the amplitude spectrum of the current frame t in the corresponding frequency band w, $Y(w,t)$ denotes the noise type of frequency band w corresponding to the current frame t, and $\varepsilon$ denotes the number of frequency bands obtained by division.
Optionally, the preset threshold is associated with the spectral characteristics of the noise corresponding to the current sound data to be recognized.
Optionally, the device further includes: a second calculation unit, adapted to determine two corresponding Gaussian distribution functions based on the Shannon entropy of the spectral energy domain of the overlapping frames, where the two determined Gaussian distribution functions are used to model the Shannon entropy of the spectral energy domain of the overlapping frames, and to calculate the threshold using the determined Gaussian distribution functions.
Optionally, the second calculation unit is adapted to determine the two corresponding Gaussian distribution functions using the expectation-maximization (EM) method.
Compared with the prior art, the technical solution of the present invention has the following advantages:
In the above scheme, whether each frame contains speech information is determined from the result of comparing a preset threshold with the Shannon entropy energy of the spectral energy domain of each of the overlapping frames obtained by dividing the sound data to be recognized. Because the Shannon entropy energy of the spectral energy domain of a frame containing speech information is more regular than that of a frame containing only noise information, the Shannon entropy of the spectral energy domain can accurately identify whether each frame contains speech information, which improves the accuracy of voice activity detection. Moreover, because calculating the Shannon entropy energy of the spectral energy domain of each frame is simpler than building a mathematical model for speech recognition, computing resources are saved and the speed of voice activity detection is improved.
Brief description of the drawings
Fig. 1 is a flow chart of a voice activity detection method in an embodiment of the present invention;
Fig. 2 is a flow chart of another voice activity detection method in an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a voice activity detection device in an embodiment of the present invention.
Detailed description
A voice activity detection (Voice Activity Detection, VAD) method in the prior art divides the spectrum of the sound frame currently traversed into multiple non-overlapping sub-bands; calculates the root mean square (RMS) energy of the current sound frame from the spectral energies of its sub-bands; and, when the RMS energy of the current sound frame is determined to be greater than a preset threshold, determines that the current sound frame contains speech information.
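The prior-art rule described above can be sketched roughly as follows. This is a minimal illustration, not the prior-art implementation itself: the function names, the equal-width sub-band split and the way trailing bins are dropped are all assumptions made for the example.

```python
import math

def subband_energies(power_spectrum, num_subbands):
    """Sum the power spectrum over equal-width, non-overlapping sub-bands.

    Assumption: bins left over when the length is not a multiple of
    num_subbands are simply ignored.
    """
    size = len(power_spectrum) // num_subbands
    return [sum(power_spectrum[k * size:(k + 1) * size]) for k in range(num_subbands)]

def rms_energy(energies):
    """Root mean square of the sub-band energies of one frame."""
    return math.sqrt(sum(e * e for e in energies) / len(energies))

def is_speech_rms(power_spectrum, num_subbands, threshold):
    """Prior-art decision: speech if the RMS energy exceeds the threshold."""
    return rms_energy(subband_energies(power_spectrum, num_subbands)) > threshold
```

Because the decision depends directly on the energy level, a louder noise frame can cross the same threshold as a speech frame, which is the weakness discussed next.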
The above VAD method performs well when the noise varies more slowly than the speech tracking capability and the energy level of speech segments is higher than that of noise segments. When these conditions do not hold, however, speech detection accuracy is low.
To solve the above problems in the prior art, the technical solution adopted by embodiments of the present invention compares the Shannon entropy energy of the spectral energy domain of the current sound frame with a corresponding threshold to determine whether the current sound frame contains speech information, which can improve both the accuracy and the speed of voice activity detection.
To make the above objects, features and advantages of the present invention clearer and easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flow chart of a speech recognition method in an embodiment of the present invention. The speech recognition method shown in Fig. 1 may include the following steps:
Step S101: Divide the acquired sound data to be recognized into multiple overlapping frames, and perform a fast Fourier transform on each frame to obtain the corresponding spectrum.
In a specific implementation, when the sound data to be recognized is divided, the number of overlapping frames obtained and the amount of overlap between adjacent frames can be set according to actual needs.
Step S102: Traverse the spectra of the overlapping frames, and calculate the Shannon entropy energy of the spectral energy domain of the frame currently traversed.
In a specific implementation, the spectra of the overlapping frames obtained by division can be traversed in time order.
Step S103: When the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than a preset threshold, determine that the current frame contains speech information.
In a specific implementation, once the Shannon entropy energy of the spectral energy domain of each frame has been calculated, it can be compared with the preset threshold to judge whether it exceeds the threshold. If the Shannon entropy energy of the spectral energy domain of a frame is greater than the preset threshold, the frame is determined to contain speech information; otherwise, the frame is determined not to contain speech information.
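The comparison in step S103 can be sketched as below, assuming the per-frame Shannon entropy energies from step S102 are already available as a list in time order. The function name and the numeric values are hypothetical and serve only to illustrate the rule.

```python
def frames_with_speech(entropy_per_frame, threshold):
    """Return the indices of frames whose entropy energy exceeds the threshold."""
    return [t for t, h in enumerate(entropy_per_frame) if h > threshold]

# Hypothetical entropy energies for five consecutive frames, in time order
entropy_per_frame = [0.42, 0.45, 1.30, 1.25, 0.40]
speech_frames = frames_with_speech(entropy_per_frame, threshold=1.0)
```

With these illustrative values, only the third and fourth frames (indices 2 and 3) are flagged as containing speech information.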
The speech recognition method in the embodiment of the present invention is described in further detail below with reference to Fig. 2.
Fig. 2 shows a flow chart of another speech recognition method in an embodiment of the present invention. The speech recognition method shown in Fig. 2 may include the following steps:
Step S201: Perform overlapping framing on the acquired sound data to obtain the corresponding overlapping frames.
In a specific implementation, the collected sound signal may first undergo analog-to-digital conversion to obtain the corresponding sound data. The sound data may then be divided into overlapping frames. Framing the collected sound data is essentially short-time analysis of the sound data: the sound signal is divided into short time segments of fixed duration, each of which is a relatively stationary, sustained sound segment. Two adjacent sound frames partially overlap, and the extent of the overlap can be chosen according to actual conditions.
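As a minimal sketch of overlapping framing, the snippet below slices a sample list into fixed-length frames that advance by a hop smaller than the frame length. The 16 kHz sampling rate, the 50 % overlap and all names are assumptions chosen for illustration; the text leaves the overlap to be chosen according to actual conditions.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Return overlapping frames; neighbours share frame_len - hop_len samples.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if not 0 < hop_len <= frame_len:
        raise ValueError("hop_len must be in (0, frame_len]")
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop_len)]

SAMPLE_RATE = 16000                      # assumed sampling rate in Hz
FRAME_LEN = SAMPLE_RATE * 20 // 1000     # 20 ms frame -> 320 samples
HOP_LEN = FRAME_LEN // 2                 # assumed 50 % overlap -> 160 samples
frames = split_into_frames(list(range(1000)), FRAME_LEN, HOP_LEN)
```

Each frame here starts 160 samples after the previous one, so the second half of one frame is the first half of the next.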
Step S202: Apply windowing to the resulting overlapping frames.
In a specific implementation, window functions commonly used in speech signal processing, such as the Hamming window, the Hanning window or the rectangular window, can be selected. The frame length is chosen as 10 to 40 ms, with a typical value of 20 ms. Dividing the sound signal into frames destroys its naturalness; applying windowing, return processing and the like to the sound frames can solve this problem.
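The windowing step can be illustrated with the Hamming window, one of the options named above. The sketch uses the common symmetric definition w[n] = 0.54 - 0.46 cos(2πn / (N - 1)); the helper names are assumptions for this example.

```python
import math

def hamming_window(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def apply_window(frame, window):
    """Multiply a frame by a window, sample by sample."""
    return [x * w for x, w in zip(frame, window)]

window = hamming_window(5)
windowed = apply_window([1.0] * 5, window)
```

The window tapers the frame towards its edges (0.08 at both ends, 1.0 in the middle), which reduces the discontinuities that framing introduces at frame boundaries.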
Step S203: Perform a fast Fourier transform on the sound signal of each windowed frame to obtain the spectrum corresponding to each frame.
In a specific implementation, sound data in theory changes over time and is a non-stationary process, so a frequency-domain transform cannot be applied to it directly. However, because the sound data has been divided into frames (short-time analysis), the sound data within each frame can be regarded as relatively stationary, and the frequency-domain transform can therefore be applied.
In a specific implementation, the short-time Fourier transform (Short-Time Fourier Transform or Short-Term Fourier Transform, STFT) can be used to transform the sound data of each frame into the frequency domain to obtain the spectral information of each frame. The resulting spectrum contains the relationship between the frequency and the energy of the corresponding sound signal.
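To make the per-frame transform concrete, here is a small standard-library sketch of a recursive radix-2 FFT and the resulting power spectrum |Y(w)|². It is an illustration only: it assumes the frame length is a power of two, and in practice a library routine such as `numpy.fft.rfft` would normally be used instead.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def power_spectrum(frame):
    """Energy |Y(w)|^2 for the non-negative frequency bins of one frame."""
    spectrum = fft(frame)
    return [abs(c) ** 2 for c in spectrum[:len(frame) // 2 + 1]]

# A cosine with exactly two cycles in an 8-sample frame
frame = [math.cos(2 * math.pi * 2 * n / 8) for n in range(8)]
energy = power_spectrum(frame)
```

For this test frame all the energy lands in bin 2, which is the frequency-versus-energy relationship that the later entropy calculation operates on.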
Step S204: Traverse the frames obtained by division, and calculate the Shannon entropy energy of the spectral energy domain of the frame currently traversed.
In a specific implementation, the Shannon entropy energy is defined for an information source and measures the average number of bits per symbol under optimal coding. The Shannon entropy energy in the time domain can be calculated using the following formula:
$$H(S)=-\sum_{i=1}^{N}P(S(i))\log P(S(i)) \qquad (1)$$
where $H(S)$ denotes the Shannon entropy energy in the time domain, $S$ denotes sound data comprising N symbols, $S(i)$ denotes the i-th symbol, and $P(S(i))$ denotes the emission probability of the i-th symbol.
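A short sketch of the time-domain Shannon entropy is given below. It makes two assumptions that the text does not state: the emission probability of each symbol is estimated from its relative frequency in the data, and the logarithm is taken to base 2 so that the result is in bits.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy in bits, with probabilities estimated from symbol counts."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

one_bit = shannon_entropy("aabb")     # two equally likely symbols
no_info = shannon_entropy("aaaa")     # a single repeated symbol
two_bits = shannon_entropy("abcd")    # four equally likely symbols
```

Two equally likely symbols need one bit each on average, four need two bits, and a source that always emits the same symbol carries no information.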
In a specific implementation, applying Shannon entropy to voice activity detection rests on one assumption: the signal spectrum of a frame that contains speech data is more regular than that of a noise frame that does not. Accordingly, in an embodiment of the present invention, formula (1) can be transformed to the spectral energy domain, i.e. the Shannon entropy energy of the spectral energy domain of each frame is calculated using the following formula:
$$H\left(|Y(w,t)|^{2}\right)=\left(-\sum_{w=1}^{\varepsilon}P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1}$$
where $H(|Y(w,t)|^{2})$ denotes the Shannon entropy energy of the spectral energy domain of the current frame t, $P(|Y(w,t)|^{2})$ denotes the probability of the amplitude spectrum of the current frame t in the corresponding frequency band w, $Y(w,t)$ denotes the noise type of frequency band w corresponding to the current frame t, and $\varepsilon$ denotes the number of frequency bands obtained by division.
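One way to evaluate the spectral-domain entropy for a single frame is sketched below. The text does not say how the band probability P is obtained, so this example assumes it is the band's share of the frame's total spectral energy; the equal-width band split and the function name are likewise illustrative assumptions.

```python
import math

def spectral_entropy(power_spectrum, num_bands):
    """Shannon entropy of the normalised band energies of one frame.

    Assumption: P for band w is energy(w) / total energy; bins left over
    after splitting into equal-width bands are ignored.
    """
    size = len(power_spectrum) // num_bands
    band_energy = [sum(power_spectrum[k * size:(k + 1) * size])
                   for k in range(num_bands)]
    total = sum(band_energy)
    if total == 0:
        return 0.0  # silent frame: treat the entropy as zero
    probs = [e / total for e in band_energy]
    return -sum(p * math.log(p) for p in probs if p > 0)

flat_frame = spectral_entropy([1.0] * 8, num_bands=4)
peaky_frame = spectral_entropy([0, 0, 5, 5, 0, 0, 0, 0], num_bands=4)
```

A frame whose energy is spread evenly across the four bands gives log(4), while a frame whose energy sits in a single band gives 0.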
Step S205: Judge whether the Shannon entropy energy of the spectral energy domain of the current frame is greater than the preset threshold. If so, step S206 can be performed; otherwise, execution continues with the next frame from step S204.
In a specific implementation, the preset threshold can be determined from a global variable of the sound data to be recognized, namely the Shannon entropy of the spectral energy domain of the frames obtained by division. Studies have found that the values of this global variable exhibit a bimodal distribution, so two Gaussian distribution functions can be used to model the distribution of the values of the Shannon entropy of the spectral energy domain of the frames. The two Gaussian functions can be determined using the expectation-maximization (EM) method; the two determined Gaussian distribution functions are then used to calculate the globally optimal value of the threshold, which gives the final threshold.
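The threshold estimation can be sketched as a two-component, one-dimensional Gaussian mixture fitted by EM. The text does not specify how the threshold is derived from the two Gaussians, so the rule used here (the point between the two means where the weighted densities are equal, found on a grid) is an assumption, as are the initialisation, the iteration count, the function names and the synthetic entropy values.

```python
import math

def log_weighted_density(x, weight, mean, var):
    """log(weight * N(x; mean, var)), kept in the log domain to avoid underflow."""
    return (math.log(weight) - 0.5 * math.log(2 * math.pi * var)
            - (x - mean) ** 2 / (2 * var))

def fit_two_gaussians(values, iterations=100):
    """EM for a two-component 1-D Gaussian mixture: (weights, means, variances)."""
    lo, hi = min(values), max(values)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    variances = [max((hi - lo) ** 2 / 16, 1e-6)] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of component 0 for every value
        resp0 = []
        for x in values:
            a = log_weighted_density(x, weights[0], means[0], variances[0])
            b = log_weighted_density(x, weights[1], means[1], variances[1])
            resp0.append(1.0 / (1.0 + math.exp(min(b - a, 700.0))))
        # M-step: re-estimate weight, mean and variance of each component
        for k, resp in enumerate((resp0, [1.0 - r for r in resp0])):
            n_k = max(sum(resp), 1e-12)
            weights[k] = n_k / len(values)
            means[k] = sum(r * x for r, x in zip(resp, values)) / n_k
            variances[k] = max(
                sum(r * (x - means[k]) ** 2 for r, x in zip(resp, values)) / n_k,
                1e-6)
    return weights, means, variances

def threshold_between(weights, means, variances, steps=1000):
    """Assumed rule: grid point between the means where both components match."""
    lo, hi = sorted(means)
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]

    def gap(x):
        return abs(log_weighted_density(x, weights[0], means[0], variances[0])
                   - log_weighted_density(x, weights[1], means[1], variances[1]))

    return min(grid, key=gap)

# Synthetic bimodal entropy values clustered around 1.0 and 2.0
entropies = ([1.0 + 0.01 * i for i in range(-10, 11)]
             + [2.0 + 0.01 * i for i in range(-10, 11)])
weights, means, variances = fit_two_gaussians(entropies)
threshold = threshold_between(weights, means, variances)
```

For this symmetric synthetic data the two fitted means settle near 1.0 and 2.0 and the threshold falls close to the midpoint between them.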
In a specific implementation, the threshold is associated with the noise spectrum characteristics of the current sound to be recognized. That is, the threshold changes only when the noise spectrum characteristics change; a change in noise level does not change the threshold. The voice activity detection method in the embodiment of the present invention therefore remains robust when the noise level changes.
Step S206: Perform speech recognition on the current sound frame.
In a specific implementation, when the Shannon entropy energy of the spectral energy domain of the current frame is greater than the corresponding threshold, the current frame contains speech information. Speech recognition can then be performed on the current frame to identify the specific speech content.
In a specific implementation, after step S206 has been performed, execution can continue from step S204 for the sound frame following the current one, until every sound frame in the acquired sound data to be recognized has been traversed.
In a specific implementation, when the above speech recognition method is applied to an always-listening system in a mobile terminal and complete speech information has been recognized in the acquired sound data, the mobile terminal can perform the corresponding operation according to the recognized speech content. For example, when the speech input by the user is recognized as "dial XX's mobile phone", the mobile terminal recognizes the speech information "dial XX's mobile phone" input by the user and, after correct recognition, obtains XX's mobile phone number from the terminal itself and dials it automatically.
It should be pointed out here that when Y represents white noise, $H(|Y(w,t)|^{2})$ reaches its maximum, log(Ω), where Ω is the number of frequency bands (denoted ε in the definition of H); when Y represents a pure tone, $H(|Y(w,t)|^{2})$ reaches its minimum, 0. In other words, the dynamic range of $H(|Y(w,t)|^{2})$ is 0 to log(Ω). Under white noise, the value of the Shannon entropy of the spectral energy domain of a noise frame that contains no speech information is independent of the noise level, and the corresponding threshold can be estimated in advance. Based on this observation, the voice activity detection method in the embodiment of the present invention is particularly well suited to voice activity detection under white noise or quasi-white noise.
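The two bounds and the independence from noise level can be checked numerically with idealised band energies. The sketch assumes, as an illustration, that the band probability is the band's share of the total energy; the eight-band layout and the variable names are made up for the example.

```python
import math

def entropy_of(band_energies):
    """Shannon entropy of band energies normalised to sum to one."""
    total = sum(band_energies)
    if total == 0:
        return 0.0
    return -sum((e / total) * math.log(e / total) for e in band_energies if e > 0)

NUM_BANDS = 8
white = [1.0] * NUM_BANDS                 # idealised white noise: flat spectrum
tone = [0.0] * (NUM_BANDS - 1) + [3.0]    # idealised pure tone: one active band
loud_white = [100.0 * e for e in white]   # same noise at a higher level

h_white = entropy_of(white)
h_tone = entropy_of(tone)
h_loud = entropy_of(loud_white)
```

The flat spectrum reaches log(8), the single-band spectrum gives 0, and scaling every band by the same factor leaves the entropy unchanged because the normalisation cancels the level.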
The device corresponding to the speech recognition method in the embodiment of the present invention is described in further detail below.
Fig. 3 shows a schematic structural diagram of a voice activity detection device provided by an embodiment of the present invention. In a specific implementation, the voice activity detection device 300 shown in Fig. 3 may include a Fourier transform unit 301, a first calculation unit 302, a judging unit 303 and a determining unit 304, where:
The Fourier transform unit 301 is adapted to divide the acquired sound data to be recognized into multiple overlapping frames and perform a fast Fourier transform on each frame to obtain the corresponding spectrum.
The first calculation unit 302 is adapted to traverse the spectra of the overlapping frames and calculate the Shannon entropy energy of the spectral energy domain of the frame currently traversed.
In an embodiment of the present invention, the first calculation unit 302 is adapted to calculate the Shannon entropy energy of the spectral energy domain of the frame currently traversed using the following formula:
$$H\left(|Y(w,t)|^{2}\right)=\left(-\sum_{w=1}^{\varepsilon}P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1}$$
where $H(|Y(w,t)|^{2})$ denotes the Shannon entropy energy of the spectral energy domain of the current frame t, $P(|Y(w,t)|^{2})$ denotes the probability of the amplitude spectrum of the current frame t in the corresponding frequency band w, $Y(w,t)$ denotes the noise type of frequency band w corresponding to the current frame t, and $\varepsilon$ denotes the number of frequency bands obtained by division.
The judging unit 303 is adapted to judge whether the Shannon entropy energy of the spectral energy domain of the current frame is greater than a preset threshold. In a specific implementation, the preset threshold is associated with the spectral characteristics of the noise corresponding to the sound data to be recognized; that is, the threshold changes with the noise spectrum characteristics of the sound data to be recognized, but does not change with its noise level.
The determining unit 304 is adapted to determine that the current frame contains speech information when the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than the threshold.
In an embodiment of the present invention, the voice activity detection device 300 may further include a second calculation unit 305, where:
The second calculation unit 305 is adapted to determine two corresponding Gaussian distribution functions based on the Shannon entropy of the spectral energy domain of the overlapping frames, where the two determined Gaussian distribution functions are used to model the Shannon entropy of the spectral energy domain of the overlapping frames, and to calculate the threshold using the determined Gaussian distribution functions.
In an embodiment of the present invention, the second calculation unit 305 is adapted to determine the two corresponding Gaussian distribution functions using the expectation-maximization (EM) method.
Because the Shannon entropy energy of the spectral energy domain of a frame containing speech information is more regular than that of a frame containing only noise information, the scheme in the embodiment of the present invention compares the calculated Shannon entropy energy of the spectral energy domain of each overlapping frame with a preset threshold and uses the result of the comparison to determine whether the corresponding frame contains speech information, which improves the accuracy of voice activity detection. In addition, calculating the Shannon entropy energy of the spectral energy domain is simpler than building a mathematical model for speech recognition, so computing resources are saved.
A person of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include ROM, RAM, a magnetic disk, an optical disc or the like.
The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Any person skilled in the art may make various changes or modifications without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall therefore be defined by the claims.
Claims (10)
- A kind of 1. voice activity detection method, it is characterised in that including:The voice data to be identified of acquisition is divided into multiple overlapping frames, and FFT fortune is carried out to each frame Calculate, obtain corresponding frequency spectrum;The frequency spectrum of the multiple overlapping frame is traveled through, calculates the Shannon entropy energy in the spectrum energy domain of the present frame of traversal extremely Amount;When it is determined that the Shannon entropy energy in the spectrum energy domain of present frame is more than default threshold value, determine that present frame is believed including voice Breath.
- 2. voice activity detection method according to claim 1, it is characterised in that the present frame of the calculating traversal extremely The Shannon entropy energy in spectrum energy domain, including:<mrow> <mi>H</mi> <mrow> <mo>(</mo> <mo>|</mo> <mi>Y</mi> <mo>(</mo> <mrow> <mi>w</mi> <mo>,</mo> <mi>t</mi> </mrow> <mo>)</mo> <msup> <mo>|</mo> <mn>2</mn> </msup> <mo>)</mo> </mrow> <mo>=</mo> <msup> <mrow> <mo>(</mo> <mo>-</mo> <msubsup> <mi>&Sigma;</mi> <mrow> <mi>w</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>&epsiv;</mi> </msubsup> <mi>P</mi> <mo>(</mo> <mrow> <mo>|</mo> <mi>Y</mi> <mrow> <mo>(</mo> <mrow> <mi>w</mi> <mo>,</mo> <mi>t</mi> </mrow> <mo>)</mo> </mrow> <msup> <mo>|</mo> <mn>2</mn> </msup> </mrow> <mo>)</mo> <mi>l</mi> <mi>o</mi> <mi>g</mi> <mo>(</mo> <mrow> <mi>P</mi> <mrow> <mo>(</mo> <mrow> <mo>|</mo> <mi>Y</mi> <mrow> <mo>(</mo> <mrow> <mi>w</mi> <mo>,</mo> <mi>t</mi> </mrow> <mo>)</mo> </mrow> <msup> <mo>|</mo> <mn>2</mn> </msup> </mrow> <mo>)</mo> </mrow> </mrow> <mo>)</mo> <mo>)</mo> </mrow> <mn>1</mn> </msup> <mo>;</mo> </mrow>Wherein, H (| Y (w, t) |2) represent present frame t spectrum energy domain Shannon entropy energy, P (| Y (w, t) |2Represent present frame Probability of the t amplitude spectrum in corresponding frequency band w, Y (w, t) represent the noise types of frequency range w corresponding to present frame t, and ε represents division The quantity of obtained frequency range.
- 3. voice activity detection method according to claim 1, it is characterised in that the default threshold value is waited to know with described The noise spectrum characteristic of other voice data is associated.
- 4. voice activity detection method according to claim 1, it is characterised in that be calculated in the following way described Default threshold value:The Shannon entropy in the spectrum energy domain based on the multiple overlapping frame, it is determined that corresponding two gauss of distribution function;Wherein, Identified two gauss of distribution function are used for the Shannon entropy for simulating the spectrum energy domain of the multiple overlapping frame;Using identified gauss of distribution function, the threshold value is calculated.
- 5. The voice activity detection method according to claim 4, characterized in that determining the two corresponding Gaussian distribution functions comprises: determining the two corresponding Gaussian distribution functions using the expectation-maximization (EM) method.
- 6. A voice activity detection device, characterized by comprising: a Fourier transform unit, adapted to divide the acquired voice data to be identified into a plurality of overlapping frames and to perform a fast Fourier transform on each frame to obtain the corresponding spectrum; a first calculation unit, adapted to traverse the spectra of the plurality of overlapping frames and to calculate the Shannon entropy energy in the spectrum energy domain of the current frame reached by the traversal; a judging unit, adapted to judge whether the Shannon entropy energy in the spectrum energy domain of the current frame is greater than a preset threshold; and a determining unit, adapted to determine that the current frame contains voice information when the Shannon entropy energy in the spectrum energy domain of the current frame is determined to be greater than the threshold.
- 7. The voice activity detection device according to claim 6, characterized in that the first calculation unit is adapted to calculate the Shannon entropy energy in the spectrum energy domain of the current frame reached by the traversal using the following formula:

  $$H\left(|Y(w,t)|^{2}\right) = \left(-\sum_{w=1}^{\varepsilon} P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1};$$

  where $H(|Y(w,t)|^{2})$ denotes the Shannon entropy energy in the spectrum energy domain of the current frame $t$, $P(|Y(w,t)|^{2})$ denotes the probability of the amplitude spectrum of the current frame $t$ in the corresponding frequency band $w$, $Y(w,t)$ denotes the noise type of the frequency band $w$ corresponding to the current frame $t$, and $\varepsilon$ denotes the number of frequency bands obtained by the division.
- 8. The voice activity detection device according to claim 6, characterized in that the preset threshold is associated with the spectral characteristics of the noise corresponding to the current voice data to be identified.
- 9. The voice activity detection device according to claim 6, characterized by further comprising: a second calculation unit, adapted to determine two corresponding Gaussian distribution functions based on the Shannon entropy in the spectrum energy domain of the plurality of overlapping frames, wherein the two determined Gaussian distribution functions are used to model the Shannon entropy in the spectrum energy domain of the plurality of overlapping frames, and to calculate the threshold using the determined Gaussian distribution functions.
- 10. The voice activity detection device according to claim 9, characterized in that the second calculation unit is adapted to determine the two corresponding Gaussian distribution functions using the expectation-maximization (EM) method.
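
For illustration only, and not as part of the claims, the processing chain recited above (overlapping frames, FFT, band-wise Shannon entropy, a two-Gaussian fit by expectation-maximization, and a threshold comparison) might be sketched in Python as follows. Everything not stated in the claims is an assumption: the function names, the frame length of 256 with a hop of 128, the choice of 16 frequency bands for ε, the skipping of the DC bin, the handling of all-zero frames, and above all the rule that places the threshold at the midpoint of the two fitted means, since the claims do not say how the threshold is derived from the Gaussian functions.

```python
import cmath
import math


def split_frames(samples, frame_len=256, hop=128):
    """Split samples into overlapping frames (hop < frame_len gives overlap)."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def band_entropy(frame, n_bands=16):
    """H = -sum_w P_w * log(P_w) over n_bands bands (epsilon in the claims)."""
    spectrum = fft(frame)
    # Power |Y|^2 of the positive-frequency bins, skipping DC (an assumption).
    power = [abs(c) ** 2 for c in spectrum[1:len(frame) // 2 + 1]]
    width = len(power) // n_bands
    bands = [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]
    total = sum(bands)
    if total == 0.0:
        return 0.0  # assumption: an all-zero frame is given entropy 0
    probs = [e / total for e in bands]
    # The exponent 1 in the claimed formula leaves this sum unchanged.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fit_two_gaussians(values, iters=50):
    """EM fit of a two-component 1-D Gaussian mixture: (means, variances, weights)."""
    lo, hi = min(values), max(values)
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    var = [max(((hi - lo) / 4.0) ** 2, 1e-6)] * 2
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [weight[k] * math.exp(-(v - mu[k]) ** 2 / (2.0 * var[k]))
                    / math.sqrt(2.0 * math.pi * var[k]) for k in range(2)]
            norm = sum(dens) or 1e-300
            resp.append([d / norm for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            mu[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var[k] = max(sum(r[k] * (v - mu[k]) ** 2
                             for r, v in zip(resp, values)) / nk, 1e-6)
            weight[k] = nk / len(values)
    return mu, var, weight


def threshold_from_entropies(values):
    """Assumed rule (not from the claims): midpoint between the two fitted means."""
    mu, _, _ = fit_two_gaussians(values)
    return 0.5 * (mu[0] + mu[1])


def detect_voice(samples, frame_len=256, hop=128):
    """Flag each frame whose entropy exceeds the data-derived threshold.

    Requires at least one full frame of samples.
    """
    entropies = [band_entropy(f) for f in split_frames(samples, frame_len, hop)]
    threshold = threshold_from_entropies(entropies)
    return [h > threshold for h in entropies]
```

Calling `detect_voice(samples)` on a list of floats returns one boolean per frame. The sketch uses only the standard library so that it stays self-contained; a practical implementation would more likely rely on an optimized FFT routine and a tested mixture-model fitter.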
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610607277.XA CN107665711A (en) | 2016-07-28 | 2016-07-28 | Voice activity detection method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610607277.XA CN107665711A (en) | 2016-07-28 | 2016-07-28 | Voice activity detection method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107665711A (en) | 2018-02-06 |
Family ID: 61114130
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610607277.XA Pending CN107665711A (en) | 2016-07-28 | 2016-07-28 | Voice activity detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107665711A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101499280A (en) * | 2009-03-09 | 2009-08-05 | 武汉大学 | Spacing parameter choosing method and apparatus based on spacing perception entropy judgement |
CN101599269A (en) * | 2009-07-02 | 2009-12-09 | 中国农业大学 | Sound end detecting method and device |
CN102097095A (en) * | 2010-12-28 | 2011-06-15 | 天津市亚安科技电子有限公司 | Speech endpoint detecting method and device |
CN102938069A (en) * | 2012-06-13 | 2013-02-20 | 北京师范大学 | Pure and mixed pixel automatic classification method based on information entropy |
CN103948398A (en) * | 2014-04-04 | 2014-07-30 | 杭州电子科技大学 | Heart sound location segmenting method suitable for Android system |
CN105023572A (en) * | 2014-04-16 | 2015-11-04 | 王景芳 | Noised voice end point robustness detection method |
Non-Patent Citations (2)
Title |
---|
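Meysam Asgari et al.: "Voice Activity Detection Using Entropy in Spectrum Domain", 2008 Australasian Telecommunication Networks and Applications Conference * |
许作辉 (Xu Zuohui): "基于信息熵的语音端点检测算法研究与实现" [Research and Implementation of a Speech Endpoint Detection Algorithm Based on Information Entropy], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] * |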
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113270118A (en) * | 2021-05-14 | 2021-08-17 | 杭州朗和科技有限公司 | Voice activity detection method and device, storage medium and electronic equipment |
CN113270118B (en) * | 2021-05-14 | 2024-02-13 | 杭州网易智企科技有限公司 | Voice activity detection method and device, storage medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN100476949C (en) | Multichannel voice detection in adverse environments | |
WO2021114733A1 (en) | Noise suppression method for processing at different frequency bands, and system thereof | |
CN1248190C (en) | Fast frequency-domain pitch estimation | |
CN107833581B (en) | Method, device and readable storage medium for extracting fundamental tone frequency of sound | |
JP2007041593A (en) | Method and apparatus for extracting voiced/unvoiced classification information using harmonic component of voice signal | |
KR20150037986A (en) | Determining hotword suitability | |
CN104616663A (en) | Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation) | |
CN105261375A (en) | Voice activity detection method and apparatus | |
CN106024017A (en) | Voice detection method and device | |
CN104143324A (en) | Musical tone note identification method | |
KR100735343B1 (en) | Apparatus and method for extracting pitch information of a speech signal | |
CN111696580A (en) | Voice detection method and device, electronic equipment and storage medium | |
CN114666618B (en) | Audio auditing method, device, equipment and readable storage medium | |
CN106033669A (en) | Voice identification method and apparatus thereof | |
CN107564512B (en) | Voice activity detection method and device | |
CN106816157A (en) | Audio recognition method and device | |
CN101123090B (en) | Speech recognition by statistical language using square-rootdiscounting | |
CN106920543B (en) | Audio recognition method and device | |
CN106297795B (en) | Audio recognition method and device | |
CN103559289A (en) | Language-irrelevant keyword search method and system | |
CN101030374B (en) | Method and apparatus for extracting base sound period | |
CN111489739B (en) | Phoneme recognition method, apparatus and computer readable storage medium | |
CN107665711A (en) | Voice activity detection method and device | |
EP1436805B1 (en) | 2-phase pitch detection method and appartus | |
Bouzid et al. | Voice source parameter measurement based on multi-scale analysis of electroglottographic signal |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180206 |