CN107665711A - Voice activity detection method and device - Google Patents


Info

Publication number
CN107665711A
Authority
CN
China
Prior art keywords
frame
spectrum
shannon entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610607277.XA
Other languages
Chinese (zh)
Inventor
孙廷玮
柯逸倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd filed Critical Spreadtrum Communications Shanghai Co Ltd
Priority to CN201610607277.XA priority Critical patent/CN107665711A/en
Publication of CN107665711A publication Critical patent/CN107665711A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice activity detection method and device. The voice activity detection method includes: dividing acquired sound data to be recognized into multiple overlapping frames, and performing a fast Fourier transform (FFT) on each frame to obtain the corresponding spectrum; traversing the spectra of the multiple overlapping frames and calculating the Shannon entropy energy of the spectral energy domain of the current frame reached in the traversal; and, when the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than a preset threshold, determining that the current frame includes voice information. The above scheme can improve the speed and accuracy of speech recognition.

Description

Voice activity detection method and device
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a voice activity detection method and device.
Background
A mobile terminal is a computing device that can be used while on the move; broadly, the term covers mobile phones, notebook computers, tablet computers, POS terminals, in-vehicle computers, and the like. With the rapid development of integrated circuit technology, mobile terminals have acquired powerful processing capability and are changing from simple calling tools into integrated information processing platforms, which gives them broader room for development. Using a mobile terminal, however, usually requires a certain amount of the user's attention. Today's mobile terminal devices are equipped with touch screens, which the user must touch to perform the corresponding operations. When the user cannot touch the device, for example while driving a vehicle or carrying articles in hand, operating the mobile terminal becomes highly inconvenient.
Speech recognition methods and always-listening systems (Always Listening System) make it possible to activate and operate a mobile terminal hands-free. When the always-listening system detects a voice signal, the speech recognition system is activated and recognizes the detected voice signal; the mobile terminal then performs the corresponding operation according to the recognized voice signal. For example, when the user speaks "dial XX's mobile phone", the mobile terminal recognizes the voice information "dial XX's mobile phone" input by the user and, after correct recognition, obtains XX's mobile phone number from the mobile terminal and dials it.
However, voice activity detection methods in the prior art generally use a preset mathematical model to perform speech recognition on the input sound data, and suffer from slow speech recognition and low accuracy.
Summary of the invention
The problem solved by the embodiments of the present invention is how to improve the speed and accuracy of speech recognition.
To solve the above problem, an embodiment of the present invention provides a voice activity detection method, which includes: dividing acquired sound data to be recognized into multiple overlapping frames, and performing a fast Fourier transform on each frame to obtain the corresponding spectrum; traversing the spectra of the multiple overlapping frames and calculating the Shannon entropy energy of the spectral energy domain of the current frame reached in the traversal; and, when the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than a preset threshold, determining that the current frame includes voice information.
Optionally, calculating the Shannon entropy energy of the spectral energy domain of the current frame reached in the traversal includes:
$$H\left(|Y(w,t)|^{2}\right)=\left(-\sum_{w=1}^{\varepsilon}P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1}$$
where $H(|Y(w,t)|^{2})$ denotes the Shannon entropy energy of the spectral energy domain of the current frame $t$, $P(|Y(w,t)|^{2})$ denotes the probability of the amplitude spectrum of the current frame $t$ in the corresponding frequency band $w$, $Y(w,t)$ denotes the noise type of frequency band $w$ for the current frame $t$, and $\varepsilon$ denotes the number of frequency bands obtained by division.
Optionally, the preset threshold is associated with the noise spectrum characteristics of the sound data to be recognized.
Optionally, the preset threshold is calculated as follows: based on the Shannon entropy of the spectral energy domain of the multiple overlapping frames, determining two corresponding Gaussian distribution functions, where the two determined Gaussian distribution functions are used to model the Shannon entropy of the spectral energy domain of the multiple overlapping frames; and calculating the threshold using the determined Gaussian distribution functions.
Optionally, determining the two corresponding Gaussian distribution functions includes: determining the two corresponding Gaussian distribution functions using the expectation-maximization (EM) method.
An embodiment of the present invention further provides a voice activity detection device, which includes: a Fourier transform unit, adapted to divide acquired sound data to be recognized into multiple overlapping frames and to perform a fast Fourier transform on each frame to obtain the corresponding spectrum; a first computing unit, adapted to traverse the spectra of the multiple overlapping frames and calculate the Shannon entropy energy of the spectral energy domain of the current frame reached in the traversal; a judging unit, adapted to judge whether the Shannon entropy energy of the spectral energy domain of the current frame is greater than a preset threshold; and a determining unit, adapted to determine that the current frame includes voice information when the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than the threshold.
Optionally, the first computing unit is adapted to calculate the Shannon entropy energy of the spectral energy domain of the current frame reached in the traversal using the following formula:
$$H\left(|Y(w,t)|^{2}\right)=\left(-\sum_{w=1}^{\varepsilon}P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1}$$
where $H(|Y(w,t)|^{2})$ denotes the Shannon entropy energy of the spectral energy domain of the current frame $t$, $P(|Y(w,t)|^{2})$ denotes the probability of the amplitude spectrum of the current frame $t$ in the corresponding frequency band $w$, $Y(w,t)$ denotes the noise type of frequency band $w$ for the current frame $t$, and $\varepsilon$ denotes the number of frequency bands obtained by division.
Optionally, the preset threshold is associated with the spectral characteristics of the noise corresponding to the current sound data to be recognized.
Optionally, the device further includes a second computing unit, adapted to determine two corresponding Gaussian distribution functions based on the Shannon entropy of the spectral energy domain of the multiple overlapping frames, where the two determined Gaussian distribution functions are used to model the Shannon entropy of the spectral energy domain of the multiple overlapping frames, and to calculate the threshold using the determined Gaussian distribution functions.
Optionally, the second computing unit is adapted to determine the two corresponding Gaussian distribution functions using the expectation-maximization (EM) method.
Compared with the prior art, the technical solution of the present invention has the following advantages:
In the above scheme, whether each frame includes voice information is determined from the result of comparing the Shannon entropy energy of the spectral energy domain of each of the overlapping frames, obtained by dividing the sound data to be recognized, with a preset threshold. Compared with the Shannon entropy energy of the spectral energy domain of a frame that contains only noise information, the Shannon entropy energy of the spectral energy domain of a frame that includes voice information is more regular, so the Shannon entropy of the spectral energy domain can be used to identify accurately whether each frame includes voice information, which improves the accuracy of voice activity detection. In addition, calculating the Shannon entropy energy of the spectral energy domain of each frame is simpler than building a mathematical model for speech recognition, which saves computing resources and improves the speed of voice activity detection.
Brief description of the drawings
Fig. 1 is a kind of flow chart of voice activity detection method in the embodiment of the present invention;
Fig. 2 is the flow chart of another voice activity detection method in the embodiment of the present invention;
Fig. 3 is a kind of structural representation of voice activity detection device in the embodiment of the present invention.
Detailed description
A voice activity detection (Voice Activity Detection, VAD) method in the prior art divides the spectrum of the current sound frame reached in the traversal into multiple non-overlapping subbands; calculates the root-mean-square energy of the current sound frame from the spectral energy of its subbands; and, when the root-mean-square energy of the current sound frame is determined to be greater than a preset threshold, determines that the current sound frame includes voice information.
The above VAD method performs well when the noise varies more slowly than the method can track and the energy level of speech segments is higher than that of noise segments. When these conditions do not hold, however, its speech detection accuracy is low.
To solve the above problem in the prior art, the technical solution adopted by the embodiments of the present invention compares the Shannon entropy energy of the spectral energy domain of the current sound frame with a corresponding threshold to determine whether the current sound frame includes voice information, which can improve both the accuracy and the speed of voice activity detection.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flow chart of a speech recognition method in an embodiment of the present invention. The speech recognition method shown in Fig. 1 may include the following steps:
Step S101: divide the acquired sound data to be recognized into multiple overlapping frames, and perform a fast Fourier transform on each frame to obtain the corresponding spectrum.
In a specific implementation, when the sound data to be recognized is divided, the number of overlapping frames obtained and the amount of overlap between adjacent frames can be set according to actual needs.
Step S102: traverse the spectra of the multiple overlapping frames, and calculate the Shannon entropy energy of the spectral energy domain of the current frame reached in the traversal.
In a specific implementation, the spectra of the overlapping frames obtained by division can be traversed in chronological order.
Step S103: when the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than a preset threshold, determine that the current frame includes voice information.
In a specific implementation, once the Shannon entropy energy of the spectral energy domain of each frame has been calculated, it can be compared with the preset threshold to judge whether it is greater than the preset threshold. When the Shannon entropy energy of the corresponding spectral energy domain is determined to be greater than the preset threshold, it is determined that the frame includes voice information; otherwise, it is determined that the frame does not include voice information.
The speech recognition method in the embodiments of the present invention is described in further detail below with reference to Fig. 2.
Fig. 2 shows a flow chart of another speech recognition method in an embodiment of the present invention. The speech recognition method shown in Fig. 2 may include the following steps:
Step S201: divide the acquired sound data into overlapping frames to obtain the corresponding multiple overlapping frames.
In a specific implementation, the collected sound signal can first undergo analog-to-digital conversion to obtain the corresponding sound data. The sound data can then be divided into overlapping frames. Framing the collected sound data is, in essence, short-time analysis of the sound data. Short-time analysis divides the sound signal into short time segments of fixed period, each of which is a relatively stable, continuous sound clip. Two adjacent sound frames partially overlap, and the extent of the overlap can be chosen according to actual conditions.
Step S202: apply windowing to the resulting overlapping frames.
In a specific implementation, a window function commonly used in speech signal processing, such as a Hamming window, a Hanning window, or a rectangular window, can be selected; the frame length is chosen as 10 to 40 ms, with a typical value of 20 ms. Framing the sound signal disrupts its natural continuity; applying windowing and related processing to the sound frames can address this problem.
Step S203: perform a fast Fourier transform on the sound signal of each windowed frame to obtain the spectrum corresponding to each frame.
In a specific implementation, sound data in theory varies over time and is a non-stationary process, so it cannot be transformed to the frequency domain directly. After framing (short-time analysis), however, the sound data of each frame can be regarded as relatively stable, and the frequency-domain transform can therefore be applied.
In a specific implementation, the short-time Fourier transform (Short-Time Fourier Transform / Short-Term Fourier Transform, STFT) can be used to transform the sound data of each frame to the frequency domain and obtain the spectral information of each frame. The resulting spectrum contains the relation between the frequency and the energy of the corresponding sound signal.
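Steps S201 to S203 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation described in the patent: the function names, the 256-sample frame length with a 128-sample hop (32 ms frames with 50% overlap at an assumed 8 kHz sampling rate), the choice of a Hamming window, and the pure-Python radix-2 FFT are all illustrative.

```python
import cmath
import math


def split_into_frames(samples, frame_len, hop):
    """Split samples into frames of frame_len; hop < frame_len gives overlap."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def power_spectra(samples, frame_len=256, hop=128):
    """Return |Y(w, t)|^2 per frame t over the non-negative frequency bins."""
    win = hamming(frame_len)
    spectra = []
    for frame in split_into_frames(samples, frame_len, hop):
        spec = fft([s * w for s, w in zip(frame, win)])
        spectra.append([abs(c) ** 2 for c in spec[:frame_len // 2 + 1]])
    return spectra
```

In practice a library FFT would normally replace the hand-written one; it is spelled out here only to keep the sketch self-contained.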
Step S204: traverse the frames obtained by division, and calculate the Shannon entropy energy of the spectral energy domain of the current frame reached in the traversal.
In a specific implementation, Shannon entropy energy is defined for an information source and can be used to measure the average number of bits per symbol under optimal coding. The Shannon entropy energy of the time domain can be calculated with the following formula:
$$H(S)=-\sum_{i=1}^{N}P(S(i))\log P(S(i)) \qquad (1)$$
where $H(S)$ denotes the Shannon entropy energy of the time domain, $S$ denotes the sound data comprising $N$ symbols, $S(i)$ denotes the $i$-th symbol, and $P(S(i))$ denotes the emission probability of the $i$-th symbol.
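As a minimal illustration of the time-domain Shannon entropy, the following Python sketch estimates the emission probability of each symbol by its relative frequency in the sequence and uses a base-2 logarithm so that the result is in bits. Both choices, and the function name, are assumptions made for the example rather than details given in the text.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Entropy in bits per symbol, with P estimated by relative frequency."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A sequence of one repeated symbol has zero entropy, while a sequence in which four symbols occur equally often needs two bits per symbol.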
In a specific implementation, applying Shannon entropy to voice activity detection rests on an assumption, namely that the signal spectrum of a frame that includes speech data is more regular than the signal spectrum of a noise frame that does not. Therefore, in an embodiment of the present invention, formula (1) can be transformed into the spectral energy domain; that is, the Shannon entropy energy of the spectral energy domain corresponding to each frame is calculated with the following formula:
$$H\left(|Y(w,t)|^{2}\right)=\left(-\sum_{w=1}^{\varepsilon}P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1}$$
where $H(|Y(w,t)|^{2})$ denotes the Shannon entropy energy of the spectral energy domain of the current frame $t$, $P(|Y(w,t)|^{2})$ denotes the probability of the amplitude spectrum of the current frame $t$ in the corresponding frequency band $w$, $Y(w,t)$ denotes the noise type of frequency band $w$ for the current frame $t$, and $\varepsilon$ denotes the number of frequency bands obtained by division.
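The spectral-domain entropy can be sketched in Python as below. The text does not say how the probability P is obtained; this sketch assumes it is the share of the frame's total energy that falls in each band, which is a common convention but an assumption here. The function name and the handling of an all-zero frame are likewise illustrative.

```python
import math


def spectral_entropy(power_spectrum):
    """Entropy (natural log) of a frame's normalised band energies."""
    total = sum(power_spectrum)
    if total <= 0.0:
        return 0.0  # assumed convention for a silent (all-zero) frame
    h = 0.0
    for p in power_spectrum:
        q = p / total  # assumed estimate of P(|Y(w, t)|^2)
        if q > 0.0:    # treat 0 * log(0) as 0
            h -= q * math.log(q)
    return h
```

Because the band energies are normalised before the logarithm is taken, multiplying the whole spectrum by a constant gain does not change the result.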
Step S205: judge whether the Shannon entropy energy of the spectral energy domain of the current frame is greater than a preset threshold; when the result is yes, step S206 can be performed; otherwise, processing continues with the next frame, starting from step S204.
In a specific implementation, the preset threshold can be determined from the global variable of the Shannon entropy of the spectral energy domain of the sound data to be recognized, that is, of the frames obtained by division. Studies have found that the values of this global variable follow a bimodal distribution, so two Gaussian distribution functions can be used to model their distribution. The two Gaussian functions can be determined using the expectation-maximization (EM) method; the two determined Gaussian distribution functions are then used to calculate the globally optimal value of the threshold, which gives the final threshold.
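The threshold estimation can be sketched as follows. The EM fit of a two-component, one-dimensional Gaussian mixture is standard; how the threshold is derived from the two fitted Gaussians is not specified in the text, so the rule used here (the point between the two means where the weighted densities are closest) is purely an assumption, as are the function names, the initialisation, and the iteration count.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(values, iterations=100):
    """EM for a 1-D mixture of two Gaussians.

    Returns [(weight, mean, variance), ...] sorted by mean.
    """
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    means = [lo + 0.25 * span, lo + 0.75 * span]
    variances = [(span / 4.0) ** 2, (span / 4.0) ** 2]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of component 0 for each value.
        resp = []
        for x in values:
            a = weights[0] * gaussian_pdf(x, means[0], variances[0])
            b = weights[1] * gaussian_pdf(x, means[1], variances[1])
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - v for v in resp]
            nk = sum(r) or 1e-12
            means[k] = sum(ri * x for ri, x in zip(r, values)) / nk
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, values)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsed component
            weights[k] = nk / len(values)
    return sorted(zip(weights, means, variances), key=lambda c: c[1])


def threshold_from_mixture(components, steps=1000):
    """Assumed rule: point between the means where weighted densities are closest."""
    (w0, m0, v0), (w1, m1, v1) = components
    best_x, best_gap = m0, float("inf")
    for i in range(steps + 1):
        x = m0 + (m1 - m0) * i / steps
        gap = abs(w0 * gaussian_pdf(x, m0, v0) - w1 * gaussian_pdf(x, m1, v1))
        if gap < best_gap:
            best_x, best_gap = x, gap
    return best_x
```

A typical use would be to collect the spectral entropy of every frame of the recording, call `fit_two_gaussians` on that list, and pass the result to `threshold_from_mixture`.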
In a specific implementation, the threshold is associated with the noise spectrum characteristics of the current sound to be recognized. That is, the threshold changes only when the spectral characteristics of the noise change; a change in noise level does not change the threshold. The voice activity detection method in the embodiments of the present invention therefore remains robust when the noise level changes.
Step S206: perform speech recognition on the current sound frame.
In a specific implementation, when the Shannon entropy energy of the spectral energy domain of the current frame is greater than the corresponding threshold, the current frame includes voice information. Speech recognition can then be performed on the current frame to identify the specific voice content.
In a specific implementation, after step S206 has been performed, processing can continue with the next sound frame, starting from step S204, until every sound frame in the acquired sound data to be recognized has been traversed.
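The loop formed by steps S204 to S206 can be sketched as below. The per-frame entropies are assumed to have been computed already, and `recognize` is a hypothetical callback standing in for the speech recognition stage; neither name comes from the text.

```python
def run_vad(frame_entropies, threshold, recognize):
    """Visit frames in time order and hand voiced frames to `recognize`.

    Returns the indices of the frames judged to contain voice information.
    """
    voiced = []
    for t, h in enumerate(frame_entropies):
        if h > threshold:   # S205: compare with the preset threshold
            recognize(t)    # S206: hypothetical recognition callback
            voiced.append(t)
        # otherwise fall through to the next frame (back to S204)
    return voiced
```

For example, `run_vad([0.1, 0.9, 0.4, 0.8], 0.5, print)` prints 1 and 3 and returns `[1, 3]`.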
In a specific implementation, when the above speech recognition method is applied to the always-listening system of a mobile terminal and complete voice information is recognized in the acquired sound data, the mobile terminal can perform the corresponding operation according to the recognized voice content. For example, when the voice input by the user is recognized as "dial XX's mobile phone", the mobile terminal recognizes the voice information "dial XX's mobile phone" input by the user and, after correct recognition, obtains XX's mobile phone number from itself and dials it automatically.
It should be pointed out here that when Y represents white noise, $H(|Y(w,t)|^{2})$ reaches its maximum value, log(Ω); when Y represents a pure tone, $H(|Y(w,t)|^{2})$ reaches its minimum value, 0. In other words, the dynamic range of $H(|Y(w,t)|^{2})$ is 0 to log(Ω). Moreover, under white noise, the value of the Shannon entropy of the spectral energy domain of a noise frame that does not include voice information is independent of the noise level, and the corresponding threshold can be estimated in advance. Based on this observation, the voice activity detection method in the embodiments of the present invention is well suited to voice activity detection under white noise or quasi-white noise.
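The two extreme values and the independence from noise level can be checked with a short derivation. It assumes, which the text does not state explicitly, that P is the share of the frame's energy in each band and that Ω is the number of bands.

```latex
% Assumption: P is the normalised band energy.
P\left(|Y(w,t)|^{2}\right) = \frac{|Y(w,t)|^{2}}{\sum_{w'=1}^{\Omega} |Y(w',t)|^{2}}

% White noise: flat spectrum, so P = 1/\Omega in every band.
H = -\sum_{w=1}^{\Omega} \frac{1}{\Omega} \log\frac{1}{\Omega} = \log\Omega

% Pure tone: all energy in one band, so P = 1 there and 0 elsewhere (with 0 \log 0 := 0).
H = -1 \cdot \log 1 = 0

% Noise level: a gain a > 0 applied to Y cancels in P, so H is unchanged.
\frac{a^{2}\,|Y(w,t)|^{2}}{\sum_{w'=1}^{\Omega} a^{2}\,|Y(w',t)|^{2}}
  = \frac{|Y(w,t)|^{2}}{\sum_{w'=1}^{\Omega} |Y(w',t)|^{2}}
```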
The device corresponding to the speech recognition method in the embodiments of the present invention is described in further detail below.
Fig. 3 shows a structural schematic diagram of a voice activity detection device in an embodiment of the present invention. In a specific implementation, the voice activity detection device 300 shown in Fig. 3 may include a Fourier transform unit 301, a first computing unit 302, a judging unit 303, and a determining unit 304, where:
The Fourier transform unit 301 is adapted to divide acquired sound data to be recognized into multiple overlapping frames and to perform a fast Fourier transform on each frame to obtain the corresponding spectrum.
The first computing unit 302 is adapted to traverse the spectra of the multiple overlapping frames and calculate the Shannon entropy energy of the spectral energy domain of the current frame reached in the traversal.
In an embodiment of the present invention, the first computing unit 302 is adapted to calculate the Shannon entropy energy of the spectral energy domain of the current frame reached in the traversal using the following formula:
$$H\left(|Y(w,t)|^{2}\right)=\left(-\sum_{w=1}^{\varepsilon}P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1}$$
where $H(|Y(w,t)|^{2})$ denotes the Shannon entropy energy of the spectral energy domain of the current frame $t$, $P(|Y(w,t)|^{2})$ denotes the probability of the amplitude spectrum of the current frame $t$ in the corresponding frequency band $w$, $Y(w,t)$ denotes the noise type of frequency band $w$ for the current frame $t$, and $\varepsilon$ denotes the number of frequency bands obtained by division.
The judging unit 303 is adapted to judge whether the Shannon entropy energy of the spectral energy domain of the current frame is greater than a preset threshold. In a specific implementation, the preset threshold is associated with the spectral characteristics of the noise corresponding to the sound data to be recognized; that is, the threshold changes as the noise spectrum characteristics of the sound data to be recognized change, but does not change with the noise level of the sound data to be recognized.
The determining unit 304 is adapted to determine that the current frame includes voice information when the Shannon entropy energy of the spectral energy domain of the current frame is determined to be greater than the threshold.
In an embodiment of the present invention, the voice activity detection device 300 may further include a second computing unit 305, where:
The second computing unit 305 is adapted to determine two corresponding Gaussian distribution functions based on the Shannon entropy of the spectral energy domain of the multiple overlapping frames, where the two determined Gaussian distribution functions are used to model the Shannon entropy of the spectral energy domain of the multiple overlapping frames, and to calculate the threshold using the determined Gaussian distribution functions.
In an embodiment of the present invention, the second computing unit 305 is adapted to determine the two corresponding Gaussian distribution functions using the expectation-maximization (EM) method.
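One way to picture how units 301 to 304 fit together is the small Python sketch below. It is only an illustration of the division of responsibilities: the class name, the method names, and the decision to inject the transform and entropy functions as callables are assumptions, and the threshold that the second computing unit 305 would estimate is simply passed in.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class VoiceActivityDetector:
    # Role of the Fourier transform unit 301: samples -> per-frame spectra.
    transform: Callable[[Sequence[float]], List[List[float]]]
    # Role of the first computing unit 302: spectrum -> entropy value.
    entropy: Callable[[Sequence[float]], float]
    # Preset threshold (unit 305 would estimate it; supplied directly here).
    threshold: float

    def exceeds_threshold(self, value: float) -> bool:
        """Role of the judging unit 303."""
        return value > self.threshold

    def detect(self, samples: Sequence[float]) -> List[bool]:
        """Role of the determining unit 304: one voiced flag per frame."""
        return [self.exceeds_threshold(self.entropy(spectrum))
                for spectrum in self.transform(samples)]
```

Keeping the transform and entropy steps as injected callables makes it easy to swap in a different framing scheme or entropy estimate without touching the decision logic.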
Compared with the Shannon entropy energy of the spectral energy domain of a frame that contains only noise information, the Shannon entropy energy of the spectral energy domain of a frame that includes voice information is more regular. The scheme in the embodiments of the present invention compares the calculated Shannon entropy energy of the spectral energy domain of each of the overlapping frames with a preset threshold and uses the result of the comparison to determine whether the corresponding frame includes voice information, which improves the accuracy of voice activity detection. In addition, calculating the Shannon entropy energy of the spectral energy domain is simpler than building a mathematical model for speech recognition, which saves computing resources.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, a magnetic disk, an optical disc, or the like.
The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Any person skilled in the art may make various changes or modifications without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall therefore be defined by the claims.

Claims (10)

  1. A kind of 1. voice activity detection method, it is characterised in that including:
    The voice data to be identified of acquisition is divided into multiple overlapping frames, and FFT fortune is carried out to each frame Calculate, obtain corresponding frequency spectrum;
    The frequency spectrum of the multiple overlapping frame is traveled through, calculates the Shannon entropy energy in the spectrum energy domain of the present frame of traversal extremely Amount;
    When it is determined that the Shannon entropy energy in the spectrum energy domain of present frame is more than default threshold value, determine that present frame is believed including voice Breath.
  2. 2. voice activity detection method according to claim 1, it is characterised in that the present frame of the calculating traversal extremely The Shannon entropy energy in spectrum energy domain, including:
    <mrow> <mi>H</mi> <mrow> <mo>(</mo> <mo>|</mo> <mi>Y</mi> <mo>(</mo> <mrow> <mi>w</mi> <mo>,</mo> <mi>t</mi> </mrow> <mo>)</mo> <msup> <mo>|</mo> <mn>2</mn> </msup> <mo>)</mo> </mrow> <mo>=</mo> <msup> <mrow> <mo>(</mo> <mo>-</mo> <msubsup> <mi>&amp;Sigma;</mi> <mrow> <mi>w</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>&amp;epsiv;</mi> </msubsup> <mi>P</mi> <mo>(</mo> <mrow> <mo>|</mo> <mi>Y</mi> <mrow> <mo>(</mo> <mrow> <mi>w</mi> <mo>,</mo> <mi>t</mi> </mrow> <mo>)</mo> </mrow> <msup> <mo>|</mo> <mn>2</mn> </msup> </mrow> <mo>)</mo> <mi>l</mi> <mi>o</mi> <mi>g</mi> <mo>(</mo> <mrow> <mi>P</mi> <mrow> <mo>(</mo> <mrow> <mo>|</mo> <mi>Y</mi> <mrow> <mo>(</mo> <mrow> <mi>w</mi> <mo>,</mo> <mi>t</mi> </mrow> <mo>)</mo> </mrow> <msup> <mo>|</mo> <mn>2</mn> </msup> </mrow> <mo>)</mo> </mrow> </mrow> <mo>)</mo> <mo>)</mo> </mrow> <mn>1</mn> </msup> <mo>;</mo> </mrow>
    Wherein, H (| Y (w, t) |2) represent present frame t spectrum energy domain Shannon entropy energy, P (| Y (w, t) |2Represent present frame Probability of the t amplitude spectrum in corresponding frequency band w, Y (w, t) represent the noise types of frequency range w corresponding to present frame t, and ε represents division The quantity of obtained frequency range.
  3. 3. voice activity detection method according to claim 1, it is characterised in that the default threshold value is waited to know with described The noise spectrum characteristic of other voice data is associated.
  4. 4. voice activity detection method according to claim 1, it is characterised in that be calculated in the following way described Default threshold value:
    The Shannon entropy in the spectrum energy domain based on the multiple overlapping frame, it is determined that corresponding two gauss of distribution function;Wherein, Identified two gauss of distribution function are used for the Shannon entropy for simulating the spectrum energy domain of the multiple overlapping frame;
    Using identified gauss of distribution function, the threshold value is calculated.
  5. 5. voice activity detection method according to claim 4, it is characterised in that two Gausses point corresponding to the determination Cloth function, including:
    Using two gauss of distribution function corresponding to the determination of maximum expected value method.
  6. A voice activity detection device, characterized by comprising:
    a Fourier transform unit, adapted to divide the acquired voice data to be identified into multiple overlapping frames and to perform a fast Fourier transform on each frame to obtain the corresponding spectrum;
    a first computing unit, adapted to traverse the spectra of the multiple overlapping frames and calculate the Shannon entropy energy in the spectrum energy domain of the current frame reached by the traversal;
    a judging unit, adapted to judge whether the Shannon entropy energy in the spectrum energy domain of the current frame is greater than a preset threshold;
    a determining unit, adapted to determine that the current frame includes voice information when the Shannon entropy energy in the spectrum energy domain of the current frame is determined to be greater than the threshold.
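The four units of claim 6 map onto a short processing chain, sketched below under stated assumptions. The frame length, the 50% overlap, and all function names are hypothetical. A naive DFT stands in for the fast Fourier transform so that the sketch needs only the standard library (an FFT yields the same spectrum), and each band's probability is taken to be its share of the frame's total energy.

```python
import cmath
import math


def split_frames(samples, frame_len=64, hop=32):
    """Fourier transform unit, part 1: cut the signal into overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Fourier transform unit, part 2: |Y(w, t)|^2 for bins 0..n/2 (naive DFT)."""
    n = len(frame)
    spectrum = []
    for w in range(n // 2 + 1):
        y = sum(x * cmath.exp(-2j * math.pi * w * k / n)
                for k, x in enumerate(frame))
        spectrum.append(abs(y) ** 2)
    return spectrum


def spectral_entropy(power):
    """First computing unit: Shannon entropy of the normalised band energies."""
    total = sum(power)
    if total <= 0.0:
        return 0.0  # assumption: an all-zero frame is given entropy 0
    return -sum((e / total) * math.log(e / total) for e in power if e > 0.0)


def detect_voice(samples, threshold, frame_len=64, hop=32):
    """Judging and determining units: one flag per frame, True when the
    entropy of the frame is greater than the threshold."""
    return [spectral_entropy(power_spectrum(frame)) > threshold
            for frame in split_frames(samples, frame_len, hop)]
```

The threshold is passed in as a plain number here. How it is obtained is the subject of claims 3 to 5 and 8 to 10.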
  7. The voice activity detection device according to claim 6, characterized in that the first computing unit is adapted to calculate the Shannon entropy energy in the spectrum energy domain of the current frame reached by the traversal using the following formula:
    $$H\left(|Y(w,t)|^{2}\right)=\left(-\sum_{w=1}^{\varepsilon}P\left(|Y(w,t)|^{2}\right)\log\left(P\left(|Y(w,t)|^{2}\right)\right)\right)^{1};$$
    where H(|Y(w, t)|^2) denotes the Shannon entropy energy in the spectrum energy domain of the current frame t, P(|Y(w, t)|^2) denotes the probability of the amplitude spectrum of the current frame t in the corresponding frequency band w, Y(w, t) denotes the noise type of the frequency band w corresponding to the current frame t, and ε denotes the number of frequency bands obtained by the division.
  8. The voice activity detection device according to claim 6, characterized in that the preset threshold is associated with the spectral characteristics of the noise corresponding to the voice data currently to be identified.
  9. The voice activity detection device according to claim 6, characterized by further comprising:
    a second computing unit, adapted to determine two corresponding Gaussian distribution functions based on the Shannon entropy in the spectrum energy domain of the multiple overlapping frames, wherein the two determined Gaussian distribution functions are used to model the Shannon entropy in the spectrum energy domain of the multiple overlapping frames, and to calculate the threshold using the determined Gaussian distribution functions.
  10. The voice activity detection device according to claim 9, characterized in that the second computing unit is adapted to determine the two corresponding Gaussian distribution functions using the expectation-maximization (maximum expected value) method.
CN201610607277.XA 2016-07-28 2016-07-28 Voice activity detection method and device Pending CN107665711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610607277.XA CN107665711A (en) 2016-07-28 2016-07-28 Voice activity detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610607277.XA CN107665711A (en) 2016-07-28 2016-07-28 Voice activity detection method and device

Publications (1)

Publication Number Publication Date
CN107665711A true CN107665711A (en) 2018-02-06

Family

ID=61114130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610607277.XA Pending CN107665711A (en) 2016-07-28 2016-07-28 Voice activity detection method and device

Country Status (1)

Country Link
CN (1) CN107665711A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113270118A (en) * 2021-05-14 2021-08-17 杭州朗和科技有限公司 Voice activity detection method and device, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101499280A (en) * 2009-03-09 2009-08-05 武汉大学 Spacing parameter choosing method and apparatus based on spacing perception entropy judgement
CN101599269A (en) * 2009-07-02 2009-12-09 中国农业大学 Sound end detecting method and device
CN102097095A (en) * 2010-12-28 2011-06-15 天津市亚安科技电子有限公司 Speech endpoint detecting method and device
CN102938069A (en) * 2012-06-13 2013-02-20 北京师范大学 Pure and mixed pixel automatic classification method based on information entropy
CN103948398A (en) * 2014-04-04 2014-07-30 杭州电子科技大学 Heart sound location segmenting method suitable for Android system
CN105023572A (en) * 2014-04-16 2015-11-04 王景芳 Noised voice end point robustness detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MEYSAM ASGARI ET AL.: "Voice Activity Detection Using Entropy in Spectrum Domain", 《2008 AUSTRALASIAN TELECOMMUNICATION NETWORKS AND APPLICATIONS CONFERENCE》 *
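许作辉 (Xu Zuohui): "Research and Implementation of a Speech Endpoint Detection Algorithm Based on Information Entropy" (基于信息熵的语音端点检测算法研究与实现), 《China Master's Theses Full-text Database, Information Science and Technology》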

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113270118A (en) * 2021-05-14 2021-08-17 杭州朗和科技有限公司 Voice activity detection method and device, storage medium and electronic equipment
CN113270118B (en) * 2021-05-14 2024-02-13 杭州网易智企科技有限公司 Voice activity detection method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN100476949C (en) Multichannel voice detection in adverse environments
WO2021114733A1 (en) Noise suppression method for processing at different frequency bands, and system thereof
CN1248190C (en) Fast frequency-domain pitch estimation
CN107833581B (en) Method, device and readable storage medium for extracting fundamental tone frequency of sound
JP2007041593A (en) Method and apparatus for extracting voiced/unvoiced classification information using harmonic component of voice signal
KR20150037986A (en) Determining hotword suitability
CN104616663A (en) Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)
CN105261375A (en) Voice activity detection method and apparatus
CN106024017A (en) Voice detection method and device
CN104143324A (en) Musical tone note identification method
KR100735343B1 (en) Apparatus and method for extracting pitch information of a speech signal
CN111696580A (en) Voice detection method and device, electronic equipment and storage medium
CN114666618B (en) Audio auditing method, device, equipment and readable storage medium
CN106033669A (en) Voice identification method and apparatus thereof
CN107564512B (en) Voice activity detection method and device
CN106816157A (en) Audio recognition method and device
CN101123090B (en) Speech recognition by statistical language using square-root discounting
CN106920543B (en) Audio recognition method and device
CN106297795B (en) Audio recognition method and device
CN103559289A (en) Language-irrelevant keyword search method and system
CN101030374B (en) Method and apparatus for extracting base sound period
CN111489739B (en) Phoneme recognition method, apparatus and computer readable storage medium
CN107665711A (en) Voice activity detection method and device
EP1436805B1 (en) 2-phase pitch detection method and apparatus
Bouzid et al. Voice source parameter measurement based on multi-scale analysis of electroglottographic signal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20180206