US7774203B2 - Audio signal segmentation algorithm - Google Patents
- Publication number
- US7774203B2 (application US11/589,772)
- Authority
- US
- United States
- Prior art keywords
- audio
- segment
- audio signal
- music
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Definitions
- This application claims priority from Taiwan Application Serial Number 95118143, filed May 22, 2006, the disclosure of which is hereby incorporated by reference herein in its entirety.
- the present invention relates to an audio signal segmentation algorithm, and more particularly, to an audio signal segmentation algorithm for use in noisy environments with a low signal-to-noise ratio (SNR).
- SNR signal-to-noise ratio
- the technique of segmenting speech/music signals from audio signals has become more important in multimedia applications.
- the first kind of audio signal segmentation algorithm designs classifiers by directly extracting the features of the signals in the time domain or the frequency domain to discriminate and to further segment the speech and the music signals.
- the features used in these kinds of audio signal segmentation algorithms are zero-crossing information, energy, pitch, Cepstral Coefficients, line spectral frequencies, 4 Hz modulation energy and some perception features, such as tone and rhythm.
- These kinds of conventional techniques extract the features directly.
- the windows used to analyze the signals are relatively large, so the segment boundaries are not precise enough.
- fixed thresholds are used in most methods to determine the segmentation. Therefore, they cannot offer satisfactory results under low SNR noise environments.
- the second kind of audio signal segmentation algorithm generates features needed in the classifiers by statistics, which is called the posterior probability based feature. Although better results can be obtained by getting features with statistics, a large number of training data samples are needed in these kinds of conventional techniques and they are also not suitable in actual environments.
- the third kind of audio signal segmentation algorithm emphasizes the design of the classifier models.
- the most commonly used methods are the Bayesian information criterion, the Gaussian likelihood ratio and a hidden Markov model (HMM) based classifier. These kinds of conventional techniques put stress on setting up effective classifiers. Although the methods are practical, some of them require heavy computation, such as the Bayesian information criterion, and some of them need a large number of training data samples prepared in advance to set up the models, such as the Gaussian likelihood ratio and the hidden Markov model (HMM). They are not good choices in practical applications.
- one objective of the present invention is to provide an audio signal segmentation algorithm suitable to be used in low SNR environments which works well in practical noisy environments.
- Another objective of the present invention is to provide an audio signal segmentation algorithm which can be used in the front of the audio signal processing system to classify the signals and further to let the system discriminate and segment the speech and the audio signals.
- Still another objective of the present invention is to provide an audio signal segmentation algorithm which does not need a large amount of training data and whose chosen features are more robust to noise.
- Still another objective of the present invention is to provide an audio signal segmentation algorithm which can be used as an IP to be supplied to multimedia system chips.
- the present invention provides an audio signal segmentation algorithm comprising the following steps. First, an audio signal is provided. Then, an audio activity detection (AAD) step is applied to divide the audio signal into at least one first audio segment and at least one second audio segment. Then, an audio feature extraction step is performed on the second audio segment to obtain a plurality of audio features of the second audio segment. A smoothing step is then applied to the second audio segment after the audio feature extraction step. Afterwards, a plurality of speech frames and a plurality of music frames are discriminated from the second audio segment wherein the speech frames and the music frames compose at least one speech segment and at least one music segment, respectively.
- AAD audio activity detection
- the first audio segment is a noise segment.
- the audio activity detection step further comprises the following steps. First, the audio signal is divided into a plurality of frames. Then, a frequency transformation step is applied to signals in each of the frames to obtain a plurality of bands in each frame. Then, a likelihood computation step is performed on the bands and a noise parameter to obtain a likelihood ratio there between. Then, a comparison step is performed on the likelihood ratio and a noise threshold. If the noise threshold is greater than the likelihood ratio, the bands belong to a first frame, and if the likelihood ratio is greater than the noise threshold, the bands belong to a second frame wherein the first frame belongs to the first audio segment and the second frame belongs to the second audio segment. When a distance between two adjacent second frames is smaller than a predetermined value, the two adjacent second frames are combined to compose the second audio segment.
- the frequency transformation step is a Fourier Transform.
- the noise parameter is a noise variance of the Fourier coefficient and is obtained by estimating a variance of a noise segment in the initial part of the audio signal.
- the estimation of the noise threshold further comprises the following steps. First, a noise segment in the initial part of the audio signal is extracted. Then, the noise segment is mixed with one of a plurality of noiseless speech/music segments to a predetermined signal-to-noise ratio (SNR) to form a mixing audio segment. Then, the audio activity detection step is applied to the mixing audio segment to divide the mixing audio segment into at least one speech segment and at least one music segment by using a first threshold. Afterwards, the algorithm judges whether the speech segment and the music segment match the noiseless speech/music segment and obtains a result. If the result is yes, the first threshold is taken as the noise threshold.
- the present invention further comprises mixing the noise segment and the other noiseless speech/music segments, respectively, and repeating the audio activity detection step and the judging step to obtain a plurality of thresholds, and then, comparing the thresholds with the first threshold to choose a smallest value as the noise threshold.
- the audio features are selected from the group consisting of low short time energy rate (LSTER), spectrum flux (SF), likelihood ratio crossing rate (LRCR) and an arbitrary combination thereof.
- the audio feature extraction step to extract the audio feature of likelihood ratio crossing rate further comprises computing a sum of a crossing rate of the waveform of the likelihood ratio to a plurality of predetermined thresholds by using the likelihood ratio of each frame. If the sum of the crossing rate is greater than a predetermined value, the likelihood ratio belongs to the speech segment, and if the sum of the crossing rate is smaller than the predetermined value, the likelihood ratio belongs to the music segment.
- one of the predetermined thresholds is one third the mean of the likelihood ratio, and another one of the predetermined thresholds is one ninth the mean of the likelihood ratio.
- the smoothing step further comprises convolving the second audio segment, after the audio feature extraction step, with a window.
- the window may be a rectangular window.
- the step of discriminating the speech frames and the music frames from the second audio segment is based on a classifier, and the classifier is selected from the group consisting of a K-nearest neighbor (KNN) classifier, a Gaussian mixture model (GMM) classifier, a hidden Markov model (HMM) classifier and a multi-layer perceptron (MLP) classifier.
- KNN K-nearest neighbor
- GMM Gaussian mixture model
- HMM hidden Markov model
- MLP multi-layer perceptron
- the preferred embodiment of the present invention further comprises segmenting the speech segment and the music segment from the second audio segment.
- FIG. 1 illustrates a flow diagram of the audio signal segmentation algorithm according to the preferred embodiment of the present invention
- FIG. 2 illustrates a flow diagram of the audio activity detection step according to the preferred embodiment of the present invention
- FIG. 3 illustrates an example of the frame-merging process in the preferred embodiment of the present invention
- FIG. 4 illustrates a flow diagram of the estimation of the noise threshold according to the preferred embodiment of the present invention
- FIG. 5 illustrates a diagram of the likelihood ratio crossing rate of the music signal
- FIG. 6 illustrates a diagram of the likelihood ratio crossing rate of the speech signal
- FIG. 7 illustrates an example of the smoothing step according to the preferred embodiment of the present invention.
- FIG. 8 illustrates an example of the audio signal segmentation algorithm according to the preferred embodiment of the present invention.
- the present invention discloses an audio signal segmentation algorithm comprising the following steps. First, an audio signal is provided. Then, an audio activity detection (AAD) step is applied to divide the audio signal into at least one noise segment and at least one noisy audio segment. Then, multiple audio features are extracted from the noisy audio segment by a frame with fixed length in the audio feature extraction step. Afterwards, a smoothing step is applied to the audio features to raise the discrimination rate of the speech and the music frames. Then, a classifier is used to tell the speech and the music frames apart. Finally, the frames of the same kind are merged according to the result and the speech and the music segments are then segmented.
- In order to make the illustration of the present invention more explicit and complete, the following description is stated with reference to FIGS. 1 through 8.
- FIG. 1 illustrates a flow diagram of the audio signal segmentation algorithm according to the preferred embodiment of the present invention.
- an audio signal is provided in step 102 .
- an audio activity detection (AAD) step is applied to divide the audio signal into a noise segment 106 and a noisy audio segment 108 in step 104 .
- an audio feature extraction step is performed on the noisy audio segment 108 , as shown in step 110 .
- the audio feature extraction step extracts three kinds of audio features from the noisy audio segment 108 .
- the audio features are low short time energy rate (LSTER), spectrum flux (SF), and likelihood ratio crossing rate (LRCR), respectively.
- LSTER low short time energy rate
- SF spectrum flux
- LRCR likelihood ratio crossing rate
- the likelihood ratio of each frame is used to compute the sum of the crossing rate in the waveform of the likelihood ratio compared to multiple predetermined thresholds. If the sum of the crossing rate is greater than a predetermined value, the likelihood ratio belongs to a speech segment, and if the sum of the crossing rate is smaller than the predetermined value, the likelihood ratio belongs to a music segment.
- In step 112, the result obtained is convolved with a window (such as a rectangular window) in the smoothing step to raise the discrimination rate for the following step.
- In step 114, a classifier is used to tell the speech and the music frames apart.
- the speech frames and the music frames compose at least one speech segment and at least one music segment, respectively.
- the frames of the same kind are merged according to the result and the speech and the music segments are then segmented.
- the speech segment 116 and the music segment 118 are obtained.
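As an illustration only, merging frames of the same kind into segments can be sketched in Python as a run-length grouping of per-unit labels. The function name `frames_to_segments`, the label strings and the one-second unit length are assumptions made for this sketch and are not taken from the disclosure.

```python
from itertools import groupby


def frames_to_segments(labels, unit_seconds=1.0):
    """Group consecutive identical labels into (label, start, end) segments.

    `labels` holds one classifier decision per analysis unit; `unit_seconds`
    is the assumed duration of each unit.
    """
    segments = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        start = position * unit_seconds
        end = (position + length) * unit_seconds
        segments.append((label, start, end))
        position += length
    return segments


if __name__ == "__main__":
    decisions = ["speech", "speech", "music", "music", "music", "speech"]
    for segment in frames_to_segments(decisions):
        print(segment)
```

Each returned tuple gives the kind of segment and its start and end time, which is one simple way to represent the speech segment and the music segment.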
- the classifier is a KNN based classifier and it classifies the signals into different types in a codebook and further determines if the signals belong to speech or music. The following describes in detail the audio activity detection step used in the preferred embodiment of the present invention.
- FIG. 2 illustrates a flow diagram of the audio activity detection step according to the preferred embodiment of the present invention.
- the audio signal is divided into multiple frames in step 202 .
- the length of each frame may be 30 ms.
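A minimal sketch of the framing step, assuming non-overlapping frames and a known sample rate (neither of which is specified in the text), might look as follows. The helper name `split_into_frames` is hypothetical.

```python
def split_into_frames(samples, sample_rate, frame_ms=30):
    """Split a sample list into consecutive non-overlapping frames.

    Frames are `frame_ms` milliseconds long; a trailing remainder shorter
    than one frame is dropped. Non-overlap is an assumption of this sketch.
    """
    size = int(sample_rate * frame_ms / 1000)
    if size <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


if __name__ == "__main__":
    audio = [0.0] * 1000                 # placeholder signal
    frames = split_into_frames(audio, sample_rate=8000)
    print(len(frames), len(frames[0]))   # number of frames, samples per frame
```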
- a frequency transformation step is applied to the signals in each frame to obtain multiple bands in each frame in step 204 .
- the frequency transformation step uses a Fourier Transform.
- a likelihood computation step is performed on the bands and a noise parameter 208 to obtain a likelihood ratio between them in step 206 .
- the noise parameter 208 is the noise variance of the Fourier coefficient and is obtained by estimating the variance of a noise segment in the initial part of the audio signal.
- a comparison step is performed between the likelihood ratio and the noise threshold 212 . If the likelihood ratio is smaller than the noise threshold, the bands belong to a noise frame 214 , and if the likelihood ratio is greater than the noise threshold, the bands belong to a noisy audio frame 216 .
- the likelihood computation step and the comparison step are based on the equation:
- Λ is the likelihood ratio
- L is the number of the bands
- Xk denotes the kth Fourier coefficient in one of the frames
- λN(k) is the noise variance of the Fourier coefficient and denotes the variance of the kth Fourier coefficient of the noise
- η is the noise threshold
- H0 denotes the result is the noise frame
- H1 denotes the result is the noisy audio frame.
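The likelihood computation and the comparison step can be illustrated with the following Python sketch. The equation itself is not reproduced in this text, so the sketch assumes a form often used in statistical-model-based activity detection, namely the mean over the L bands of g - log(g) - 1 with g = |Xk|^2 / λN(k); this form, the naive DFT, and all function names are assumptions and should not be read as the patented equation.

```python
import cmath
import math


def dft_power(frame):
    """Return |X_k|^2 for k = 0..L-1 using a naive DFT (adequate for short frames)."""
    n = len(frame)
    powers = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        powers.append(abs(coeff) ** 2)
    return powers


def estimate_noise_variance(noise_frames):
    """Average |X_k|^2 over leading noise-only frames: one variance per band k."""
    spectra = [dft_power(f) for f in noise_frames]
    return [sum(s[k] for s in spectra) / len(spectra) for k in range(len(spectra[0]))]


def likelihood_ratio(frame, noise_var, floor=1e-12):
    """Assumed form: mean over bands of g - log(g) - 1, g = |X_k|^2 / lambda_N(k)."""
    total = 0.0
    for power, lam in zip(dft_power(frame), noise_var):
        g = max(power, floor) / max(lam, floor)   # floor avoids log(0) and division by zero
        total += g - math.log(g) - 1.0
    return total / len(noise_var)


def is_noisy_audio(frame, noise_var, eta):
    """Decide H1 (noisy audio frame) if the ratio exceeds eta, else H0 (noise frame)."""
    return likelihood_ratio(frame, noise_var) > eta
```

In use, `estimate_noise_variance` would be fed the frames from the initial noise-only part of the signal, and `eta` would be the noise threshold chosen for the environment.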
- the frames are merged by checking whether the distance between two adjacent detected frames is too small. If it is, the two frames are merged into one; otherwise they remain two separate frames. In other words, when the distance between two adjacent noisy audio frames is smaller than a predetermined value, the two adjacent noisy audio frames are combined to compose the noisy audio segment 220.
- FIG. 3 illustrates an example of the frame-merging process in the preferred embodiment of the present invention. As the circle in FIG. 3 shows, when the distance between the adjacent frames is too small, they are merged into a single frame.
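The frame-merging rule can be sketched in Python as an interval merge. Representing detected frames as (start, end) pairs in milliseconds and the name `merge_close_frames` are assumptions of this sketch; the predetermined value is passed in as `max_gap`.

```python
def merge_close_frames(intervals, max_gap):
    """Merge detected (start, end) intervals whose gap is smaller than max_gap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < max_gap:
            # Gap to the previous interval is too small: extend the previous one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    detected = [(0, 300), (350, 600), (1500, 1800)]    # milliseconds
    print(merge_close_frames(detected, max_gap=200))
```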
- the noise threshold η can be estimated as different values for different environments, rather than being fixed, in order to make the audio signal segmentation algorithm of the present invention suitable for different environments. The following describes in detail the estimation of the noise threshold.
- FIG. 4 illustrates a flow diagram of the estimation of the noise threshold according to the preferred embodiment of the present invention.
- a noise segment 402 in the initial part of the audio signal is extracted.
- the noise segment 402 is mixed with a noiseless speech/music segment 404 to a predetermined signal-to-noise ratio (SNR) to form a mixing audio segment 406 .
- the noise parameter 410 estimated from the noise segment 402 is used to perform the audio activity detection step on the mixing audio segment 406 , as shown in step 408 .
- the mixing audio segment 406 is first divided into multiple frames, and then, a frequency transformation step is applied to signals in each frame to obtain multiple bands in each frame.
- a likelihood ratio between the bands and the noise parameter is computed, and the mixing audio segment 406 is divided into at least one speech segment and at least one music segment using a first threshold.
- a judging step is performed to judge if the speech segment and the music segment are correctly segmented to match the noiseless speech/music segment 404 and obtain a result. If the result is no, the first threshold is adjusted and the audio activity detection step and the judging step are repeated on the mixing audio segment 406 , as shown in step 414 . If the result is yes, the first threshold is equal to the noise threshold, as shown in step 416 .
- the estimation of the noise threshold in the preferred embodiment of the present invention first extracts a noise segment in the initial part of the audio signal and then mixes the noise segment with prepared training data (a noiseless speech/music segment) to a predetermined signal-to-noise ratio. Since the training data is prepared in advance, the location of the voice in the training data is already known, so the signal-to-noise ratio of the training data and the noise segment can be adjusted. Generally, if the signal with the lowest SNR in the system is 5 dB, the SNR of the mixing audio segment can be set to 3 dB to estimate the threshold; it only needs to be lower than 5 dB. Then, the audio activity detection step is performed on the mixing audio segment.
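Mixing a noise segment with clean training data to a target SNR can be sketched in Python as follows. The sketch measures signal power over the whole clean clip rather than only over the known voiced locations, and assumes both inputs have the same length; these simplifications and the name `mix_at_snr` are assumptions.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) equals snr_db, then add it.

    Returns the mixed samples and the gain applied to the noise.
    """
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("both signals must have non-zero power")
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    mixed = [c + gain * n for c, n in zip(clean, noise)]
    return mixed, gain


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    noise = [0.5, 0.5, -0.5, -0.5]
    mixed, gain = mix_at_snr(clean, noise, snr_db=3.0)
    print(mixed, gain)
```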
- a Fourier transform is applied to the mixing audio segment in frames of 30 ms. Then, the likelihood ratio is computed, and an initial threshold (0) is used for the judgment. If the threshold can detect all of the voice parts in the training data, the threshold is raised by 0.2, and this is repeated until the highest threshold that can still completely tell apart all the voice segments is obtained. There are t pieces of training data, so the procedure is carried out t times. However, each piece of training data is relatively short, so this does not take much time. When all the training data has been processed, t thresholds are obtained, and the smallest of these t thresholds is chosen as the threshold used in the system.
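The threshold search described above can be sketched in Python as follows. The sketch simplifies the judging step to a frame-level test (every frame known to contain voice must have a likelihood ratio above the threshold); that simplification, the `max_steps` guard and the function names are assumptions.

```python
def highest_passing_threshold(ratios, voiced, start=0.0, step=0.2, max_steps=10000):
    """Raise the threshold in fixed steps while all known voiced frames are detected.

    `ratios` are per-frame likelihood ratios of one mixed training clip and
    `voiced` marks which frames are known to contain voice.
    """
    voiced_ratios = [r for r, v in zip(ratios, voiced) if v]
    if not voiced_ratios:
        return None
    best = None
    for i in range(max_steps):
        threshold = start + i * step
        if all(r > threshold for r in voiced_ratios):
            best = threshold          # still detects every voiced frame
        else:
            break                     # first failure: keep the previous value
    return best


def estimate_noise_threshold(training_items):
    """Take the smallest per-clip threshold over all t training clips."""
    candidates = [highest_passing_threshold(r, v) for r, v in training_items]
    candidates = [c for c in candidates if c is not None]
    return min(candidates) if candidates else None
```

Choosing the smallest of the per-clip thresholds keeps the detector conservative: a threshold that works for the hardest clip also detects the voice in all the others.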
- the audio signal inputted is divided into a noise segment and a noisy audio segment. Then, the audio feature extraction step is performed on the noisy audio segment to obtain audio features of the noisy audio segment.
- Three audio features are used in discriminating the speech signals and the music signals in the preferred embodiment of the present invention. Each audio feature is defined in a length of about one second, and the length of one second is also the smallest unit in the discrimination in the preferred embodiment of the present invention.
- These three audio features are low short time energy rate (LSTER), spectrum flux (SF), and likelihood ratio crossing rate (LRCR), respectively. They are described as follows.
- the audio feature of low short time energy rate: in a piece of audio signal, the change of the energy across the frames of a speech signal is bigger than that of a music signal, owing to the pitch, so the speech signal and the music signal can be discriminated just by calculating the ratio of low-energy frames.
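A minimal Python sketch of the low short time energy rate follows. The text does not define what counts as low energy, so the sketch assumes a commonly used definition: the fraction of frames in a window whose short-time energy falls below half of the window's average energy. The factor 0.5 and the function names are assumptions.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def lster(frames, factor=0.5):
    """Fraction of frames whose energy is below `factor` times the window mean."""
    energies = [short_time_energy(f) for f in frames]
    mean_energy = sum(energies) / len(energies)
    low = sum(1 for e in energies if e < factor * mean_energy)
    return low / len(energies)


if __name__ == "__main__":
    speech_like = [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]   # bursts and pauses
    music_like = [[1.0, 1.0]] * 4                                     # steady energy
    print(lster(speech_like), lster(music_like))
```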
- the audio feature of spectrum flux: in a piece of audio segment, the energy of the speech signal varies quickly, so the sum of the frequency distance between the adjacent frames in the segment is bigger for the speech signal.
- the change in the frequency of the music signal is usually slower, so the sum of the frequency distance between the adjacent frames is smaller. Therefore, the spectrum flux can be used to discriminate the speech and the music signal.
- the audio feature of likelihood ratio crossing rate: the waveform of the likelihood ratio obtained in the AAD step can be used to tell the speech and the music apart by observing the damping characteristics.
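The spectrum flux can be sketched in Python as the sum of distances between the spectra of adjacent frames. The use of the Euclidean distance on precomputed magnitude spectra, and the name `spectrum_flux`, are assumptions of this sketch; other definitions use log spectra or an average instead of a sum.

```python
import math


def spectrum_flux(spectra):
    """Sum of Euclidean distances between magnitude spectra of adjacent frames.

    `spectra` is a list of per-frame magnitude spectra of equal length.
    """
    total = 0.0
    for previous, current in zip(spectra, spectra[1:]):
        total += math.sqrt(sum((c - p) ** 2 for p, c in zip(previous, current)))
    return total


if __name__ == "__main__":
    changing = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # spectrum changes every frame
    steady = [[1.0, 0.0]] * 3                          # spectrum stays the same
    print(spectrum_flux(changing), spectrum_flux(steady))
```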
- the speech signal has more frames of low energy than the music signal does.
- the speech and the music signal are not easily discriminated in the way of calculating the energy in time domain. Therefore, the audio feature of likelihood ratio crossing rate is derived in frequency domain.
- the likelihood ratio waveform of each frame obtained in the AAD step is used and the sum of the crossing rate of the likelihood ratio waveform compared to two thresholds is calculated. Generally speaking, the crossing rate in speech is higher than in music.
- FIG. 5 and FIG. 6 illustrate diagrams of the likelihood ratio crossing rate of the music and the speech signal, respectively.
- one second is the smallest analyzing unit for each segment; eight one-second windows and five one-second windows are illustrated, respectively.
- the mean and the two thresholds of the likelihood ratio in each window are computed.
- the mean of the likelihood ratio is denoted by the upper line, and the middle line and the lower line represent the two thresholds with one third the mean of the likelihood ratio and one ninth the mean of the likelihood ratio, respectively.
- the sum of the crossing rate of the likelihood ratio compared to the two thresholds is computed to discriminate between the music and the speech signals. From FIG. 5 and FIG. 6 , the crossing rate of the likelihood ratio to the two thresholds of the music part in FIG. 5 is smaller than that of the speech part in FIG. 6 .
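The likelihood ratio crossing rate for one window can be sketched in Python as follows, using the two thresholds at one third and one ninth of the window mean. Counting a crossing whenever consecutive values fall on opposite sides of a threshold, returning a raw count rather than a normalized rate, and the function names are assumptions of this sketch.

```python
def crossing_count(values, threshold):
    """Count how often consecutive values fall on opposite sides of `threshold`."""
    count = 0
    for previous, current in zip(values, values[1:]):
        if (previous > threshold) != (current > threshold):
            count += 1
    return count


def lrcr(ratios):
    """Sum of crossings of the likelihood ratio waveform at mean/3 and mean/9."""
    mean = sum(ratios) / len(ratios)
    return crossing_count(ratios, mean / 3) + crossing_count(ratios, mean / 9)


if __name__ == "__main__":
    speech_like = [9.0, 0.0, 9.0, 0.0, 9.0, 0.0]   # ratio drops to zero in pauses
    music_like = [5.0] * 6                          # ratio stays well above thresholds
    print(lrcr(speech_like), lrcr(music_like))
```

A window would then be labelled speech when the returned sum exceeds a predetermined value and music otherwise.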
- FIG. 7 illustrates an example of the smoothing step according to the preferred embodiment of the present invention.
- a smoothing step is applied to the audio features to raise the discrimination rate of the speech and the music frames.
- a rectangular window is used to perform the convolution process on the audio feature sequences obtained.
- the difference between before and after the smoothing step on the waveform of the music and the speech frames is shown in FIG. 7 .
- the audio feature sequences are irregular.
- the values of the features in speech segments are supposed to be high, but some of them are not as expected; the same is true of the music segments.
- the circles in FIG. 7 point out two examples of them.
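The smoothing step can be illustrated with a moving average, which is a convolution with a normalized rectangular window. The window width, the truncation of the window at the sequence edges and the name `smooth` are assumptions of this Python sketch.

```python
def smooth(values, width=5):
    """Moving average over `width` samples (rectangular window, odd width).

    The output has the same length as the input; near the edges the window
    is truncated and the average is taken over the samples that exist.
    """
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


if __name__ == "__main__":
    feature = [1.0, 1.0, 0.0, 1.0, 1.0]   # one outlier inside a run of high values
    print(smooth(feature, width=3))
```

After smoothing, the isolated low value is pulled toward its neighbours, which is the effect the smoothing step relies on to raise the discrimination rate.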
- the classifier is a KNN based classifier to classify the speech and the music types.
- the signal belongs to the type (the speech or the music) which has the most training data in the nearest k training data in the codebook.
- other classifiers may also be used, such as a Gaussian mixture model (GMM) classifier, a hidden Markov model (HMM) classifier and a multi-layer perceptron (MLP) classifier.
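A K-nearest neighbor decision over a labelled codebook can be sketched in Python as follows. The feature ordering (LSTER, SF, LRCR), the Euclidean distance, the value of k and the sample codebook entries are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(feature, codebook, k=3):
    """Return the majority label among the k codebook entries nearest to `feature`.

    `codebook` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(codebook, key=lambda item: math.dist(feature, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical codebook: (LSTER, SF, LRCR) vectors with their labels.
    codebook = [
        ((0.60, 10.0, 12.0), "speech"),
        ((0.50, 9.0, 11.0), "speech"),
        ((0.10, 2.0, 1.0), "music"),
        ((0.05, 1.0, 2.0), "music"),
        ((0.20, 3.0, 2.0), "music"),
    ]
    print(knn_classify((0.55, 9.5, 11.0), codebook))
    print(knn_classify((0.10, 2.0, 2.0), codebook))
```

In practice the features would usually be scaled to comparable ranges first, since an unscaled Euclidean distance is dominated by the feature with the largest values.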
- FIG. 8 illustrates an example of the audio signal segmentation algorithm according to the preferred embodiment of the present invention.
- the first figure in FIG. 8 is the original input audio signal.
- the second figure in FIG. 8 is the result after obtaining the likelihood ratio.
- the third figure in FIG. 8 is the result after the smoothing step, and the fourth figure in FIG. 8 is the result after segmenting the speech and the music segments. From FIG. 8 , the speech and the music segments can be obtained from the input audio signal after the audio activity detection step, the audio feature extraction step, the smoothing step and the segmentation step.
- one advantage of the present invention is that the present invention provides an audio signal segmentation algorithm suitable to be used in low SNR environments which works well in practical noisy environments.
- the present invention provides an audio signal segmentation algorithm which can be integrated into multimedia content analysis applications, multimedia data compression and audio recognition, and can be used in the front of the audio signal processing system to classify the signals and further to let the system discriminate and segment the speech and the audio signals.
- yet another advantage of the present invention is that the present invention provides an audio signal segmentation algorithm which can be used as an IP to be supplied to multimedia system chips.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
Description
where Λ is the likelihood ratio, L is the number of the bands, Xk denotes the kth Fourier coefficient in one of the frames, λN(k) is the noise variance of the Fourier coefficient and denotes the variance of the kth Fourier coefficient of the noise, η is the noise threshold, H0 denotes the result is the noise frame, and H1 denotes the result is the noisy audio frame.
Claims (17)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW95118143A | 2006-05-22 | ||
TW95118143 | 2006-05-22 | ||
TW095118143A TWI312982B (en) | 2006-05-22 | 2006-05-22 | Audio signal segmentation algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
US20070271093A1 US20070271093A1 (en) | 2007-11-22 |
US7774203B2 true US7774203B2 (en) | 2010-08-10 |
Family
ID=38713045
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/589,772 Expired - Fee Related US7774203B2 (en) | 2006-05-22 | 2006-10-31 | Audio signal segmentation algorithm |
Country Status (2)
Country | Link |
---|---|
US (1) | US7774203B2 (en) |
TW (1) | TWI312982B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110029306A1 (en) * | 2009-07-28 | 2011-02-03 | Electronics And Telecommunications Research Institute | Audio signal discriminating device and method |
US20110243349A1 (en) * | 2010-03-30 | 2011-10-06 | Cambridge Silicon Radio Limited | Noise Estimation |
US20130103398A1 (en) * | 2009-08-04 | 2013-04-25 | Nokia Corporation | Method and Apparatus for Audio Signal Classification |
US20140088974A1 (en) * | 2012-09-26 | 2014-03-27 | Motorola Mobility Llc | Apparatus and method for audio frame loss recovery |
US8712771B2 (en) * | 2009-07-02 | 2014-04-29 | Alon Konchitsky | Automated difference recognition between speaking sounds and music |
US20140257814A1 (en) * | 2013-03-05 | 2014-09-11 | Microsoft Corporation | Posterior-Based Feature with Partial Distance Elimination for Speech Recognition |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2100294A4 (en) * | 2006-12-27 | 2011-09-28 | Intel Corp | Method and apparatus for speech segmentation |
JP5130809B2 (en) * | 2007-07-13 | 2013-01-30 | ヤマハ株式会社 | Apparatus and program for producing music |
US20090043577A1 (en) * | 2007-08-10 | 2009-02-12 | Ditech Networks, Inc. | Signal presence detection using bi-directional communication data |
ATE552651T1 (en) * | 2008-12-24 | 2012-04-15 | Dolby Lab Licensing Corp | AUDIO SIGNAL LOUDNESS DETERMINATION AND MODIFICATION IN THE FREQUENCY DOMAIN |
CN101847412B (en) * | 2009-03-27 | 2012-02-15 | 华为技术有限公司 | Method and device for classifying audio signals |
US10224036B2 (en) * | 2010-10-05 | 2019-03-05 | Infraware, Inc. | Automated identification of verbal records using boosted classifiers to improve a textual transcript |
TWI412019B (en) | 2010-12-03 | 2013-10-11 | Ind Tech Res Inst | Sound event detecting module and method thereof |
CN104282315B (en) * | 2013-07-02 | 2017-11-24 | 华为技术有限公司 | Audio signal classification processing method, device and equipment |
CN104347067B (en) | 2013-08-06 | 2017-04-12 | 华为技术有限公司 | Audio signal classification method and device |
CN103413553B (en) * | 2013-08-20 | 2016-03-09 | 腾讯科技(深圳)有限公司 | Audio coding method, audio-frequency decoding method, coding side, decoding end and system |
US9685156B2 (en) * | 2015-03-12 | 2017-06-20 | Sony Mobile Communications Inc. | Low-power voice command detector |
CN108269567B (en) * | 2018-01-23 | 2021-02-05 | 北京百度网讯科技有限公司 | Method, apparatus, computing device, and computer-readable storage medium for generating far-field speech data |
CN109712641A (en) * | 2018-12-24 | 2019-05-03 | 重庆第二师范学院 | A kind of processing method of audio classification and segmentation based on support vector machines |
CN111724757A (en) * | 2020-06-29 | 2020-09-29 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio data processing method and related product |
CN112489692B (en) * | 2020-11-03 | 2024-10-18 | 北京捷通华声科技股份有限公司 | Voice endpoint detection method and device |
CN112735470B (en) * | 2020-12-28 | 2024-01-23 | 携程旅游网络技术(上海)有限公司 | Audio cutting method, system, equipment and medium based on time delay neural network |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6415253B1 (en) * | 1998-02-20 | 2002-07-02 | Meta-C Corporation | Method and apparatus for enhancing noise-corrupted speech |
US20020161576A1 (en) * | 2001-02-13 | 2002-10-31 | Adil Benyassine | Speech coding system with a music classifier |
US20060015333A1 (en) * | 2004-07-16 | 2006-01-19 | Mindspeed Technologies, Inc. | Low-complexity music detection algorithm and system |
US7558729B1 (en) * | 2004-07-16 | 2009-07-07 | Mindspeed Technologies, Inc. | Music detection for enhancing echo cancellation and speech coding |
2006
- 2006-05-22: TW application TW095118143A, patent TWI312982B, status: not active (IP Right Cessation)
- 2006-10-31: US application US11/589,772, patent US7774203B2, status: not active (Expired - Fee Related)
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8712771B2 (en) * | 2009-07-02 | 2014-04-29 | Alon Konchitsky | Automated difference recognition between speaking sounds and music |
US20110029306A1 (en) * | 2009-07-28 | 2011-02-03 | Electronics And Telecommunications Research Institute | Audio signal discriminating device and method |
US20130103398A1 (en) * | 2009-08-04 | 2013-04-25 | Nokia Corporation | Method and Apparatus for Audio Signal Classification |
US9215538B2 (en) * | 2009-08-04 | 2015-12-15 | Nokia Technologies Oy | Method and apparatus for audio signal classification |
US20110243349A1 (en) * | 2010-03-30 | 2011-10-06 | Cambridge Silicon Radio Limited | Noise Estimation |
US8666092B2 (en) * | 2010-03-30 | 2014-03-04 | Cambridge Silicon Radio Limited | Noise estimation |
US20140088974A1 (en) * | 2012-09-26 | 2014-03-27 | Motorola Mobility Llc | Apparatus and method for audio frame loss recovery |
US9123328B2 (en) * | 2012-09-26 | 2015-09-01 | Google Technology Holdings LLC | Apparatus and method for audio frame loss recovery |
US20140257814A1 (en) * | 2013-03-05 | 2014-09-11 | Microsoft Corporation | Posterior-Based Feature with Partial Distance Elimination for Speech Recognition |
US9336775B2 (en) * | 2013-03-05 | 2016-05-10 | Microsoft Technology Licensing, Llc | Posterior-based feature with partial distance elimination for speech recognition |
Also Published As
Publication number | Publication date |
---|---|
TW200744069A (en) | 2007-12-01 |
US20070271093A1 (en) | 2007-11-22 |
TWI312982B (en) | 2009-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7774203B2 (en) | Audio signal segmentation algorithm | |
US8155953B2 (en) | Method and apparatus for discriminating between voice and non-voice using sound model | |
US7263485B2 (en) | Robust detection and classification of objects in audio using limited training data | |
US7904295B2 (en) | Method for automatic speaker recognition with hurst parameter based features and method for speaker classification based on fractional brownian motion classifiers | |
EP1083541B1 (en) | A method and apparatus for speech detection | |
Ramírez et al. | An effective subband OSF-based VAD with noise reduction for robust speech recognition | |
US7177808B2 (en) | Method for improving speaker identification by determining usable speech | |
US6785645B2 (en) | Real-time speech and music classifier | |
US8428945B2 (en) | Acoustic signal classification system | |
Evangelopoulos et al. | Multiband modulation energy tracking for noisy speech detection | |
US6993481B2 (en) | Detection of speech activity using feature model adaptation | |
US20040064314A1 (en) | Methods and apparatus for speech end-point detection | |
Singh et al. | Speech in noisy environments: robust automatic segmentation, feature extraction, and hypothesis combination | |
US7120576B2 (en) | Low-complexity music detection algorithm and system | |
US7243063B2 (en) | Classifier-based non-linear projection for continuous speech segmentation | |
CN104835498A (en) | Voiceprint identification method based on multi-type combination characteristic parameters | |
Górriz et al. | Hard C-means clustering for voice activity detection | |
CA2492204A1 (en) | Similar speaking recognition method and system using linear and nonlinear feature extraction | |
CN103165127A (en) | Sound segmentation equipment, sound segmentation method and sound detecting system | |
Khoa | Noise robust voice activity detection | |
Schwartz et al. | The application of probability density estimation to text-independent speaker identification | |
Couvreur et al. | Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models | |
Górriz et al. | An effective cluster-based model for robust speech detection and speech recognition in noisy environments | |
US7630891B2 (en) | Voice region detection apparatus and method with color noise removal using run statistics | |
Rabaoui et al. | Using HMM-based classifier adapted to background noises with improved sounds features for audio surveillance application | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AS | Assignment | Owner name: NATIONAL CHENG KUNG UNIVERSITY, TAIWAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: WANG, JHING-FA; HUANG, CHAO-CHING; WU, DIAN-JIA; REEL/FRAME: 018483/0590. Effective date: 20061017 |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | FPAY | Fee payment | Year of fee payment: 4 |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2552). Year of fee payment: 8 |
| | FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
| | LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
| | STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| | FP | Lapsed due to failure to pay maintenance fee | Effective date: 20220810 |