US7774203B2 - Audio signal segmentation algorithm - Google Patents

Publication number
US7774203B2
Authority
US
United States
Prior art keywords
audio
segment
audio signal
music
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related, expires
Application number
US11/589,772
Other versions
US20070271093A1 (en
Inventor
Jhing-Fa Wang
Chao-Ching Huang
Dian-Jia Wu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Cheng Kung University NCKU
Original Assignee
National Cheng Kung University NCKU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Cheng Kung University NCKU
Assigned to NATIONAL CHENG KUNG UNIVERSITY: assignment of assignors' interest (see document for details). Assignors: HUANG, CHAO-CHING; WANG, JHING-FA; WU, DIAN-JIA
Publication of US20070271093A1
Application granted
Publication of US7774203B2
Legal status: Expired - Fee Related
Adjusted expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • Taiwan Application Serial Number 95118143 filed May 22, 2006, the disclosure of which is hereby incorporated by reference herein in its entirety.
  • the present invention relates to an audio signal segmentation algorithm, and more particularly, to an audio signal segmentation algorithm for use in noisy environments with a low signal-to-noise ratio (SNR).
  • SNR signal-to-noise ratio
  • the technique of segmenting speech/music signals from audio signals has become more important in multimedia applications.
  • the first kind of audio signal segmentation algorithm designs classifiers by directly extracting the features of the signals in the time domain or the frequency domain to discriminate and to further segment the speech and the music signals.
  • the features used in these kinds of audio signal segmentation algorithms are zero-crossing information, energy, pitch, Cepstral Coefficients, line spectral frequencies, 4 Hz modulation energy and some perception features, such as tone and rhythm.
  • These kinds of conventional techniques extract the features directly.
  • the windows used to analyze the signals tend to be large, so the segment boundaries are not precise enough.
  • fixed thresholds are used in most methods to determine the segmentation. Therefore, they cannot offer satisfactory results under low SNR noise environments.
  • the second kind of audio signal segmentation algorithm generates features needed in the classifiers by statistics, which is called the posterior probability based feature. Although better results can be obtained by getting features with statistics, a large number of training data samples are needed in these kinds of conventional techniques and they are also not suitable in actual environments.
  • the third kind of audio signal segmentation algorithm emphasizes the design of the classifier models.
  • the most commonly used methods are Bayesian information criterion, Gaussian likelihood ratio and a hidden Markov model (HMM) based classifier. These kinds of conventional techniques put stress on setting up effective classifiers. Although the methods are practical, some of them need larger computation, such as using the Bayesian information criterion, and some of them need to prepare a large number of training data samples in advance to set up the models needed, such as using Gaussian likelihood ratio and hidden Markov model (HMM). They are not good choices in practical applications.
  • one objective of the present invention is to provide an audio signal segmentation algorithm suitable to be used in low SNR environments which works well in practical noisy environments.
  • Another objective of the present invention is to provide an audio signal segmentation algorithm which can be used in the front of the audio signal processing system to classify the signals and further to let the system discriminate and segment the speech and the audio signals.
  • Still another objective of the present invention is to provide an audio signal segmentation algorithm in which plenty of training data is not needed and the ability of the features chosen to resist the noise is better.
  • Still another objective of the present invention is to provide an audio signal segmentation algorithm which can be used as an IP to be supplied to multimedia system chips.
  • the present invention provides an audio signal segmentation algorithm comprising the following steps. First, an audio signal is provided. Then, an audio activity detection (AAD) step is applied to divide the audio signal into at least one first audio segment and at least one second audio segment. Then, an audio feature extraction step is performed on the second audio segment to obtain a plurality of audio features of the second audio segment. A smoothing step is then applied to the second audio segment after the audio feature extraction step. Afterwards, a plurality of speech frames and a plurality of music frames are discriminated from the second audio segment wherein the speech frames and the music frames compose at least one speech segment and at least one music segment, respectively.
  • AAD audio activity detection
  • the first audio segment is a noise segment.
  • the audio activity detection step further comprises the following steps. First, the audio signal is divided into a plurality of frames. Then, a frequency transformation step is applied to signals in each of the frames to obtain a plurality of bands in each frame. Then, a likelihood computation step is performed on the bands and a noise parameter to obtain a likelihood ratio therebetween. Then, a comparison step is performed on the likelihood ratio and a noise threshold. If the noise threshold is greater than the likelihood ratio, the bands belong to a first frame, and if the likelihood ratio is greater than the noise threshold, the bands belong to a second frame, wherein the first frame belongs to the first audio segment and the second frame belongs to the second audio segment. When a distance between two adjacent second frames is smaller than a predetermined value, the two adjacent second frames are combined to compose the second audio segment.
  • the frequency transformation step is a Fourier Transform.
  • the noise parameter is a noise variance of the Fourier coefficient and is obtained by estimating a variance of a noise segment in the initial part of the audio signal.
  • the estimation of the noise threshold further comprises the following steps. First, a noise segment in the initial part of the audio signal is extracted. Then, the noise segment is mixed with one of a plurality of noiseless speech/music segments to a predetermined signal-to-noise ratio (SNR) to form a mixing audio segment. Then, the audio activity detection step is applied to the mixing audio segment to divide the mixing audio segment into at least one speech segment and at least one music segment by using a first threshold. Afterwards, the algorithm judges whether the speech segment and the music segment match the noiseless speech/music segment to obtain a result. If the result is yes, the first threshold is equal to the noise threshold.
  • the present invention further comprises mixing the noise segment and the other noiseless speech/music segments, respectively, and repeating the audio activity detection step and the judging step to obtain a plurality of thresholds, and then, comparing the thresholds with the first threshold to choose a smallest value as the noise threshold.
  • the audio features are selected from the group consisting of low short time energy rate (LSTER), spectrum flux (SF), likelihood ratio crossing rate (LRCR) and an arbitrary combination thereof.
  • the audio feature extraction step to extract the audio feature of likelihood ratio crossing rate further comprises computing a sum of a crossing rate of the waveform of the likelihood ratio to a plurality of predetermined thresholds by using the likelihood ratio of each frame. If the sum of the crossing rate is greater than a predetermined value, the likelihood ratio belongs to the speech segment, and if the sum of the crossing rate is smaller than the predetermined value, the likelihood ratio belongs to the music segment.
  • one of the predetermined thresholds is one third the mean of the likelihood ratio, and another one of the predetermined thresholds is one ninth the mean of the likelihood ratio.
  • the smoothing step further comprises performing a convolution process to the second audio segment after the audio feature extraction step and a window.
  • the window may be a rectangular window.
  • the step of discriminating the speech frames and the music frames from the second audio segment is based on a classifier, and the classifier is selected from the group consisting of a K-nearest neighbor (KNN) classifier, a Gaussian mixture model (GMM) classifier, a hidden Markov model (HMM) classifier and a multi-layer perceptron (MLP) classifier.
  • KNN K-nearest neighbor
  • GMM Gaussian mixture model
  • HMM hidden Markov model
  • MLP multi-layer perceptron
  • the preferred embodiment of the present invention further comprises segmenting the speech segment and the music segment from the second audio segment.
  • FIG. 1 illustrates a flow diagram of the audio signal segmentation algorithm according to the preferred embodiment of the present invention
  • FIG. 2 illustrates a flow diagram of the audio activity detection step according to the preferred embodiment of the present invention
  • FIG. 3 illustrates an example of the frame-merging process in the preferred embodiment of the present invention
  • FIG. 4 illustrates a flow diagram of the estimation of the noise threshold according to the preferred embodiment of the present invention
  • FIG. 5 illustrates a diagram of the likelihood ratio crossing rate of the music signal
  • FIG. 6 illustrates a diagram of the likelihood ratio crossing rate of the speech signal
  • FIG. 7 illustrates an example of the smoothing step according to the preferred embodiment of the present invention.
  • FIG. 8 illustrates an example of the audio signal segmentation algorithm according to the preferred embodiment of the present invention.
  • the present invention discloses an audio signal segmentation algorithm comprising the following steps. First, an audio signal is provided. Then, an audio activity detection (AAD) step is applied to divide the audio signal into at least one noise segment and at least one noisy audio segment. Then, multiple audio features are extracted from the noisy audio segment by a frame with fixed length in the audio feature extraction step. Afterwards, a smoothing step is applied to the audio features to raise the discrimination rate of the speech and the music frames. Then, a classifier is used to tell the speech and the music frames apart. Finally, the frames of the same kind are merged according to the result and the speech and the music segments are then segmented.
  • In order to make the illustration of the present invention more explicit and complete, the following description is stated with reference to FIGS. 1 through 8.
  • FIG. 1 illustrates a flow diagram of the audio signal segmentation algorithm according to the preferred embodiment of the present invention.
  • an audio signal is provided in step 102 .
  • an audio activity detection (AAD) step is applied to divide the audio signal into a noise segment 106 and a noisy audio segment 108 in step 104 .
  • an audio feature extraction step is performed on the noisy audio segment 108 , as shown in step 110 .
  • the audio feature extraction step extracts three kinds of audio features from the noisy audio segment 108 .
  • the audio features are low short time energy rate (LSTER), spectrum flux (SF), and likelihood ratio crossing rate (LRCR), respectively.
  • LSTER low short time energy rate
  • SF spectrum flux
  • LRCR likelihood ratio crossing rate
  • the likelihood ratio of each frame is used to compute the sum of the crossing rate in the waveform of the likelihood ratio compared to multiple predetermined thresholds. If the sum of the crossing rate is greater than a predetermined value, the likelihood ratio belongs to a speech segment, and if the sum of the crossing rate is smaller than the predetermined value, the likelihood ratio belongs to a music segment.
  • in step 112, the smoothing step convolves the result obtained with a window (such as a rectangular window) to raise the discrimination rate for the following step.
  • in step 114, a classifier is used to tell the speech and the music frames apart.
  • the speech frames and the music frames compose at least one speech segment and at least one music segment, respectively.
  • the frames of the same kind are merged according to the result and the speech and the music segments are then segmented.
  • the speech segment 116 and the music segment 118 are obtained.
  • the classifier is a KNN based classifier and it classifies the signals into different types in a codebook and further determines if the signals belong to speech or music. The following describes in detail the audio activity detection step used in the preferred embodiment of the present invention.
  • FIG. 2 illustrates a flow diagram of the audio activity detection step according to the preferred embodiment of the present invention.
  • the audio signal is divided into multiple frames in step 202 .
  • the length of each frame may be 30 ms.
  • a frequency transformation step is applied to the signals in each frame to obtain multiple bands in each frame in step 204 .
  • the frequency transformation step uses a Fourier Transform.
  • a likelihood computation step is performed on the bands and a noise parameter 208 to obtain a likelihood ratio between them in step 206 .
  • the noise parameter 208 is the noise variance of the Fourier coefficient and is obtained by estimating the variance of a noise segment in the initial part of the audio signal.
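The framing, frequency transformation and noise-parameter steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, frames are taken as non-overlapping with no window function, the transform is a plain DFT written out for self-containment (an FFT routine would normally be used), and the noise variance of each Fourier coefficient is estimated as the mean of |X_k|^2 over the leading noise frames, which assumes zero-mean noise coefficients.

```python
import cmath
import math


def split_frames(samples, sample_rate, frame_ms=30):
    """Cut a signal into non-overlapping frames of frame_ms milliseconds."""
    n = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


def power_spectrum(frame):
    """Return |X_k|^2 for k = 0..N/2 using a plain DFT (assumption: no windowing)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


def estimate_noise_variance(noise_frames):
    """Average |X_k|^2 over the noise frames, giving one variance per band."""
    spectra = [power_spectrum(f) for f in noise_frames]
    bands = len(spectra[0])
    return [sum(s[k] for s in spectra) / len(spectra) for k in range(bands)]
```

In this sketch the leading frames of the recording would be passed to `estimate_noise_variance`, and the resulting list plays the role of the noise parameter in the likelihood computation.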
  • a comparison step is performed between the likelihood ratio and the noise threshold 212 . If the likelihood ratio is smaller than the noise threshold, the bands belong to a noise frame 214 , and if the likelihood ratio is greater than the noise threshold, the bands belong to a noisy audio frame 216 .
  • the likelihood computation step and the comparison step are based on an equation of the form $\Lambda \underset{H_0}{\overset{H_1}{\gtrless}} \eta$, with $\Lambda$ computed from the $L$ Fourier coefficients $X_k$ and the noise variances $\lambda_N(k)$, where:
  • $\Lambda$ is the likelihood ratio
  • $L$ is the number of the bands
  • $X_k$ denotes the kth Fourier coefficient in one of the frames
  • $\lambda_N(k)$ is the noise variance of the Fourier coefficient and denotes the variance of the kth Fourier coefficient of the noise
  • $\eta$ is the noise threshold
  • $H_0$ denotes the result is the noise frame
  • $H_1$ denotes the result is the noisy audio frame.
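The likelihood computation and comparison can be sketched as below. The exact expression for the likelihood ratio is not reproduced in this text, so the sketch assumes a commonly used statistical form from voice activity detection: the mean over the L bands of g - log(g) - 1, where g is the band power |X_k|^2 divided by the noise variance of that band. That form, the function names and the small `floor` guard against division by zero are all assumptions made for illustration.

```python
import math


def likelihood_ratio(power, noise_var, floor=1e-12):
    """Mean over L bands of g - log(g) - 1, with g = |X_k|^2 / noise variance.

    Assumed form: the source text does not spell out the expression.
    """
    total = 0.0
    for p, lam in zip(power, noise_var):
        g = max(p, floor) / max(lam, floor)
        total += g - math.log(g) - 1.0
    return total / len(power)


def classify_frame(power, noise_var, eta):
    """Return 'H1' (noisy audio frame) if the ratio exceeds eta, else 'H0' (noise frame)."""
    return "H1" if likelihood_ratio(power, noise_var) > eta else "H0"
```

With this form, a frame whose band powers equal the noise variances scores exactly zero, and the score grows as the frame departs from the noise statistics.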
  • the frames are merged by checking programmatically whether the distance between two adjacent detected frames is too small. If it is, the two frames are merged into one; otherwise, they remain two separate frames. In other words, when the distance between two adjacent noisy audio frames is smaller than a predetermined value, the two adjacent noisy audio frames are combined to compose the noisy audio segment 220.
  • FIG. 3 illustrates an example of the frame-merging process in the preferred embodiment of the present invention. As the circle in FIG. 3 shows, when the distance between the adjacent frames is too small, they are merged into a single frame.
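The frame-merging rule can be sketched as follows. The sketch assumes that the distance between two detected frames is measured as the number of noise frames lying between them, and that `max_gap` stands in for the predetermined value; both the name and the unit are illustrative choices, not taken from the patent.

```python
def merge_active_frames(active, max_gap):
    """Group active frame indices into (start, end) segments.

    Two detected frames are joined when fewer than max_gap frames separate them.
    """
    segments = []
    for i, is_active in enumerate(active):
        if not is_active:
            continue
        if segments and i - segments[-1][1] - 1 < max_gap:
            segments[-1][1] = i  # close enough: extend the current segment
        else:
            segments.append([i, i])  # too far apart: start a new segment
    return [tuple(s) for s in segments]
```

For example, `merge_active_frames([1, 1, 0, 1, 0, 0, 0, 1], max_gap=2)` bridges the single-frame gap at index 2 but not the three-frame gap that follows, giving the segments (0, 3) and (7, 7).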
  • the noise threshold $\eta$ can be estimated as different values according to different environments rather than a fixed value in order to make the audio signal segmentation algorithm of the present invention suitable for different environments. The following describes in detail the estimation of the noise threshold.
  • FIG. 4 illustrates a flow diagram of the estimation of the noise threshold according to the preferred embodiment of the present invention.
  • a noise segment 402 in the initial part of the audio signal is extracted.
  • the noise segment 402 is mixed with a noiseless speech/music segment 404 to a predetermined signal-to-noise ratio (SNR) to form a mixing audio segment 406 .
  • the noise parameter 410 estimated from the noise segment 402 is used to perform the audio activity detection step on the mixing audio segment 406 , as shown in step 408 .
  • the mixing audio segment 406 is first divided into multiple frames, and then, a frequency transformation step is applied to signals in each frame to obtain multiple bands in each frame.
  • a likelihood ratio between the bands and the noise parameter is computed, and the mixing audio segment 406 is divided into at least one speech segment and at least one music segment using a first threshold.
  • a judging step is performed to judge if the speech segment and the music segment are correctly segmented to match the noiseless speech/music segment 404 and obtain a result. If the result is no, the first threshold is adjusted and the audio activity detection step and the judging step are repeated on the mixing audio segment 406 , as shown in step 414 . If the result is yes, the first threshold is equal to the noise threshold, as shown in step 416 .
  • the estimation of the noise threshold in the preferred embodiment of the present invention first extracts a noise segment from the initial part of the audio signal and then mixes it with prepared training data (a noiseless speech/music segment) at a predetermined signal-to-noise ratio. Since the training data is prepared in advance, the location of the voice in the training data is already known, so the signal-to-noise ratio between the training data and the noise segment can be adjusted. Generally, if the lowest SNR the system is expected to handle is 5 dB, the SNR of the mixing audio segment can be set to 3 dB to estimate the threshold; it only needs to be lower than 5 dB. Then, the audio activity detection step is performed on the mixing audio segment.
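Mixing a noise segment with a clean training clip at a chosen SNR can be sketched as below. As a simplifying assumption, the signal power is measured over the whole clip, whereas the text notes that the known voice locations in the training data are what make the SNR adjustable; the function name and return shape are hypothetical.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the clean-to-noise power ratio equals snr_db, then add it.

    Returns the mixed samples and the gain applied to the noise.
    """
    n = min(len(clean), len(noise))
    clean, noise = clean[:n], noise[:n]
    p_clean = sum(x * x for x in clean) / n
    p_noise = sum(x * x for x in noise) / n
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * v for c, v in zip(clean, noise)], gain
```

Calling `mix_at_snr(clean, noise, 3.0)` would produce the 3 dB mixture used in the example where the lowest SNR handled by the system is 5 dB.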
  • a Fourier transform is applied to the mixing audio segment in frames of 30 ms. Then, the likelihood ratio is computed, and an initial threshold of 0 is used for the decision. If the threshold detects all of the voice parts in the training data, it is raised by 0.2, and this is repeated until the highest threshold that still detects all of the voice segments is found. There are t training clips, so the procedure is run t times. However, each training clip is relatively short, so this does not take much time. When all of the training data has been processed, t thresholds have been obtained, and the smallest of them is chosen as the threshold used in the system.
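The threshold search described above can be sketched as follows. Each training clip is represented here by a hypothetical callback `detects_all_voice(eta)` that runs the audio activity detection at threshold `eta` and reports whether every known voice part was found; that callback, the function names and the `limit` safety bound are assumptions added for illustration.

```python
def highest_passing_threshold(detects_all_voice, start=0.0, step=0.2, limit=100.0):
    """Raise the threshold by `step` while the detector still finds every voice part."""
    eta = start
    if not detects_all_voice(eta):
        return None  # even the initial threshold misses some voice
    while eta + step <= limit and detects_all_voice(eta + step):
        eta += step
    return eta


def estimate_noise_threshold(detectors):
    """Run the search once per training clip and keep the smallest result."""
    found = [highest_passing_threshold(d) for d in detectors]
    found = [eta for eta in found if eta is not None]
    return min(found) if found else None
```

Taking the minimum over the t clips is the conservative choice: it is the largest threshold that still detects the voice in every training clip.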
  • the audio signal inputted is divided into a noise segment and a noisy audio segment. Then, the audio feature extraction step is performed on the noisy audio segment to obtain audio features of the noisy audio segment.
  • Three audio features are used in discriminating the speech signals and the music signals in the preferred embodiment of the present invention. Each audio feature is defined in a length of about one second, and the length of one second is also the smallest unit in the discrimination in the preferred embodiment of the present invention.
  • These three audio features are low short time energy rate (LSTER), spectrum flux (SF), and likelihood ratio crossing rate (LRCR), respectively. They are described as follows.
  • the audio feature of low short time energy rate: within a piece of audio signal, the frame energy of the speech signal varies more than that of the music signal owing to the pitch, so the speech signal and the music signal can be discriminated by calculating the ratio of low-energy frames.
  • the audio feature of spectrum flux: within an audio segment, the energy of the speech signal changes rapidly, so the sum of the frequency distances between adjacent frames is larger for the speech signal.
  • the frequency content of the music signal usually changes more slowly, so the sum of the frequency distances between adjacent frames is smaller. Therefore, the spectrum flux can be used to discriminate the speech and the music signals.
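The two features just described can be sketched as below. The text does not give formulas, so the sketch assumes common definitions: the low short time energy rate is taken as the fraction of frames in a window whose energy is below half the window's mean energy, and the spectrum flux as the sum of Euclidean distances between the magnitude spectra of adjacent frames. The `ratio=0.5` default and the function names are assumptions.

```python
import math


def lster(frame_energies, ratio=0.5):
    """Fraction of frames whose energy is below ratio * mean energy of the window."""
    mean = sum(frame_energies) / len(frame_energies)
    low = sum(1 for e in frame_energies if e < ratio * mean)
    return low / len(frame_energies)


def spectrum_flux(spectra):
    """Sum of Euclidean distances between the spectra of adjacent frames."""
    return sum(math.dist(a, b) for a, b in zip(spectra, spectra[1:]))
```

Both functions would be applied to the frames of a window of about one second, the smallest discrimination unit in the preferred embodiment.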
  • the audio feature of likelihood ratio crossing rate: the waveform of the likelihood ratio obtained in the AAD step can be used to tell the speech and the music apart by observing its damping characteristics.
  • the speech signal has more frames of low energy than the music signal does.
  • the speech and the music signal are not easily discriminated in the way of calculating the energy in time domain. Therefore, the audio feature of likelihood ratio crossing rate is derived in frequency domain.
  • the likelihood ratio waveform of each frame obtained in the AAD step is used and the sum of the crossing rate of the likelihood ratio waveform compared to two thresholds is calculated. Generally speaking, the crossing rate in speech is higher than in music.
  • FIG. 5 and FIG. 6 illustrate diagrams of the likelihood ratio crossing rate of the music and the speech signal, respectively.
  • one second is the smallest analysis unit for each segment; FIG. 5 and FIG. 6 show eight and five one-second windows, respectively.
  • the mean and the two thresholds of the likelihood ratio in each window are computed.
  • the mean of the likelihood ratio is denoted by the upper line, and the middle line and the lower line represent the two thresholds with one third the mean of the likelihood ratio and one ninth the mean of the likelihood ratio, respectively.
  • the sum of the crossing rate of the likelihood ratio compared to the two thresholds is computed to discriminate between the music and the speech signals. From FIG. 5 and FIG. 6 , the crossing rate of the likelihood ratio to the two thresholds of the music part in FIG. 5 is smaller than that of the speech part in FIG. 6 .
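The likelihood ratio crossing rate can be sketched as follows for a single one-second window. The two thresholds are one third and one ninth of the window's mean likelihood ratio, as stated above; counting a crossing whenever two consecutive values fall on opposite sides of a threshold, and reporting a raw count rather than a normalized rate, are assumptions of this sketch, as are the function names.

```python
def crossings(values, level):
    """Count how often consecutive values fall on opposite sides of level."""
    return sum(1 for a, b in zip(values, values[1:]) if (a > level) != (b > level))


def lrcr(likelihood_ratios):
    """Sum of crossing counts against mean/3 and mean/9 over one window."""
    mean = sum(likelihood_ratios) / len(likelihood_ratios)
    return crossings(likelihood_ratios, mean / 3) + crossings(likelihood_ratios, mean / 9)
```

A waveform that repeatedly drops to near zero, as speech does between syllables, crosses both thresholds often, while a waveform that stays near its mean does not cross them at all.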
  • FIG. 7 illustrates an example of the smoothing step according to the preferred embodiment of the present invention.
  • a smoothing step is applied to the audio features to raise the discrimination rate of the speech and the music frames.
  • a rectangular window is used to perform the convolution process on the audio feature sequences obtained.
  • the difference between before and after the smoothing step on the waveform of the music and the speech frames is shown in FIG. 7 .
  • the audio feature sequences are irregular.
  • the values of the features in speech segments are supposed to be high, but some of them are not as expected. The same is true of the music segments.
  • the circles in FIG. 7 point out two examples of them.
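The smoothing step can be sketched as a convolution of a feature sequence with a rectangular window. The window width, the normalization by the window length and the handling of the sequence edges (the window is truncated there) are assumptions, since the text only states that a rectangular window is convolved with the feature sequences.

```python
def smooth(features, width=3):
    """Convolve a feature sequence with a normalized rectangular window.

    Assumes an odd width; the window is truncated at the sequence edges so the
    output has the same length as the input.
    """
    half = width // 2
    out = []
    for i in range(len(features)):
        lo, hi = max(0, i - half), min(len(features), i + half + 1)
        out.append(sum(features[lo:hi]) / (hi - lo))
    return out
```

An isolated outlier such as the 0 in `[1, 1, 0, 1, 1]` is pulled toward its neighbours (to 2/3 with `width=3`), which is the effect that keeps a single irregular value from flipping the classification of a frame.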
  • the classifier is a KNN based classifier to classify the speech and the music types.
  • the signal belongs to the type (the speech or the music) which has the most training data in the nearest k training data in the codebook.
  • other classifiers may also be used, such as a Gaussian mixture model (GMM) classifier, a hidden Markov model (HMM) classifier and a multi-layer perceptron (MLP) classifier.
  • GMM Gaussian mixture model
  • HMM hidden Markov model
  • MLP multi-layer perceptron
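The KNN decision rule described above can be sketched as a majority vote among the k nearest codebook entries. The codebook layout (a list of feature-vector and label pairs), the Euclidean distance, the default `k=3` and the function name are assumptions made for illustration; the feature vector is imagined to hold the smoothed LSTER, SF and LRCR values.

```python
import math
from collections import Counter


def knn_classify(sample, codebook, k=3):
    """Label a sample by majority vote among its k nearest codebook entries."""
    nearest = sorted(codebook, key=lambda entry: math.dist(sample, entry[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

An odd k avoids ties when there are only two classes, speech and music.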
  • FIG. 8 illustrates an example of the audio signal segmentation algorithm according to the preferred embodiment of the present invention.
  • the first figure in FIG. 8 is the original input audio signal.
  • the second figure in FIG. 8 is the result after obtaining the likelihood ratio.
  • the third figure in FIG. 8 is the result after the smoothing step, and the fourth figure in FIG. 8 is the result after segmenting the speech and the music segments. From FIG. 8 , the speech and the music segments can be obtained from the input audio signal after the audio activity detection step, the audio feature extraction step, the smoothing step and the segmentation step.
  • one advantage of the present invention is that the present invention provides an audio signal segmentation algorithm suitable to be used in low SNR environments which works well in practical noisy environments.
  • the present invention provides an audio signal segmentation algorithm which can be integrated into multimedia content analysis applications, multimedia data compression and audio recognition, and can be used in the front of the audio signal processing system to classify the signals and further to let the system discriminate and segment the speech and the audio signals.
  • yet another advantage of the present invention is that the present invention provides an audio signal segmentation algorithm which can be used as an IP to be supplied to multimedia system chips.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The present invention discloses an audio signal segmentation algorithm comprising the following steps. First, an audio signal is provided. Then, an audio activity detection (AAD) step is applied to divide the audio signal into at least one noise segment and at least one noisy audio segment. Then, an audio feature extraction step is used on the noisy audio segment to obtain multiple audio features. Then, a smoothing step is applied. Then, multiple speech frames and multiple music frames are discriminated. The speech frames and the music frames compose at least one speech segment and at least one music segment. Finally, the speech segment and the music segment are segmented from the noisy audio segment.

Description

RELATED APPLICATIONS
The present application is based on, and claims priority from, Taiwan Application Serial Number 95118143, filed May 22, 2006, the disclosure of which is hereby incorporated by reference herein in its entirety.
FIELD OF THE INVENTION
The present invention relates to an audio signal segmentation algorithm, and more particularly, to an audio signal segmentation algorithm for use in noisy environments with a low signal-to-noise ratio (SNR).
BACKGROUND OF THE INVENTION
The technique of segmenting speech/music signals from audio signals has become more important in multimedia applications. There are three kinds of audio signal segmentation algorithms at present. The first kind designs classifiers by directly extracting features of the signals in the time domain or the frequency domain to discriminate, and then segment, the speech and the music signals. The features used in these algorithms are zero-crossing information, energy, pitch, cepstral coefficients, line spectral frequencies, 4 Hz modulation energy and some perceptual features, such as tone and rhythm. These conventional techniques extract the features directly. However, the windows used to analyze the signals tend to be large, so the segment boundaries are not precise enough. Furthermore, most of these methods use fixed thresholds to determine the segmentation. Therefore, they cannot offer satisfactory results in noisy environments with a low SNR.
The second kind of audio signal segmentation algorithm generates features needed in the classifiers by statistics, which is called the posterior probability based feature. Although better results can be obtained by getting features with statistics, a large number of training data samples are needed in these kinds of conventional techniques and they are also not suitable in actual environments.
The third kind of audio signal segmentation algorithm emphasizes the design of the classifier models. The most commonly used methods are Bayesian information criterion, Gaussian likelihood ratio and a hidden Markov model (HMM) based classifier. These kinds of conventional techniques put stress on setting up effective classifiers. Although the methods are practical, some of them need larger computation, such as using the Bayesian information criterion, and some of them need to prepare a large number of training data samples in advance to set up the models needed, such as using Gaussian likelihood ratio and hidden Markov model (HMM). They are not good choices in practical applications.
SUMMARY OF THE INVENTION
Therefore, one objective of the present invention is to provide an audio signal segmentation algorithm suitable to be used in low SNR environments which works well in practical noisy environments.
Another objective of the present invention is to provide an audio signal segmentation algorithm which can be used in the front of the audio signal processing system to classify the signals and further to let the system discriminate and segment the speech and the audio signals.
Still another objective of the present invention is to provide an audio signal segmentation algorithm in which plenty of training data is not needed and the ability of the features chosen to resist the noise is better.
Still another objective of the present invention is to provide an audio signal segmentation algorithm which can be used as an IP to be supplied to multimedia system chips.
According to the aforementioned objectives, the present invention provides an audio signal segmentation algorithm comprising the following steps. First, an audio signal is provided. Then, an audio activity detection (AAD) step is applied to divide the audio signal into at least one first audio segment and at least one second audio segment. Then, an audio feature extraction step is performed on the second audio segment to obtain a plurality of audio features of the second audio segment. A smoothing step is then applied to the second audio segment after the audio feature extraction step. Afterwards, a plurality of speech frames and a plurality of music frames are discriminated from the second audio segment wherein the speech frames and the music frames compose at least one speech segment and at least one music segment, respectively.
According to the preferred embodiment of the present invention, the first audio segment is a noise segment. The audio activity detection step further comprises the following steps. First, the audio signal is divided into a plurality of frames. Then, a frequency transformation step is applied to signals in each of the frames to obtain a plurality of bands in each frame. Then, a likelihood computation step is performed on the bands and a noise parameter to obtain a likelihood ratio therebetween. Then, a comparison step is performed on the likelihood ratio and a noise threshold. If the noise threshold is greater than the likelihood ratio, the bands belong to a first frame, and if the likelihood ratio is greater than the noise threshold, the bands belong to a second frame, wherein the first frame belongs to the first audio segment and the second frame belongs to the second audio segment. When a distance between two adjacent second frames is smaller than a predetermined value, the two adjacent second frames are combined to compose the second audio segment.
According to the preferred embodiment of the present invention, the frequency transformation step is a Fourier Transform. The noise parameter is a noise variance of the Fourier coefficient and is obtained by estimating a variance of a noise segment in the initial part of the audio signal.
According to the preferred embodiment of the present invention, the estimation of the noise threshold further comprises the following steps. First, a noise segment in the initial part of the audio signal is extracted. Then, the noise segment is mixed with one of a plurality of noiseless speech/music segments to a predetermined signal-to-noise ratio (SNR) to form a mixing audio segment. Then, the audio activity detection step is applied to the mixing audio segment to divide the mixing audio segment into at least one speech segment and at least one music segment by using a first threshold. Afterwards, the algorithm judges whether the speech segment and the music segment match the noiseless speech/music segment and obtains a result. If the result is yes, the first threshold is equal to the noise threshold. If the result is no, the first threshold is adjusted and the audio activity detection step and the judging step are repeated on the mixing audio segment. In the preferred embodiment of the present invention, the present invention further comprises mixing the noise segment and the other noiseless speech/music segments, respectively, and repeating the audio activity detection step and the judging step to obtain a plurality of thresholds, and then, comparing the thresholds with the first threshold to choose a smallest value as the noise threshold.
According to the preferred embodiment of the present invention, the audio features are selected from the group consisting of low short time energy rate (LSTER), spectrum flux (SF), likelihood ratio crossing rate (LRCR) and an arbitrary combination thereof. The audio feature extraction step to extract the audio feature of likelihood ratio crossing rate further comprises computing a sum of a crossing rate of the waveform of the likelihood ratio to a plurality of predetermined thresholds by using the likelihood ratio of each frame. If the sum of the crossing rate is greater than a predetermined value, the likelihood ratio belongs to the speech segment, and if the sum of the crossing rate is smaller than the predetermined value, the likelihood ratio belongs to the music segment. In the preferred embodiment of the present invention, one of the predetermined thresholds is one third the mean of the likelihood ratio, and another one of the predetermined thresholds is one ninth the mean of the likelihood ratio.
According to the preferred embodiment of the present invention, the smoothing step further comprises performing a convolution process to the second audio segment after the audio feature extraction step and a window. The window may be a rectangular window. The step of discriminating the speech frames and the music frames from the second audio segment is based on a classifier, and the classifier is selected from the group consisting of a K-nearest neighbor (KNN) classifier, a Gaussian mixture model (GMM) classifier, a hidden Markov model (HMM) classifier and a multi-layer perceptron (MLP) classifier. After discriminating the speech frames and the music frames from the second audio segment, the speech frames and the music frames are respectively combined to form the speech segment and the music segment. The preferred embodiment of the present invention further comprises segmenting the speech segment and the music segment from the second audio segment.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:
FIG. 1 illustrates a flow diagram of the audio signal segmentation algorithm according to the preferred embodiment of the present invention;
FIG. 2 illustrates a flow diagram of the audio activity detection step according to the preferred embodiment of the present invention;
FIG. 3 illustrates an example of the frame-merging process in the preferred embodiment of the present invention;
FIG. 4 illustrates a flow diagram of the estimation of the noise threshold according to the preferred embodiment of the present invention;
FIG. 5 illustrates a diagram of the likelihood ratio crossing rate of the music signal;
FIG. 6 illustrates a diagram of the likelihood ratio crossing rate of the speech signal;
FIG. 7 illustrates an example of the smoothing step according to the preferred embodiment of the present invention; and
FIG. 8 illustrates an example of the audio signal segmentation algorithm according to the preferred embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
The present invention discloses an audio signal segmentation algorithm comprising the following steps. First, an audio signal is provided. Then, an audio activity detection (AAD) step is applied to divide the audio signal into at least one noise segment and at least one noisy audio segment. Then, multiple audio features are extracted from the noisy audio segment by a frame with fixed length in the audio feature extraction step. Afterwards, a smoothing step is applied to the audio features to raise the discrimination rate of the speech and the music frames. Then, a classifier is used to tell the speech and the music frames apart. Finally, the frames of the same kind are merged according to the result and the speech and the music segments are then segmented.
In order to make the illustration of the present invention more explicit and complete, the following description is stated with reference to FIGS. 1 through 8.
Refer to FIG. 1. FIG. 1 illustrates a flow diagram of the audio signal segmentation algorithm according to the preferred embodiment of the present invention. First, an audio signal is provided in step 102. Then, an audio activity detection (AAD) step is applied to divide the audio signal into a noise segment 106 and a noisy audio segment 108 in step 104. Then, an audio feature extraction step is performed on the noisy audio segment 108, as shown in step 110. In the preferred embodiment of the present invention, the audio feature extraction step extracts three kinds of audio features from the noisy audio segment 108. The audio features are low short time energy rate (LSTER), spectrum flux (SF), and likelihood ratio crossing rate (LRCR), respectively. The likelihood ratio of each frame is used to compute the sum of the crossing rate in the waveform of the likelihood ratio compared to multiple predetermined thresholds. If the sum of the crossing rate is greater than a predetermined value, the likelihood ratio belongs to a speech segment, and if the sum of the crossing rate is smaller than the predetermined value, the likelihood ratio belongs to a music segment.
Then, in step 112, a convolution process is performed on the result obtained and a window (such as a rectangular window) in the smoothing step to raise the discrimination rate for the following step. Then, in step 114, a classifier is used to tell the speech and the music frames apart. The speech frames and the music frames compose at least one speech segment and at least one music segment, respectively. Then, the frames of the same kind are merged according to the result and the speech and the music segments are then segmented. Finally, the speech segment 116 and the music segment 118 are obtained. In the preferred embodiment of the present invention, the classifier is a KNN based classifier and it classifies the signals into different types in a codebook and further determines if the signals belong to speech or music. The following describes in detail the audio activity detection step used in the preferred embodiment of the present invention.
Refer to FIG. 2. FIG. 2 illustrates a flow diagram of the audio activity detection step according to the preferred embodiment of the present invention. First, the audio signal is divided into multiple frames in step 202. The length of each frame may be 30 ms. Then a frequency transformation step is applied to the signals in each frame to obtain multiple bands in each frame in step 204. In the preferred embodiment of the present invention, the frequency transformation step uses a Fourier Transform. Then, a likelihood computation step is performed on the bands and a noise parameter 208 to obtain a likelihood ratio between them in step 206. The noise parameter 208 is the noise variance of the Fourier coefficient and is obtained by estimating the variance of a noise segment in the initial part of the audio signal.
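The framing, transformation and noise-parameter steps can be sketched as follows. This is only a minimal illustration, assuming NumPy, non-overlapping 30 ms frames and a 16 kHz sample rate. The function names and the `noise_frames` count (how many leading frames are treated as the noise-only initial part) are hypothetical and are not specified in the text.

```python
import numpy as np

def frame_power_spectra(signal, sample_rate=16000, frame_ms=30):
    """Split a 1-D signal into non-overlapping frames and return |X_k|^2 per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = np.reshape(np.asarray(signal[:n_frames * frame_len], dtype=float),
                        (n_frames, frame_len))
    spectra = np.fft.rfft(frames, axis=1)   # one row of Fourier coefficients per frame
    return np.abs(spectra) ** 2

def estimate_noise_variance(power, noise_frames=10):
    """Per-band mean power of the first frames, assumed here to contain noise only."""
    return power[:noise_frames].mean(axis=0)
```

With these defaults, one second of audio yields 33 frames of 241 bands each. Overlapping frames or a tapering window could be substituted without changing the later steps.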
Then, in step 210, a comparison step is performed between the likelihood ratio and the noise threshold 212. If the likelihood ratio is smaller than the noise threshold, the bands belong to a noise frame 214, and if the likelihood ratio is greater than the noise threshold, the bands belong to a noisy audio frame 216. In the preferred embodiment of the present invention, the likelihood computation step and the comparison step are based on the equation:
$$\Lambda = \frac{1}{L}\sum_{k=0}^{L-1}\left\{\frac{|X_k|^2}{\lambda_N(k)} - \log\frac{|X_k|^2}{\lambda_N(k)} - 1\right\} \underset{H_0}{\overset{H_1}{\gtrless}} \eta$$
where Λ is the likelihood ratio, L is the number of the bands, Xk denotes the kth Fourier coefficient in one of the frames, λN(k) is the noise variance of the Fourier coefficient and denotes the variance of the kth Fourier coefficient of the noise, η is the noise threshold, H0 denotes the result is the noise frame, and H1 denotes the result is the noisy audio frame.
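The likelihood computation and comparison steps defined by the equation can be sketched as below, assuming NumPy arrays of per-band power. The function names and the small `eps` guard against division by zero and log of zero are assumptions added for illustration.

```python
import numpy as np

def likelihood_ratio(power, noise_var, eps=1e-12):
    """Lambda per frame: mean over bands of (r - log r - 1), r = |X_k|^2 / lambda_N(k)."""
    r = np.maximum(power / np.maximum(noise_var, eps), eps)
    return np.mean(r - np.log(r) - 1.0, axis=-1)

def is_noisy_audio(power, noise_var, eta):
    """True where H1 (noisy audio frame) is chosen, False where H0 (noise frame)."""
    return likelihood_ratio(power, noise_var) > eta
```

Because r - log r - 1 is never negative and equals zero only at r = 1, the ratio is zero for a frame whose power matches the noise variance in every band and grows as the frame departs from the noise estimate.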
Then, a frame-merging process is performed in step 218. Frames that are too short and isolated are sometimes meaningless, so the frame-merging process is used to merge the small pieces into longer segments and to further raise the discrimination accuracy afterwards. In the preferred embodiment of the present invention, the frames are merged by determining programmatically whether the distance between two adjacent detected frames is too small. If the distance is too small, the two frames are merged into the same frame. Otherwise, they are still considered two different frames. In other words, when the distance between two adjacent noisy audio frames is smaller than a predetermined value, the two adjacent noisy audio frames are combined to compose the noisy audio segment 220. Refer to FIG. 3. FIG. 3 illustrates an example of the frame-merging process in the preferred embodiment of the present invention. As the circle in FIG. 3 shows, when the distance between the adjacent frames is too small, they are merged into a single frame.
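One way the frame-merging rule might be written is shown below. The text does not give the predetermined distance, so `max_gap` (measured in frames) is a hypothetical parameter, as is the function name.

```python
def merge_active_frames(active, max_gap=3):
    """Group active frame indices into (start, end) segments, end exclusive.

    Two active runs separated by fewer than `max_gap` inactive frames are joined.
    """
    segments = []
    for i, flag in enumerate(active):
        if not flag:
            continue
        if segments and i - segments[-1][1] < max_gap:
            segments[-1][1] = i + 1      # extend the previous segment across the gap
        else:
            segments.append([i, i + 1])  # start a new segment
    return [tuple(s) for s in segments]
```

For example, with `max_gap=3` a single inactive frame between two active runs is absorbed, while a gap of four inactive frames keeps the runs apart.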
It is noted that the noise threshold η can be estimated as different values according to different environments rather than a fixed value in order to make the audio signal segmentation algorithm of the present invention suitable for different environments. The following describes in detail the estimation of the noise threshold.
Refer to FIG. 4. FIG. 4 illustrates a flow diagram of the estimation of the noise threshold according to the preferred embodiment of the present invention. First, a noise segment 402 in the initial part of the audio signal is extracted. Then, the noise segment 402 is mixed with a noiseless speech/music segment 404 to a predetermined signal-to-noise ratio (SNR) to form a mixing audio segment 406. Then, the noise parameter 410 estimated from the noise segment 402 is used to perform the audio activity detection step on the mixing audio segment 406, as shown in step 408. The mixing audio segment 406 is first divided into multiple frames, and then, a frequency transformation step is applied to signals in each frame to obtain multiple bands in each frame. Then, a likelihood ratio between the bands and the noise parameter is computed, and the mixing audio segment 406 is divided into at least one speech segment and at least one music segment using a first threshold. Afterwards, in step 412, a judging step is performed to judge if the speech segment and the music segment are correctly segmented to match the noiseless speech/music segment 404 and obtain a result. If the result is no, the first threshold is adjusted and the audio activity detection step and the judging step are repeated on the mixing audio segment 406, as shown in step 414. If the result is yes, the first threshold is equal to the noise threshold, as shown in step 416.
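The mixing of the noise segment 402 with a noiseless segment 404 at a predetermined SNR can be sketched as follows. This is a simplified illustration assuming NumPy; it measures the clean power over the whole segment rather than only over the known voice locations, and the function name is hypothetical.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    clean = np.asarray(clean, dtype=float)
    # Repeat or truncate the noise so it covers the whole clean segment.
    noise = np.resize(np.asarray(noise, dtype=float), clean.shape)
    p_clean = np.mean(np.square(clean))
    p_noise = np.mean(np.square(noise))
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise
```

Calling `mix_at_snr(clean, noise, 3.0)` would produce a mixing audio segment at 3 dB, which can then be passed through the audio activity detection step.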
In other words, the estimation of the noise threshold in the preferred embodiment of the present invention first extracts a noise segment in the initial part of the audio signal and then mixes the noise segment with prepared training data (a noiseless speech/music segment) to a certain predetermined signal-to-noise ratio. Since the training data is prepared in advance, the location of the voice in the training data is already known, so the signal-to-noise ratio of the training data and the noise segment can be adjusted. For example, if the lowest SNR of the signals in the system is 5 dB, the SNR of the mixing audio segment can be set to 3 dB to estimate the threshold; it only needs to be lower than 5 dB. Then, the audio activity detection step is performed on the mixing audio segment. A Fourier transform is applied to the mixing audio segment in frames of 30 ms. Then, the likelihood ratio is computed, and an initial threshold of 0 is used for the judgment. If the threshold detects all of the voice parts in the training data, it is raised in steps of 0.2 until the highest threshold that can still completely tell apart all the voice segments is obtained. There are t pieces of training data, so this procedure is repeated t times. However, each piece of training data is short, so this does not take much time. When all the training data has been processed, t thresholds are obtained, and the smallest of these t thresholds is chosen as the threshold used in the system.
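The threshold search described above can be sketched as below. The sketch assumes that the judging step reduces to checking that every frame known to contain voice has a likelihood ratio above the candidate threshold. The function names and the `max_steps` safety bound are hypothetical.

```python
def highest_passing_threshold(lr, voiced, step=0.2, max_steps=10000):
    """Largest threshold (a multiple of `step`, from 0) that still flags every voiced frame.

    lr: per-frame likelihood ratios; voiced: per-frame booleans (known ground truth).
    Returns None if even a threshold of 0 misses a voiced frame.
    """
    def detects_all(eta):
        return all(value > eta for value, v in zip(lr, voiced) if v)

    if not detects_all(0.0):
        return None
    n = 0
    while n < max_steps and detects_all((n + 1) * step):
        n += 1
    return n * step

def estimate_noise_threshold(training_items, step=0.2):
    """Smallest of the per-item thresholds over all t training items."""
    found = [highest_passing_threshold(lr, voiced, step) for lr, voiced in training_items]
    found = [eta for eta in found if eta is not None]
    return min(found) if found else None
```

Taking the minimum over the t training items is the conservative choice: it is the highest threshold that still detects the voice in every item.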
The following describes in detail the audio feature extraction step used in the preferred embodiment of the present invention.
After performing the audio activity detection step, the audio signal inputted is divided into a noise segment and a noisy audio segment. Then, the audio feature extraction step is performed on the noisy audio segment to obtain audio features of the noisy audio segment. Three audio features are used in discriminating the speech signals and the music signals in the preferred embodiment of the present invention. Each audio feature is defined in a length of about one second, and the length of one second is also the smallest unit in the discrimination in the preferred embodiment of the present invention. These three audio features are low short time energy rate (LSTER), spectrum flux (SF), and likelihood ratio crossing rate (LRCR), respectively. They are described as follows.
The audio feature of low short time energy rate: within a piece of audio signal, the frame energy of the speech signal varies more than that of the music signal owing to the pitch, so the speech signal and the music signal can be discriminated simply by calculating the ratio of low-energy frames.
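The text does not give a formula for this feature. A minimal sketch under a commonly used definition is shown here: the fraction of frames in a one-second window whose short-time energy falls below half the window's mean energy. The 0.5 ratio and the function name are assumptions.

```python
def lster(frame_energies, ratio=0.5):
    """Fraction of frames whose short-time energy is below `ratio` x the window mean."""
    if len(frame_energies) == 0:
        return 0.0
    mean_energy = sum(frame_energies) / len(frame_energies)
    low = sum(1 for e in frame_energies if e < ratio * mean_energy)
    return low / len(frame_energies)
```

Under this definition a window with steady energy scores 0, while a window that alternates between loud and near-silent frames scores close to 0.5.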
The audio feature of spectrum flux: within an audio segment, the energy of the speech signal varies considerably, so the sum of the frequency distances between adjacent frames in the segment is larger for the speech signal. The frequency content of the music signal usually changes more slowly, so the sum of the frequency distances between adjacent frames is smaller. Therefore, the spectrum flux can be used to discriminate between the speech and the music signals.
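As an illustration, the spectrum flux of one analysis window might be computed as below, assuming NumPy. The text does not specify the distance measure, so the Euclidean distance between magnitude spectra used here is an assumption, as is the function name.

```python
import numpy as np

def spectrum_flux(magnitudes):
    """Sum of Euclidean distances between the spectra of adjacent frames.

    magnitudes: array of shape (n_frames, n_bands) for one analysis window.
    """
    magnitudes = np.asarray(magnitudes, dtype=float)
    diffs = np.diff(magnitudes, axis=0)   # frame-to-frame spectral change
    return float(np.sum(np.linalg.norm(diffs, axis=1)))
```

A window whose spectrum does not change gives a flux of 0. Other variants, such as using log magnitudes or averaging instead of summing, would rank speech and music the same way in principle.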
The audio feature of likelihood ratio crossing rate: The waveform of the likelihood ratio obtained in the AAD step can be used to tell the speech and the music apart by observing the damping characteristics. The speech signal has more frames of low energy than the music signal does. However, the speech and the music signal are not easily discriminated in the way of calculating the energy in time domain. Therefore, the audio feature of likelihood ratio crossing rate is derived in frequency domain. The likelihood ratio waveform of each frame obtained in the AAD step is used and the sum of the crossing rate of the likelihood ratio waveform compared to two thresholds is calculated. Generally speaking, the crossing rate in speech is higher than in music. The following describes in detail the audio feature extraction step in likelihood ratio crossing rate used in the present invention.
Refer to FIG. 5 and FIG. 6. FIG. 5 and FIG. 6 illustrate diagrams of the likelihood ratio crossing rate of the music and the speech signal, respectively. As shown in the enlarged diagrams in FIG. 5 and FIG. 6, one second is the smallest analysis unit for each segment, and eight and five one-second windows are illustrated, respectively. The mean and the two thresholds of the likelihood ratio in each window are computed. The mean of the likelihood ratio is denoted by the upper line, and the middle line and the lower line represent the two thresholds, at one third of the mean of the likelihood ratio and one ninth of the mean of the likelihood ratio, respectively. Then, the sum of the crossing rates of the likelihood ratio compared to the two thresholds is computed to discriminate between the music and the speech signals. As FIG. 5 and FIG. 6 show, the crossing rate of the likelihood ratio at the two thresholds for the music part in FIG. 5 is smaller than that for the speech part in FIG. 6.
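The likelihood ratio crossing rate of a single one-second window can be sketched as follows. The sketch counts crossings rather than normalizing them to a rate, which is equivalent when all windows have the same number of frames. The function names are hypothetical.

```python
def crossing_count(values, level):
    """Number of times consecutive samples fall on opposite sides of `level`."""
    above = [v > level for v in values]
    return sum(1 for a, b in zip(above, above[1:]) if a != b)

def lrcr(window_lr):
    """Sum of crossings of the likelihood-ratio waveform at mean/3 and mean/9."""
    mean_lr = sum(window_lr) / len(window_lr)
    return crossing_count(window_lr, mean_lr / 3) + crossing_count(window_lr, mean_lr / 9)
```

A waveform that repeatedly drops close to zero, as speech does between syllables, crosses both thresholds many times, while a steady waveform crosses neither.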
Refer to FIG. 7. FIG. 7 illustrates an example of the smoothing step according to the preferred embodiment of the present invention. After extracting the three audio features from each segment, a smoothing step is applied to the audio features to raise the discrimination rate of the speech and the music frames. In the preferred embodiment of the present invention, a rectangular window is used to perform the convolution process on the audio feature sequences obtained. The difference between the waveforms of the music and the speech frames before and after the smoothing step is shown in FIG. 7. In the waveform before the smoothing step, the audio feature sequences are irregular. The values of the features in speech segments are supposed to be high, but some of them are not as expected. The same holds for the music segments. The circles in FIG. 7 point out two such examples. After the convolution process is performed with the rectangular window, the feature sequences are smoother. Therefore, after the smoothing step, the error in discriminating can be reduced, and the discrimination rate for the following step can be raised.
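The convolution with a rectangular window amounts to a moving average, and can be sketched with NumPy as below. The window `width` and the unit-area normalization are assumptions, since the text does not state them.

```python
import numpy as np

def smooth(feature_seq, width=5):
    """Convolve a feature sequence with a unit-area rectangular window."""
    window = np.ones(width) / width
    # mode="same" keeps the output aligned with, and as long as, the input.
    return np.convolve(feature_seq, window, mode="same")
```

An isolated low value inside a run of high values is pulled back toward its neighbors. Note that `mode="same"` pads with zeros, so values near the two ends of the sequence are biased downward.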
After the smoothing step, a classifier is used to tell the speech and the music frames apart. Finally, the frames of the same kind are merged according to the result and the speech and the music segments are then segmented. In the preferred embodiment of the present invention, the classifier is a KNN based classifier to classify the speech and the music types. The signal belongs to the type (the speech or the music) which has the most training data in the nearest k training data in the codebook. In other embodiments of the present invention, other classifiers may also be used, such as a Gaussian mixture model (GMM) classifier, a hidden Markov model (HMM) classifier and a multi-layer perceptron (MLP) classifier.
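The KNN decision rule described above can be sketched as follows, using only the Python standard library (`math.dist` requires Python 3.8 or later). The codebook layout, the Euclidean distance and the value of `k` are assumptions for illustration.

```python
from collections import Counter
import math

def knn_classify(sample, codebook, k=3):
    """Label `sample` by majority vote among the k nearest codebook entries.

    codebook: list of (feature_vector, label) pairs, label "speech" or "music".
    """
    nearest = sorted(codebook, key=lambda entry: math.dist(sample, entry[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Here each feature vector would hold the smoothed LSTER, SF and LRCR values of a one-second unit. An odd `k` avoids ties between the two classes.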
Refer to FIG. 8. FIG. 8 illustrates an example of the audio signal segmentation algorithm according to the preferred embodiment of the present invention. The first figure in FIG. 8 is the original input audio signal. The second figure in FIG. 8 is the result after obtaining the likelihood ratio. The third figure in FIG. 8 is the result after the smoothing step, and the fourth figure in FIG. 8 is the result after segmenting the speech and the music segments. From FIG. 8, the speech and the music segments can be obtained from the input audio signal after the audio activity detection step, the audio feature extraction step, the smoothing step and the segmentation step.
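The final segmentation step, in which classified units of the same kind are merged into segments, can be sketched as below. The sketch assumes one label per one-second unit. The function name and the output tuple layout are hypothetical.

```python
from itertools import groupby

def labels_to_segments(labels, unit_seconds=1.0):
    """Collapse consecutive identical labels into (label, start_s, end_s) segments."""
    segments, position = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position * unit_seconds, (position + length) * unit_seconds))
        position += length
    return segments
```

For instance, two speech units followed by three music units yield one speech segment from 0 s to 2 s and one music segment from 2 s to 5 s.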
According to the aforementioned description, one advantage of the present invention is that the present invention provides an audio signal segmentation algorithm suitable to be used in low SNR environments which works well in practical noisy environments.
According to the aforementioned description, another advantage of the present invention is that the present invention provides an audio signal segmentation algorithm which can be integrated into multimedia content analysis applications, multimedia data compression and audio recognition, and can be used at the front end of an audio signal processing system to classify the signals and further to let the system discriminate and segment the speech and the audio signals.
According to the aforementioned description, yet another advantage of the present invention is that the present invention provides an audio signal segmentation algorithm which can be used as an IP to be supplied to multimedia system chips.
As is understood by a person skilled in the art, the foregoing preferred embodiments of the present invention are illustrative of the present invention rather than limiting of the present invention. It is intended to cover various modifications and similar arrangements included within the spirit and scope of the appended claims, the scope of which should be accorded the broadest interpretation so as to encompass all such modifications and similar structure.

Claims (17)

1. An audio signal segmentation algorithm comprising:
providing an audio signal;
applying an audio activity detection (AAD) step to divide the audio signal into at least one first audio segment and at least one second audio segment, wherein the audio activity detection step further comprises:
dividing the audio signal into a plurality of frames;
applying a frequency transformation step to signals in each of the frames to obtain a plurality of bands in each frame;
performing a likelihood computation step to the bands and a noise parameter to obtain a likelihood ratio therebetween;
performing a comparison step to the likelihood ratio and a noise threshold, if the noise threshold is greater than the likelihood ratio, the bands belonging to a first frame, and if the likelihood ratio is greater than the noise threshold, the bands belonging to a second frame wherein the first frame belongs to the first audio segment and the second frame belongs to the second audio segment; and
when a distance between two adjacent second frames is smaller than a predetermined value, combining the two adjacent second frames to compose the second audio segment,
performing an audio feature extraction step on the second audio segment to obtain a plurality of audio features of the second audio segment;
applying a smoothing step to the second audio segment after the audio feature extraction step; and
discriminating a plurality of speech frames and a plurality of music frames from the second audio segment wherein the speech frames and the music frames compose at least one speech segment and at least one music segment, respectively.
2. The audio signal segmentation algorithm according to claim 1, wherein the frequency transformation step is proceeding a Fourier Transform.
3. The audio signal segmentation algorithm according to claim 1, wherein the noise parameter is a noise variance of Fourier coefficient and is obtained by estimating a variance of a noise segment in the initial part of the audio signal.
4. The audio signal segmentation algorithm according to claim 1, wherein the likelihood computation step and the comparison step are based on the equation:
$$\Lambda = \frac{1}{L}\sum_{k=0}^{L-1}\left\{\frac{|X_k|^2}{\lambda_N(k)} - \log\frac{|X_k|^2}{\lambda_N(k)} - 1\right\} \underset{H_0}{\overset{H_1}{\gtrless}} \eta$$
where Λ is the likelihood ratio, L is the number of the bands, Xk denotes the kth Fourier coefficient in one of the frames, λN(k) is the noise variance of Fourier coefficient and denotes the variance of the kth Fourier coefficient of the noise, η is the noise threshold, H0 denotes the result is the first frame, and H1 denotes the result is the second frame.
5. The audio signal segmentation algorithm according to claim 1, wherein the estimation of the noise threshold further comprises:
extracting a noise segment from the initial part of the audio signal;
mixing the noise segment with one of a plurality of noiseless speech/music segments to a predetermined signal-to-noise ratio (SNR) to form a mixing audio segment;
applying the audio activity detection step to the mixing audio segment to divide the mixing audio segment into at least one speech segment and at least one music segment by using a first threshold; and
judging if the speech segment and the music segment match the noiseless speech/music segment and obtaining a result, if the result is yes, the first threshold being equal to the noise threshold, and if the result is no, adjusting the first threshold and repeating the audio activity detection step and the judging step on the mixing audio segment.
6. The audio signal segmentation algorithm according to claim 5, further comprising:
mixing the noise segment and the other noiseless speech/music segments, respectively, and repeating the audio activity detection step and the judging step to obtain a plurality of thresholds; and
comparing the thresholds with the first threshold to choose a smallest value as the noise threshold.
7. The audio signal segmentation algorithm according to claim 1, wherein the audio features are selected from the group consisting of low short time energy rate (LSTER), spectrum flux (SF), likelihood ratio crossing rate (LRCR) and an arbitrary combination thereof.
8. The audio signal segmentation algorithm according to claim 7, wherein the audio feature extraction step extracts the audio feature of the likelihood ratio crossing rate further comprising:
computing a sum of a crossing rate in the waveform of the likelihood ratio compared to a plurality of predetermined thresholds by using the likelihood ratio of each frame, if the sum of the crossing rate is greater than a predetermined value, the likelihood ratio belongs to the speech segment, and if the sum of the crossing rate is smaller than the predetermined value, the likelihood ratio belongs to the music segment.
9. The audio signal segmentation algorithm according to claim 8, wherein one of the predetermined thresholds is one third the mean of the likelihood ratio, and another one of the predetermined thresholds is one ninth the mean of the likelihood ratio.
10. The audio signal segmentation algorithm according to claim 1, wherein the smoothing step further comprises performing a convolution process to the second audio segment after the audio feature extraction step and a window.
11. The audio signal segmentation algorithm according to claim 10, wherein the window is a rectangular window.
12. The audio signal segmentation algorithm according to claim 1, wherein the step of discriminating the speech frames and the music frames from the second audio segment is based on a classifier, and the classifier is selected from the group consisting of a K-nearest neighbor (KNN) classifier, a Gaussian mixture model (GMM) classifier, a hidden Markov model (HMM) classifier and a multi-layer perceptron (MLP) classifier.
13. The audio signal segmentation algorithm according to claim 1, further comprising combining the speech frames and the music frames, respectively, to form the speech segment and the music segment after the step of discriminating the speech frames and the music frames from the second audio segment.
14. The audio signal segmentation algorithm according to claim 1, further comprising segmenting the speech segment and the music segment from the second audio segment.
15. The audio signal segmentation algorithm according to claim 1, wherein the first audio segment is a noise segment.
16. The audio signal segmentation algorithm according to claim 1, wherein the audio features are extracted by a frame with fixed length in the audio feature extraction step.
17. The audio signal segmentation algorithm according to claim 16, wherein the fixed length is one second.
US11/589,772 2006-05-22 2006-10-31 Audio signal segmentation algorithm Expired - Fee Related US7774203B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
TW95118143A 2006-05-22
TW95118143 2006-05-22
TW095118143A TWI312982B (en) 2006-05-22 2006-05-22 Audio signal segmentation algorithm

Publications (2)

Publication Number Publication Date
US20070271093A1 US20070271093A1 (en) 2007-11-22
US7774203B2 true US7774203B2 (en) 2010-08-10

Family

ID=38713045

Family Applications (1)

Application Number Title Priority Date Filing Date
US11/589,772 Expired - Fee Related US7774203B2 (en) 2006-05-22 2006-10-31 Audio signal segmentation algorithm

Country Status (2)

Country Link
US (1) US7774203B2 (en)
TW (1) TWI312982B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110029306A1 (en) * 2009-07-28 2011-02-03 Electronics And Telecommunications Research Institute Audio signal discriminating device and method
US20110243349A1 (en) * 2010-03-30 2011-10-06 Cambridge Silicon Radio Limited Noise Estimation
US20130103398A1 (en) * 2009-08-04 2013-04-25 Nokia Corporation Method and Apparatus for Audio Signal Classification
US20140088974A1 (en) * 2012-09-26 2014-03-27 Motorola Mobility Llc Apparatus and method for audio frame loss recovery
US8712771B2 (en) * 2009-07-02 2014-04-29 Alon Konchitsky Automated difference recognition between speaking sounds and music
US20140257814A1 (en) * 2013-03-05 2014-09-11 Microsoft Corporation Posterior-Based Feature with Partial Distance Elimination for Speech Recognition

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2100294A4 (en) * 2006-12-27 2011-09-28 Intel Corp Method and apparatus for speech segmentation
JP5130809B2 (en) * 2007-07-13 2013-01-30 ヤマハ株式会社 Apparatus and program for producing music
US20090043577A1 (en) * 2007-08-10 2009-02-12 Ditech Networks, Inc. Signal presence detection using bi-directional communication data
ATE552651T1 (en) * 2008-12-24 2012-04-15 Dolby Lab Licensing Corp AUDIO SIGNAL LOUDNESS DETERMINATION AND MODIFICATION IN THE FREQUENCY DOMAIN
CN101847412B (en) * 2009-03-27 2012-02-15 华为技术有限公司 Method and device for classifying audio signals
US10224036B2 (en) * 2010-10-05 2019-03-05 Infraware, Inc. Automated identification of verbal records using boosted classifiers to improve a textual transcript
TWI412019B (en) 2010-12-03 2013-10-11 Ind Tech Res Inst Sound event detecting module and method thereof
CN104282315B (en) * 2013-07-02 2017-11-24 华为技术有限公司 Audio signal classification processing method, device and equipment
CN104347067B (en) 2013-08-06 2017-04-12 华为技术有限公司 Audio signal classification method and device
CN103413553B (en) * 2013-08-20 2016-03-09 腾讯科技(深圳)有限公司 Audio coding method, audio-frequency decoding method, coding side, decoding end and system
US9685156B2 (en) * 2015-03-12 2017-06-20 Sony Mobile Communications Inc. Low-power voice command detector
CN108269567B (en) * 2018-01-23 2021-02-05 北京百度网讯科技有限公司 Method, apparatus, computing device, and computer-readable storage medium for generating far-field speech data
CN109712641A (en) * 2018-12-24 2019-05-03 重庆第二师范学院 A kind of processing method of audio classification and segmentation based on support vector machines
CN111724757A (en) * 2020-06-29 2020-09-29 腾讯音乐娱乐科技(深圳)有限公司 Audio data processing method and related product
CN112489692B (en) * 2020-11-03 2024-10-18 北京捷通华声科技股份有限公司 Voice endpoint detection method and device
CN112735470B (en) * 2020-12-28 2024-01-23 携程旅游网络技术(上海)有限公司 Audio cutting method, system, equipment and medium based on time delay neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6415253B1 (en) * 1998-02-20 2002-07-02 Meta-C Corporation Method and apparatus for enhancing noise-corrupted speech
US20020161576A1 (en) * 2001-02-13 2002-10-31 Adil Benyassine Speech coding system with a music classifier
US20060015333A1 (en) * 2004-07-16 2006-01-19 Mindspeed Technologies, Inc. Low-complexity music detection algorithm and system
US7558729B1 (en) * 2004-07-16 2009-07-07 Mindspeed Technologies, Inc. Music detection for enhancing echo cancellation and speech coding

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8712771B2 (en) * 2009-07-02 2014-04-29 Alon Konchitsky Automated difference recognition between speaking sounds and music
US20110029306A1 (en) * 2009-07-28 2011-02-03 Electronics And Telecommunications Research Institute Audio signal discriminating device and method
US20130103398A1 (en) * 2009-08-04 2013-04-25 Nokia Corporation Method and Apparatus for Audio Signal Classification
US9215538B2 (en) * 2009-08-04 2015-12-15 Nokia Technologies Oy Method and apparatus for audio signal classification
US20110243349A1 (en) * 2010-03-30 2011-10-06 Cambridge Silicon Radio Limited Noise Estimation
US8666092B2 (en) * 2010-03-30 2014-03-04 Cambridge Silicon Radio Limited Noise estimation
US20140088974A1 (en) * 2012-09-26 2014-03-27 Motorola Mobility Llc Apparatus and method for audio frame loss recovery
US9123328B2 (en) * 2012-09-26 2015-09-01 Google Technology Holdings LLC Apparatus and method for audio frame loss recovery
US20140257814A1 (en) * 2013-03-05 2014-09-11 Microsoft Corporation Posterior-Based Feature with Partial Distance Elimination for Speech Recognition
US9336775B2 (en) * 2013-03-05 2016-05-10 Microsoft Technology Licensing, Llc Posterior-based feature with partial distance elimination for speech recognition

Also Published As

Publication number Publication date
TW200744069A (en) 2007-12-01
US20070271093A1 (en) 2007-11-22
TWI312982B (en) 2009-08-01

Similar Documents

Publication Publication Date Title
US7774203B2 (en) Audio signal segmentation algorithm
US8155953B2 (en) Method and apparatus for discriminating between voice and non-voice using sound model
US7263485B2 (en) Robust detection and classification of objects in audio using limited training data
US7904295B2 (en) Method for automatic speaker recognition with hurst parameter based features and method for speaker classification based on fractional brownian motion classifiers
EP1083541B1 (en) A method and apparatus for speech detection
Ramírez et al. An effective subband OSF-based VAD with noise reduction for robust speech recognition
US7177808B2 (en) Method for improving speaker identification by determining usable speech
US6785645B2 (en) Real-time speech and music classifier
US8428945B2 (en) Acoustic signal classification system
Evangelopoulos et al. Multiband modulation energy tracking for noisy speech detection
US6993481B2 (en) Detection of speech activity using feature model adaptation
US20040064314A1 (en) Methods and apparatus for speech end-point detection
Singh et al. Speech in noisy environments: robust automatic segmentation, feature extraction, and hypothesis combination
US7120576B2 (en) Low-complexity music detection algorithm and system
US7243063B2 (en) Classifier-based non-linear projection for continuous speech segmentation
CN104835498A (en) Voiceprint identification method based on multi-type combination characteristic parameters
Górriz et al. Hard C-means clustering for voice activity detection
CA2492204A1 (en) Similar speaking recognition method and system using linear and nonlinear feature extraction
CN103165127A (en) Sound segmentation equipment, sound segmentation method and sound detecting system
Khoa Noise robust voice activity detection
Schwartz et al. The application of probability density estimation to text-independent speaker identification
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
Górriz et al. An effective cluster-based model for robust speech detection and speech recognition in noisy environments
US7630891B2 (en) Voice region detection apparatus and method with color noise removal using run statistics
Rabaoui et al. Using HMM-based classifier adapted to background noises with improved sounds features for audio surveillance application

Legal Events

Date Code Title Description
AS Assignment

Owner name: NATIONAL CHENG KUNG UNIVERSITY, TAIWAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WANG, JHING-FA;HUANG, CHAO-CHING;WU, DIAN-JIA;REEL/FRAME:018483/0590

Effective date: 20061017

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2552)

Year of fee payment: 8

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20220810