WO2008082133A1 - Method, medium, and apparatus to classify for audio signal, and method, medium and apparatus to encode and/or decode for audio signal using the same - Google Patents

Method, medium, and apparatus to classify for audio signal, and method, medium and apparatus to encode and/or decode for audio signal using the same

Info

Publication number
WO2008082133A1
WO2008082133A1 PCT/KR2007/006811
Authority
WO
WIPO (PCT)
Prior art keywords
long
term feature
audio signal
current frame
term
Prior art date
Application number
PCT/KR2007/006811
Other languages
French (fr)
Inventor
Chang-Yong Son
Eun-Mi Oh
Ki-Hyun Choo
Jung-Hoe Kim
Original Assignee
Samsung Electronics Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Priority to EP07860649A priority Critical patent/EP2102860A4/en
Publication of WO2008082133A1 publication Critical patent/WO2008082133A1/en


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/18Vocoders using multiple modes
    • G10L19/22Mode decision, i.e. based on audio signal content versus external parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10Digital recording or reproducing
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • the present general inventive concept relates to a method and apparatus to classify an audio signal and to a method and apparatus to encode and/or decode an audio signal using the classifying method and apparatus, and more particularly, to a system that classifies audio signals into music signals and speech signals, an encoding apparatus that encodes an audio signal according to whether it is a music signal or a speech signal, and an audio signal classifying method and apparatus which can be applied to Universal Codec and the like
  • Audio signals can be classified into various types, such as speech signals, music signals, or mixtures of speech signals and music signals, according to their characteristics, and different coding methods or compression methods are applied to these types. Compression methods for audio signals can be roughly divided into an audio codec and a speech codec
  • the audio codec, such as Advanced Audio Coding Plus (aacPlus), is intended to compress music signals
  • the audio codec compresses a music signal in a frequency domain using a psychoacoustic model
  • when a speech signal is compressed using the audio codec, sound quality degradation is worse than that caused by compression of an audio signal using the speech codec, and becomes more serious when the speech signal includes an attack signal
  • the speech codec, such as Adaptive Multi Rate - WideBand (AMR-WB), is intended to compress speech signals
  • the speech codec compresses an audio signal in a time domain using an utterance model
  • when an audio signal is compressed using the speech codec, sound quality degradation is worse than that caused by compression of a speech signal using the audio codec; accordingly, it is important to classify an audio signal into an exact type
  • U.S. Patent No. 6,134,518 discloses a method for coding a digital audio signal using a CELP coder and a transform coder
  • a classifier 20 measures the autocorrelation of an input audio signal 10 to select one of a CELP coder 30 and a transform coder 40 based on the measurement
  • the input audio signal 10 is coded by whichever one of the CELP coder 30 and the transform coder 40 is selected, by switching of a switch 50
  • the US patent discloses the classifier 20 that calculates a probability that a current audio signal is a speech signal or a music signal using autocorrelation in the time domain.
  • the present invention provides a classifying method and apparatus for an audio signal, in which a classification threshold for a current frame that is to be classified is adaptively adjusted according to a long-term feature of the audio signal in order to classify the current frame, thereby improving the hit rate of signal classification, suppressing frequent oscillation of a mode in frame units, improving noise tolerance, and improving smoothness of a reconstructed audio signal; and an encoding/decoding method and apparatus for an audio signal using the classifying method and apparatus.
  • a method of classifying an audio signal comprising: (a) analyzing the audio signal in units of frames, and generating a short-term feature and a long-term feature from the result of analyzing; (b) adaptively adjusting a classification threshold for a current frame that is to be classified, according to the generated long-term feature; and (c) classifying the current frame using the adjusted classification threshold.
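As a rough illustration of steps (a) through (c), the following Python sketch combines a moving-average long-term feature with a threshold biased by the previous frame's decision. All identifiers, default constants, and the direction of the final comparison are illustrative assumptions, not values taken from this publication.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class ClassifierState:
        short_terms: deque = field(default_factory=lambda: deque(maxlen=5))
        long_term: float = 0.0
        previous_mode: str = "music"

    def classify_frame(short_term: float, state: ClassifierState,
                       base_thr: float = 1.0, alpha: float = 0.9,
                       sx: float = 1.2, mx: float = 1.2) -> str:
        # (a) long-term feature: smoothed deviation of the short-term
        # feature from its moving average over buffered previous frames
        if state.short_terms:
            avg = sum(state.short_terms) / len(state.short_terms)
        else:
            avg = short_term
        variation = abs(short_term - avg)
        state.long_term = alpha * state.long_term + (1.0 - alpha) * variation

        # (b) adapt the classification threshold according to the
        # previous frame's decision
        thr = base_thr / sx if state.previous_mode == "speech" else base_thr * mx

        # (c) classify the current frame against the adjusted threshold
        # (the comparison direction is an assumption)
        mode = "speech" if short_term > thr else "music"
        state.short_terms.append(short_term)
        state.previous_mode = mode
        return mode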
  • an apparatus for classifying an audio signal comprising: a short-term feature generation unit to analyze the audio signal in units of frames and generate a short-term feature; a long-term feature generation unit to generate a long-term feature using the short-term feature; a classification threshold adjustment unit to adaptively adjust a classification threshold for a current frame that is to be classified, by using the generated long-term feature; and a classification unit to classify the current frame using the adjusted classification threshold.
  • an apparatus for encoding an audio signal comprising: a short-term feature generation unit to analyze an audio signal in units of frames and generate a short-term feature; a long-term feature generation unit to generate a long-term feature using the short-term feature; a classification threshold adjustment unit to adaptively adjust a classification threshold for a current frame that is to be classified, using the generated long-term feature; a classification unit to classify the current frame using the adaptively adjusted classification threshold; an encoding unit to encode the classified audio signal in units of frames; and a multiplexer to perform bitstream processing on the encoded signal so as to generate a bitstream.
  • a method of decoding an audio signal comprising: receiving a bitstream including classification information regarding each of frames of an audio signal, where the classification information is adaptively determined using a long-term feature of the audio signal; determining a decoding mode for the audio signal based on the classification information; and decoding the received bitstream according to the determined decoding mode.
  • an apparatus for decoding an audio signal comprising: a receipt unit to receive a bitstream including classification information for each of frames of an audio signal, where the classification information is adaptively determined using a long-term feature of the audio signal; a decoding mode determination unit to determine a decoding mode for the received bitstream according to the classification information; and a decoding unit to decode the received bitstream according to the determined decoding mode.
  • FIG. 1 is a block diagram of a conventional audio signal encoder
  • FIG. 2 is a block diagram of an apparatus to encode for an audio signal according to an embodiment of the present general inventive concept
  • FIG. 3 is a block diagram of an apparatus to classify for an audio signal according to an embodiment of the present general inventive concept
  • FIG. 4 is a detailed block diagram of a short-term feature generation unit and a long-term feature generation unit illustrated in FIG. 3;
  • FIG. 5 is a detailed block diagram of a linear prediction-long-term prediction (LP-LTP) gain generation unit illustrated in FIG. 4;
  • FIG. 6A is a screen shot illustrating a variation feature SNR_VAR of an LP-LTP gain according to a music signal and a speech signal;
  • FIG. 6B is a reference diagram illustrating the distribution feature of a frequency percent according to the variation feature SNR_VAR of FIG. 6A;
  • FIG. 6C is a reference diagram illustrating the distribution feature of cumulative frequency percent according to the variation feature SNR_VAR of FIG. 6A;
  • FIG. 6D is a reference diagram illustrating a long-term feature SNR_SP according to the LP-LTP gain of FIG. 6A;
  • FIG. 7A is a screen shot illustrating a variation feature TILT_VAR of a spectrum tilt according to a music signal and a speech signal;
  • FIG. 7B is a reference diagram illustrating a long-term feature TILT_SP of the spectrum tilt of FIG. 7A;
  • FIG. 8A is a screen shot illustrating a variation feature ZC_VAR of a zero crossing rate according to a music signal and a speech signal;
  • FIG. 8B is a reference diagram illustrating a long-term feature ZC_SP with respect to the zero crossing rate of FIG. 8A;
  • FIG. 9 is a reference diagram illustrating a long-term feature SPP according to a music signal and a speech signal;
  • FIG. 10 is a flowchart illustrating a method to classify an audio signal according to an embodiment of the present general inventive concept.
  • FIG. 11 is a block diagram of an apparatus to decode for an audio signal according to an exemplary embodiment of the present general inventive concept.
  • Mode for Invention
  • FIG. 2A is a block diagram of an apparatus to encode for an audio signal according to an embodiment of the present general inventive concept.
  • the apparatus to encode for an audio signal includes an audio signal classifying apparatus 100, a speech coding unit 200, a music coding unit 300, and a bitstream multiplexer 400.
  • the audio signal classifying apparatus 100 divides an input audio signal into frames based on the input time of the audio signal, and determines whether each of the frames is a speech signal or a music signal.
  • the audio signal classifying apparatus 100 transmits, as additional information, classification information indicating whether a current frame is a speech signal or a music signal to the bitstream multiplexer 400.
  • the detailed construction of the audio signal classifying apparatus 100 is illustrated in FIG. 3 and will be described later.
  • the audio signal classifying apparatus 100 may further include a time-to-frequency conversion unit (not shown) that converts an audio signal in the time domain into a signal in the frequency domain.
  • the speech coding unit 200 encodes an audio signal corresponding to a frame that is classified into the speech signal by the audio signal classifying apparatus 100, and transmits the encoded audio signal to the bitstream multiplexer 400.
  • encoding is performed by the speech coding unit 200 and the music coding unit 300, but an audio signal may be encoded by a time-domain coding unit and a frequency-domain coding unit.
  • CELP: code excited linear prediction
  • TCX: transform coded excitation
  • AAC: advanced audio codec
  • the bitstream multiplexer 400 receives the encoded audio signal from the speech coding unit 200 or the music coding unit 300 and the classification information from the audio signal classifying apparatus 100, and generates a bitstream using the received signal and the classification information.
  • the classification information included in the generated bitstream can be used in a decoding mode to determine a method of efficiently reconstructing the audio signal.
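A minimal sketch of the encoder-side routing described above: the per-frame classification selects the speech or music coder, and the classification information is multiplexed into the output so a decoder can recover the mode. The one-byte flag framing and all function names are assumptions for illustration only, not the publication's bitstream format.

    from typing import Callable

    def encode_frame(frame, classify: Callable, speech_coder: Callable,
                     music_coder: Callable) -> bytes:
        # Classify the frame, route it to the matching coder, and prepend
        # the classification flag (1 = speech, 0 = music) as toy framing.
        mode = classify(frame)
        payload = speech_coder(frame) if mode == "speech" else music_coder(frame)
        flag = b"\x01" if mode == "speech" else b"\x00"
        return flag + payload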
  • FIG. 3 is a block diagram of an audio signal classifying apparatus 100 according to an exemplary embodiment of the present invention.
  • the audio signal classifying apparatus 100 includes an audio signal division unit 110, a short-term feature generation unit 120, a long-term feature generation unit 130, a buffer 160 including a short-term feature buffer 161 and a long-term feature buffer 162, a long-term feature comparison unit 170, a classification threshold adjustment unit 180, and a classification unit 190.
  • the audio signal division unit 110 divides an input audio signal into frames in the time domain and transmits the divided audio signal to the short-term feature generation unit 120.
  • the short-term feature generation unit 120 performs short-term analysis with respect to the divided audio signal to generate a short-term feature.
  • the short-term feature is the unique feature of each frame, the use of which can determine whether the current frame is in a music mode or a speech mode and which one of the time domain and the frequency domain is an efficient encoding domain for the current frame.
  • the short-term feature may include a linear prediction-long-term prediction (LP-LTP) gain, a spectrum tilt, a zero crossing rate, a spectrum autocorrelation, and the like.
  • the short-term feature generation unit 120 may independently generate and output one short-term feature or a plurality of short-term features, or output the sum of a plurality of weighted short-term features as a representative short-term feature.
  • the detailed structure of the short-term feature generation unit 120 is illustrated in FIG. 4 and will be described later.
  • the long-term feature generation unit 130 generates a long-term feature using the short-term feature generated by the short-term feature generation unit 120 and features that are stored in the short-term feature buffer 161 and the long-term feature buffer 162.
  • the long-term feature generation unit 130 includes a first long-term feature generation unit 140 and a second long-term feature generation unit 150.
  • the first long-term feature generation unit 140 obtains information about the short-term features of 5 consecutive previous frames preceding the current frame from the short-term feature buffer 161 to calculate an average value, and calculates the difference between the short-term feature of the current frame and the calculated average value, thereby generating a variation feature.
  • when the short-term feature is an LP-LTP gain, the average value is an average of LP-LTP gains of the previous frames preceding the current frame, and the variation feature is information describing how much the LP-LTP gain of the current frame deviates from the average value corresponding to a predetermined term
  • this variation feature of the LP-LTP gain is referred to as the Signal to Noise Ratio Variation (SNR_VAR)
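The variation feature can be sketched as follows. The buffer of 5 previous frames comes from the text above; taking the absolute value of the deviation is an assumption of this sketch.

    def variation_feature(current: float, previous: list) -> float:
        # Deviation of the current frame's short-term feature (e.g. an
        # LP-LTP gain, yielding SNR_VAR) from the average over the
        # buffered previous frames (e.g. 5 consecutive frames).
        average = sum(previous) / len(previous)
        return abs(current - average)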
  • the second long-term feature generation unit 150 generates a long-term feature having a moving average that considers a per-frame change in the variation feature generated by the first long-term feature generation unit 140 under a predetermined constraint.
  • the predetermined constraint means a condition and a method for applying a weight to the variation feature of a previous frame preceding the current frame.
  • the second long-term feature generation unit 150 distinguishes between a case where the variation feature of the current frame is greater than a predetermined threshold and a case where the variation feature of the current frame is less than the predetermined threshold, and applies different weights to the variation feature of the previous frame and the variation feature of the current frame, thereby generating a long-term feature.
  • the predetermined threshold is a preset value for distinguishing between a speech signal and a music signal. The generation of the long-term feature will later be described in more detail.
  • the buffer 160 includes the short-term feature buffer 161 and the long-term feature buffer 162
  • the short-term feature buffer 161 stores a short-term feature generated by the short-term feature generation unit 120 for at least a predetermined period of time
  • the long-term feature buffer 162 stores a long-term feature generated by the first long-term feature generation unit 140 and the second long-term feature generation unit 150 for at least a predetermined period of time.
  • the long-term feature comparison unit 170 compares the long-term feature generated by the second long-term feature generation unit 150 with a predetermined threshold.
  • the predetermined threshold is a long-term feature for the case where there is a high possibility that a current signal is a speech signal and is previously determined by preliminary statistical analysis.
  • when a threshold SpThr for a long-term feature is set as illustrated in FIG. 9B and the long-term feature generated by the second long-term feature generation unit 150 is greater than the threshold SpThr, the possibility that the current frame is a music signal is less than 1%. In other words, when the long-term feature is greater than the threshold, the current frame can be classified into a speech signal.
  • the type of the current frame can be determined by a process of adjusting a classification threshold and comparison of the short-term feature with the classification threshold.
  • the threshold may be adjusted based on the hit rate of classification and as illustrated in FIG. 9B, the hit rate of classification is lowered by setting the threshold low.
  • the classification threshold adjustment unit 180 adaptively adjusts the classification threshold that is referred to for classifying the current frame when the long-term feature generated by the second long-term feature generation unit 150 is less than the threshold, i.e., when it is difficult to determine the type of the current frame only with the long-term feature.
  • the classification threshold adjustment unit 180 receives classification information of a previous frame from the classification unit 190, and adjusts the classification threshold adaptively according to whether the previous frame is classified into the speech signal or the music signal.
  • the classification threshold is used to determine whether the short-term feature of a frame that is to be classified, i.e., the current frame, has a property of the speech signal or the music signal.
  • the main technical idea of the current embodiment is that the classification threshold is adjusted according to whether a previous frame preceding the current frame is classified into the speech signal or the music signal. The adjustment of the classification threshold will later be described in detail.
  • the classification unit 190 compares the short-term feature of the current frame with a classification threshold STF_THR adjusted by the classification threshold adjustment unit 180 in order to determine whether the current frame is the speech signal or the music signal.
  • FIG. 4 is a detailed block diagram of the short-term feature generation unit 120 and the long-term feature generation unit 130 illustrated in FIG. 3.
  • the short-term feature generation unit 120 includes an LP-LTP gain generation unit 121, a spectrum tilt generation unit 122, and a zero crossing rate (ZCR) generation unit 123.
  • the long-term feature generation unit 130 includes an LP-LTP moving average calculation unit 141, a spectrum tilt moving average calculation unit 142, a zero crossing rate moving average calculation unit 143, a first variation feature comparison unit 151, a second variation feature comparison unit 152, a third variation feature comparison unit 153, an SNR_SP calculation unit 154, a TILT_SP calculation unit 155, a ZC_SP calculation unit 156, and a speech presence possibility (SPP) generation unit 157.
  • the LP-LTP gain generation unit 121 generates an LP-LTP gain of the current frame by short-term analysis with respect to each frame of the input audio signal.
  • FIG. 5 is a detailed block diagram of the LP-LTP gain generation unit 121. Referring to FIG. 5, the LP-LTP gain generation unit 121 includes an LP analysis unit 121a, an open-loop pitch analysis unit 121b, an LTP contribution synthesis unit 121c, and a weighted SegSNR calculation unit 121d.
  • the LP analysis unit 121a performs a calculation in which PrdErr is a prediction error according to Levinson-Durbin, which is a process of obtaining an LP filter coefficient, and the first reflection coefficient is used.
  • the LP analysis unit 121a calculates a linear prediction coefficient (LPC) using autocorrelation with respect to the current frame. At this time, a short-term analysis filter is specified by the LPC and a signal passing through the specified filter is transmitted to the open-loop pitch analysis unit 121b.
  • LPC: linear prediction coefficient
  • the open-loop pitch analysis unit 121b calculates a pitch correlation by performing long-term analysis with respect to an audio signal that is filtered by the short-term analysis filter.
  • the open-loop pitch analysis unit 121b calculates an open-loop pitch lag for the maximum cross-correlation between an audio signal corresponding to a previous frame stored in the buffer 160 and an audio signal corresponding to the current frame, and specifies a long-term analysis filter using the calculated lag.
  • the open-loop pitch analysis unit 121b obtains a pitch using correlation between a previous audio signal and the current audio signal, which is obtained by the LP analysis unit 121a, and divides the correlation by the pitch, thereby calculating a normalized pitch correlation.
  • the normalized pitch correlation r_x can be calculated as follows:
  • where T is an estimation value of an open-loop pitch period and x(i) is a weighted value of an input signal.
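The equation itself is not legible in this text, so the sketch below uses the standard normalized cross-correlation at the estimated pitch lag T; the exact normalization used in the publication may differ.

    import numpy as np

    def normalized_pitch_correlation(x: np.ndarray, T: int) -> float:
        # Normalized cross-correlation between the weighted input signal
        # and its copy delayed by the open-loop pitch period T (assumed
        # symmetric form; requires T > 0).
        a, b = x[T:], x[:-T]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        return float(np.dot(a, b) / denom) if denom > 0.0 else 0.0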
  • the LP-LTP synthesis unit 121c receives zero excitation as an input and performs LP-LTP synthesis to produce a reconstructed signal.
  • the weighted SegSNR calculation unit 121d calculates an LP-LTP gain of a reconstructed signal received from the LP-LTP synthesis unit 121c.
  • the LP-LTP gain, which is a short-term feature of the current frame, is transmitted to the LP-LTP moving average calculation unit 141.
  • the LP-LTP moving average calculation unit 141 calculates an average of LP-LTP gains of a predetermined number of previous frames preceding the current frame, which are stored in the short-term feature buffer 161.
  • the first variation feature comparison unit 151 receives a difference SNR_VAR between the moving average calculated by the LP-LTP moving average calculation unit 141 and the LP-LTP gain of the current frame, and compares the received difference with a predetermined threshold SNR_THR.
  • the SNR_SP calculation unit 154 calculates a long-term feature SNR_SP by an 'if' conditional statement according to the comparison result obtained by the first variation feature comparison unit 151, as follows:

        if (SNR_VAR > SNR_THR)
            SNR_SP = a1 * SNR_SP + (1 - a1) * SNR_VAR
        else
            SNR_SP = SNR_SP - D1                          ... (3)

  • an initial value of SNR_SP is 0, and a1 is a real number between 0 and 1 and is a weight for SNR_SP and SNR_VAR
  • D1 in Equation (3) is a constant that suppresses a mode change between the speech mode and the music mode caused by noise, and a larger a1 allows smoother reconstruction of an audio signal
  • the long-term feature SNR_SP increases when SNR_VAR is greater than the threshold SNR_THR, and the long-term feature SNR_SP is reduced from the SNR_SP of a previous frame by a predetermined value when SNR_VAR is less than the threshold SNR_THR
  • the SNR_SP calculation unit 154 calculates the long-term feature SNR_SP by executing the 'if' conditional statement expressed by Equation (3) for each frame of the input audio signal; SNR_VAR is also a kind of long-term feature, but is transformed into SNR_SP having a distribution illustrated in FIG. 6D
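In code, the recursive update of Equation (3) looks like the sketch below. The decay branch follows the textual description ("reduced from SNR_SP of a previous frame by a predetermined value"), and the clamp at zero is an added assumption.

    def update_sp(sp_prev: float, var: float, thr: float,
                  a: float, d: float) -> float:
        # Grow the speech possibility toward the variation feature when
        # it exceeds the threshold; otherwise decay by a fixed step d.
        if var > thr:
            return a * sp_prev + (1.0 - a) * var
        return max(sp_prev - d, 0.0)  # clamp at 0 is an assumption

The same update form applies, with their own thresholds and weights, to the TILT_SP and ZC_SP features described below, e.g. snr_sp = update_sp(snr_sp, snr_var, SNR_THR, a1, d1).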
  • FIGS. 6A through 6D are reference diagrams for explaining distribution features of the variation feature SNR_VAR and the long-term feature SNR_SP
  • FIG. 6A is a screen shot illustrating a variation feature SNR_VAR of an LP-LTP gain according to a music signal and a speech signal. It can be seen from FIG. 6A that SNR_VAR generated by the LP-LTP gain generation unit 121 has different distributions according to whether an input signal is a speech signal or a music signal
  • FIG. 6B is a reference diagram illustrating the statistical distribution feature of a frequency percent according to the variation feature SNR_VAR of the LP-LTP gain
  • the vertical axis indicates a frequency percent, i.e., (frequency of SNR_VAR / total frequency) x 100%
  • an uttered speech signal is generally composed of voiced sound, unvoiced sound, and silence. The voiced sound has a large LP-LTP gain, and the unvoiced sound and silence have small LP-LTP gains
  • most speech signals having a switch between voiced sound and unvoiced sound have a large SNR_VAR within a predetermined interval
  • music signals are continuous or have a small LP-LTP gain change and thus have a smaller SNR_VAR than the speech signals
  • FIG. 6C is a reference diagram illustrating the statistical distribution feature of a cumulative frequency percent according to the variation feature SNR_VAR of an LP-LTP gain
  • SNR_THR is employed as a criterion for executing a conditional statement for obtaining SNR_SP, thereby improving the accuracy of discrimination
  • FIG. 6D is a reference diagram illustrating a long-term feature SNR_SP according to an LP-LTP gain
  • the SNR_SP calculation unit 154 generates a new long-term feature SNR_SP for SNR_VAR having a distribution illustrated in FIG. 6A by executing the conditional statement. It can also be seen from FIG. 6D that SNR_SP values for a speech signal and a music signal, which are obtained by executing the conditional statement according to the threshold SNR_THR, are definitely distinguished from each other
  • the spectrum tilt generation unit 122 generates a spectrum tilt of the current frame using short-term analysis for each frame of an input audio signal
  • the spectrum tilt is a ratio of energy according to a low-band spectrum to energy according to a high-band spectrum, as sketched below
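A sketch of the low-band/high-band energy ratio follows. The split frequency and sampling rate are placeholders, since the publication does not fix them in this passage.

    import numpy as np

    def spectrum_tilt(frame: np.ndarray, fs: float = 16000.0,
                      split_hz: float = 2000.0) -> float:
        # Ratio of low-band spectral energy to high-band spectral energy.
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        low = spectrum[freqs < split_hz].sum()
        high = spectrum[freqs >= split_hz].sum()
        return float(low / high) if high > 0.0 else float("inf")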
  • the spectrum tilt moving average calculation unit 142 calculates an average of spectrum tilts of a predetermined number of frames preceding the current frame, which are stored in the short-term feature buffer 161, or calculates an average of spectrum tilts including the spectrum tilt of the current frame generated by the spectrum tilt generation unit 122.
  • the second variation feature comparison unit 152 receives a difference TILT_VAR between the average generated by the spectrum tilt moving average calculation unit 142 and the spectrum tilt of the current frame generated by the spectrum tilt generation unit 122, and compares the received difference with a predetermined threshold TILT_THR.
  • the TILT_SP calculation unit 155 calculates a tilt speech possibility TILT_SP, which is a long-term feature, by executing an 'if' conditional statement expressed by Equation (5) according to the comparison result obtained by the second variation feature comparison unit 152, as follows:

        if (TILT_VAR > TILT_THR)
            TILT_SP = a2 * TILT_SP + (1 - a2) * TILT_VAR
        else
            TILT_SP = TILT_SP - D2                        ... (5)

  • an initial value of TILT_SP is 0, and a2 is a real number between 0 and 1 and is a weight for TILT_SP and TILT_VAR
  • D2 is β2 x (TILT_THR / spectrum tilt), in which the spectrum tilt is that of the current frame
  • FIG. 7A is a screen shot illustrating a variation feature TILT_VAR of a spectrum tilt according to a music signal and a speech signal.
  • the variation feature TILT_VAR generated by the spectrum tilt generation unit 122 differs according to whether an input signal is a speech signal or a music signal.
  • FIG. 7B is a reference diagram illustrating a long-term feature TILT_SP of a spectrum tilt.
  • the TILT_SP calculation unit 155 generates a new long-term feature TILT_SP by executing the conditional statement with respect to TILT_VAR having a distribution illustrated in FIG. 7A. It can also be seen from FIG. 7B that TILT_SP values for a speech signal and a music signal, which are obtained by executing the conditional statement according to the threshold TILT_THR, are definitely distinguished from each other.
  • the ZCR generation unit 123 generates a zero crossing rate of the current frame by performing short-term analysis for each frame of the input audio signal.
  • the zero crossing rate means the frequency of occurrence of a sign change in input samples with respect to the current frame, and is calculated according to a conditional statement expressed by Equation (6)
  • S(n) is a variable for determining whether the audio signal at sample n of the current frame is a positive value or a negative value, and an initial value of ZCR is 0
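Equation (6) is not reproduced legibly in this text; the following is a conventional sign-change count normalized by the frame length, offered as an assumed formulation of the S(n)-based conditional statement.

    import numpy as np

    def zero_crossing_rate(frame: np.ndarray) -> float:
        # Count sign changes between consecutive samples of the frame.
        s = np.sign(frame)
        s[s == 0] = 1  # treat exact zeros as positive (assumption)
        return float(np.mean(s[1:] != s[:-1]))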
  • the ZCR average calculation unit 143 calculates an average of zero crossing rates of a predetermined number of previous frames preceding the current frame, which are stored in the short-term feature buffer 161, or calculates an average of zero crossing rates including the zero crossing rate of the current frame, which is generated by the ZCR generation unit 123.
  • the third variation feature comparison unit 153 receives a difference ZC_VAR between the average generated by the ZCR average calculation unit 143 and the zero crossing rate of the current frame generated by the ZCR generation unit 123, and compares the received difference with a predetermined threshold ZC_THR.
  • the ZC_SP calculation unit 156 calculates ZC_SP, which is a long-term feature, by executing an 'if' conditional statement expressed by Equation (7) according to the comparison result obtained by the third variation feature comparison unit 153, as follows:

        if (ZC_VAR > ZC_THR)
            ZC_SP = a3 * ZC_SP + (1 - a3) * ZC_VAR
        else
            ZC_SP = ZC_SP - D3                            ... (7)

  • an initial value of ZC_SP is 0, a3 is a real number between 0 and 1 and is a weight for ZC_SP and ZC_VAR, and D3 is β3 x (ZC_THR / zero-crossing rate), in which the zero-crossing rate is that of the current frame.
  • FIG. 8A is a screen shot illustrating a variation feature ZC_VAR of a zero crossing rate according to a music signal and a speech signal.
  • ZC_VAR generated by the ZCR generation unit 123 differs according to whether an input signal is a speech signal or a music signal.
  • FIG. 8B is a reference diagram illustrating a long-term feature ZC_SP of a zero crossing rate.
  • the ZC_SP calculation unit 156 generates a new long-term feature value ZC_SP by executing the conditional statement with respect to ZC_VAR having a distribution as illustrated in FIG. 8A. It can also be seen from FIG. 8B that ZC_SP values for a speech signal and a music signal, which are obtained by executing the conditional statement according to the threshold ZC_THR, are definitely distinguished from each other.
  • the SPP generation unit 157 generates a speech presence possibility (SPP) using the long-term features calculated by the SNR_SP calculation unit 154, the TILT_SP calculation unit 155, and the ZC_SP calculation unit 156, as follows:

        SPP = SNR_W * SNR_SP + TILT_W * TILT_SP + ZC_W * ZC_SP

  • SNR_W is a weight for SNR_SP, TILT_W is a weight for TILT_SP, and ZC_W is a weight for ZC_SP
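The weighted combination can be sketched directly; the default weight values below are placeholders, not the publication's tuned weights.

    def speech_presence_possibility(snr_sp: float, tilt_sp: float,
                                    zc_sp: float, snr_w: float = 0.5,
                                    tilt_w: float = 0.3,
                                    zc_w: float = 0.2) -> float:
        # Weighted sum of the three long-term features.
        return snr_w * snr_sp + tilt_w * tilt_sp + zc_w * zc_sp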
  • FIG. 9A is a reference diagram illustrating the distribution feature of an SPP generated by the SPP generation unit 157
  • the short-term features generated by the LP-LTP gain generation unit 121, the spectrum tilt generation unit 122, and the ZCR generation unit 123 are transformed into a new long-term feature SPP by the above-described process, and a speech signal and a music signal can be more definitely distinguished from each other based on the long-term feature SPP
  • FIG. 9B is a reference diagram illustrating a cumulative long-term feature according to the long-term feature SPP of FIG. 9A
  • a long-term feature threshold SpThr may be set to an SPP for a 99% cumulative distribution of a music signal
  • when the SPP of the current frame is greater than the threshold SpThr, an audio signal corresponding to the current frame may be determined as a speech signal
  • when the SPP of the current frame is less than the threshold SpThr, a classification threshold is adjusted based on whether a previous frame is classified into a speech signal or a music signal, and the adjusted classification threshold is compared with the short-term feature of the current frame, thereby classifying the current frame into the speech signal or the music signal
  • the present invention discloses a method of distinguishing between a speech signal and a music signal included in an audio signal
  • voice activity detection (VAD) has been widely used to distinguish between a desired signal and the other signals that are included in an audio signal
  • however, VAD has been designed to mainly process speech signals, and is thus unavailable under an environment in which speech, music, and noise are mixed
  • the present invention can be generally applied to an encoding apparatus that encodes an audio signal according to whether it is a music signal or a speech signal, and to Universal Codec and the like
  • FIG. 10 is a flowchart illustrating a method to classify an audio signal according to an exemplary embodiment of the present general inventive concept
  • the short-term feature generation unit 120 divides an input audio signal into frames and calculates an LP-LTP gain, a spectrum tilt, and a zero crossing rate by performing short-term analysis with respect to each of the frames.
  • a hit rate of 90% or higher can be achieved when the audio signal is classified in units of frames using three types of short-term features. The calculation of the short- term features has already been described above and thus will be omitted here.
  • the long-term feature generation unit 130 calculates long-term features SNR_SP, TILT_SP, and ZC_SP by performing long-term analysis with respect to the short-term features generated by the short-term feature generation unit 120, and applies weights to the long-term features, thereby calculating an SPP.
  • in operation 1100 and operation 1200, short-term features and long-term features of the current frame are calculated. Methods of calculating short-term features and long-term features of the current frame have been described above. Although not illustrated in FIG. 10, before performing operations 1100 and 1200, it is necessary to obtain information regarding the distributions of short-term features and long-term features from speech data and music data, and make the obtained information a database.
  • the long-term feature comparison unit 170 compares SPP of the current frame calculated in operation 1200 with a preset long-term feature threshold SpThr. When SPP is greater than SpThr, the current frame is determined as a speech signal. When SPP is less than SpThr, a classification threshold is adjusted and compared with a short-term feature, thereby determining the type of the current frame.
  • the classification threshold adjustment unit 180 receives classification information about a previous frame from the long-term feature comparison unit 170 or the long-term feature buffer 162, and determines whether the previous frame is classified into a speech signal or a music signal according to the received classification information.
  • the classification threshold adjustment unit 180 outputs a value obtained by dividing a classification threshold STF_THR for determining a short-term feature of the current frame by a value Sx when the previous frame is classified into the speech signal.
  • Sx is a value having an attribute of a cumulative probability of a speech signal and is intended to increase or reduce the classification threshold. Referring to FIG. 9A, an SPP for an Sx of 1 is selected as SpSx, and a cumulative probability with respect to each SPP is divided by the cumulative probability with respect to SpSx, thereby calculating a normalized Sx.
  • when Sx is greater than SpSx, the mode determination threshold STF_THR is reduced in operation 1410, and the possibility that the current frame is determined as the speech signal is increased.
  • the classification threshold adjustment unit 180 outputs a product of the classification threshold STF_THR for determining the short-term feature of the current frame and a value Mx when the previous frame is determined as the music signal.
  • Mx is a value having an attribute of a cumulative probability of a music signal and is intended to increase or reduce the classification threshold.
  • a music presence possibility (MPP) for an Mx of 1 may be set as MpMx and a probability with respect to each MPP is divided by a probability with respect to MpMx, thereby calculating normalized Mx.
  • when Mx is greater than MpMx, the classification threshold STF_THR is increased, and the possibility that the current frame is determined as the music signal is also increased.
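Both adjustment branches reduce to one small function: dividing by Sx after a speech frame and multiplying by Mx after a music frame biases the next decision toward repeating the previous mode, which is what suppresses per-frame mode oscillation. The function itself is an illustrative sketch; only the Sx/Mx names come from the text.

    def adjust_classification_threshold(stf_thr: float, previous_mode: str,
                                        sx: float, mx: float) -> float:
        if previous_mode == "speech":
            return stf_thr / sx  # Sx > 1 lowers the threshold: speech more likely
        return stf_thr * mx      # Mx > 1 raises the threshold: music more likely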
  • the classification threshold adjustment unit 180 compares the short-term feature of the current frame with the classification threshold STF_THR that is adaptively adjusted in operation 1410 or operation 1420, and outputs the comparison result.
  • the classification unit 190 determines the current frame as the music signal, and outputs the determination result as classification information.
  • the classification unit 190 determines the current frame as the speech signal, and outputs the determination result as classification information.
  • FIG. 11 is a block diagram of a decoding apparatus 2000 for an audio signal according to an exemplary embodiment of the present general inventive concept.
  • a bitstream receipt unit 2100 receives a bitstream including classification information for each frame of an audio signal.
  • a classification information extraction unit 2200 extracts the classification information from the received bitstream.
  • a decoding mode determination unit 2300 determines a decoding mode for the audio signal according to the extracted classification information, and transmits the bitstream to a music decoding unit 2400 or a speech decoding unit 2500.
  • the music decoding unit 2400 decodes the received bitstream in the frequency domain and the speech decoding unit 2500 decodes the received bitstream in the time domain.
  • a mixing unit 2600 mixes the decoded signals in order to reconstruct the audio signal.
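Decoder-side routing mirrors the encoder sketch given earlier: read the per-frame classification flag and dispatch to the time-domain (speech) or frequency-domain (music) decoder. The one-byte flag framing remains the same toy assumption, not the publication's bitstream format.

    from typing import Callable

    def decode_frame(packet: bytes, speech_decoder: Callable,
                     music_decoder: Callable):
        flag, payload = packet[:1], packet[1:]
        if flag == b"\x01":
            return speech_decoder(payload)  # time-domain decoding path
        return music_decoder(payload)       # frequency-domain decoding path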
  • the present invention can also be embodied as computer-readable code on a computer-readable recording medium.
  • the computer-readable recording medium is any data storage device that can store data which can be thereafter read by a computer system.
  • embodiments of the present invention can also be implemented through computer readable code/instructions in/on a medium, e.g., a computer readable medium, to control at least one processing element to implement any above described embodiment.
  • a medium e.g., a computer readable medium
  • the medium can correspond to any medium/media permitting the storing and/or transmission of the computer readable code.
  • the computer readable code can be recorded/transferred on a medium in a variety of ways, with examples of the medium including recording media, such as magnetic storage media (e.g., ROM, floppy disks, hard disks, etc.) and optical recording media (e.g., CD-ROMs, or DVDs), and transmission media such as carrier waves, as well as through the Internet, for example.
  • the medium may further be a signal, such as a resultant signal or bitstream, according to embodiments of the present invention.
  • the media may also be a distributed network, so that the computer readable code is stored/ transferred and executed in a distributed fashion.
  • the processing element could include a processor or a computer processor, and processing elements may be distributed and/or included in a single device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Provided are a classifying method and apparatus for an audio signal, and an encoding/decoding method and apparatus for an audio signal using the classifying method and apparatus. In the classification method, an audio signal is classified by adaptively adjusting a classification threshold for a frame of the audio signal that is to be classified according to a long-term feature of the audio signal, thereby improving a hit rate of signal classification, suppressing frequent mode switching per frame, improving noise tolerance, and providing smooth reconstruction of the audio signal.

Description

METHOD, MEDIUM, AND APPARATUS TO CLASSIFY FOR AUDIO SIGNAL, AND METHOD, MEDIUM AND APPARATUS TO ENCODE AND/OR DECODE FOR AUDIO SIGNAL USING
THE SAME
Technical Field
[1] The present general inventive concept relates to a method and apparatus to classify an audio signal and a method and apparatus to encode and/or decode an audio signal using the classifying method and apparatus, and more particularly, to a system that classifies audio signals into music signals and speech signals, an encoding apparatus that encodes an audio signal according to whether it is a music signal or a speech signal, and an audio signal classifying method and apparatus which can be applied to Universal Codec and the like.
Background Art
[2] Audio signals can be classified into various types, such as speech signals, music signals, or mixtures of speech signals and music signals, according to their characteristics, and different coding methods or compression methods are applied to these types. Compression methods for audio signals can be roughly divided into an audio codec and a speech codec. The audio codec, such as Advanced Audio Coding Plus (aacPlus), is intended to compress music signals. The audio codec compresses a music signal in a frequency domain using a psychoacoustic model. When a speech signal is compressed using the audio codec, sound quality degradation is worse than that caused by compression of an audio signal using the speech codec, and becomes more serious when the speech signal includes an attack signal. The speech codec, such as Adaptive Multi Rate - WideBand (AMR-WB), is intended to compress speech signals. The speech codec compresses an audio signal in a time domain using an utterance model. When an audio signal is compressed using the speech codec, sound quality degradation is worse than that caused by compression of a speech signal using the audio codec. Accordingly, it is important to classify an audio signal into an exact type.
[3] U.S. Patent No. 6,134,518 discloses a method for coding a digital audio signal using a CELP coder and a transform coder. Referring to FIG. 1, a classifier 20 measures the autocorrelation of an input audio signal 10 to select one of a CELP coder 30 and a transform coder 40 based on the measurement. The input audio signal 10 is coded by whichever one of the CELP coder 30 and the transform coder 40 is selected, by switching of a switch 50. The US patent discloses the classifier 20 that calculates a probability that a current audio signal is a speech signal or a music signal using autocorrelation in the time domain.
[4] However, because of weak noise tolerance, the disclosed technique has a low hit rate of signal classification under noisy conditions. Moreover, frequent oscillation of an audio signal mode in frame units cannot provide a smooth reconstructed audio signal. Disclosure of Invention Technical Solution
[5] The present invention provides a classifying method and apparatus for an audio signal, in which a classification threshold for a current frame that is to be classified is adaptively adjusted according to a long-term feature of the audio signal in order to classify the current frame, thereby improving the hit rate of signal classification, suppressing frequent oscillation of a mode in frame units, improving noise tolerance, and improving smoothness of a reconstructed audio signal; and an encoding/decoding method and apparatus for an audio signal using the classifying method and apparatus.
[6] According to an aspect of the present invention, there is provided a method of classifying an audio signal, comprising: (a) analyzing the audio signal in units of frames, and generating a short-term feature and a long-term feature from the result of analyzing; (b) adaptively adjusting a classification threshold for a current frame that is to be classified, according to the generated long-term feature; and (c) classifying the current frame using the adjusted classification threshold.
[7] According to another aspect of the present invention, there is provided an apparatus for classifying an audio signal, comprising: a short-term feature generation unit to analyze the audio signal in units of frames and generating a short-term feature; a long- term feature generation unit to generate a long-term feature using the short-term feature; a classification threshold adjustment unit to adaptively adjust a classification threshold for a current frame that is to be classified, by using the generated long-term feature; and a classification unit to classify the current frame using the adjusted classification threshold.
[8] According to another aspect of the present invention, there is provided an apparatus for encoding an audio signal, comprising: a short-term feature generation unit to analyze an audio signal in units of frames and generate a short-term feature; a long-term feature generation unit to generate a long-term feature using the short-term feature; a classification threshold adjustment unit to adaptively adjust a classification threshold for a current frame that is to be classified, using the generated long-term feature; a classification unit to classify the current frame using the adaptively adjusted classification threshold; an encoding unit to encode the classified audio signal in units of frames; and a multiplexer to perform bitstream processing on the encoded signal so as to generate a bitstream.
[9] According to another aspect of the present invention, there is provided a method of decoding an audio signal, comprising: receiving a bitstream including classification information regarding each of frames of an audio signal, where the classification information is adaptively determined using a long-term feature of the audio signal; determining a decoding mode for the audio signal based on the classification information; and decoding the received bitstream according to the determined decoding mode.
[10] According to another aspect of the present invention, there is provided an apparatus for decoding an audio signal, comprising: a receipt unit to receive a bitstream including classification information for each of frames of an audio signal, where the classification information is adaptively determined using a long-term feature of the audio signal; a decoding mode determination unit to determine a decoding mode for the received bitstream according to the classification information; and a decoding unit to decode the received bitstream according to the determined decoding mode.
[11] According to another aspect of the present invention, there is provided a computer readable medium having recorded thereon a computer program for executing the method of classifying an audio signal. Description of Drawings
[12] These and/or other aspects and utilities of the present general inventive concept will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
[13] FIG. 1 is a block diagram of a conventional audio signal encoder;
[14] FIG. 2 is a block diagram of an apparatus to encode for an audio signal according to an embodiment of the present general inventive concept;
[15] FIG. 3 is a block diagram of an apparatus to classify for an audio signal according to an embodiment of the present general inventive concept;
[16] FIG. 4 is a detailed block diagram of a short-term feature generation unit and a long-term feature generation unit illustrated in FIG. 3;
[17] FIG. 5 is a detailed block diagram of a linear prediction-long-term prediction (LP-LTP) gain generation unit illustrated in FIG. 4;
[18] FIG. 6A is a screen shot illustrating a variation feature SNR_VAR of an LP-LTP gain according to a music signal and a speech signal;
[19] FIG. 6B is a reference diagram illustrating the distribution feature of a frequency percent according to the variation feature SNR_VAR of FIG. 6A;
[20] FIG. 6C is a reference diagram illustrating the distribution feature of cumulative frequency percent according to the variation feature SNR_VAR of FIG. 6A;
[21] FIG. 6D is a reference diagram illustrating a long-term feature SNR_SP according to the LP-LTP gain of FIG. 6A;
[22] FIG. 7A is a screen shot illustrating a variation feature TILT_VAR of a spectrum tilt according to a music signal and a speech signal;
[23] FIG. 7B is a reference diagram illustrating a long-term feature TILT_SP of the spectrum tilt of FIG. 7A;
[24] FIG. 8A is a screen shot illustrating a variation feature ZC_VAR of a zero crossing rate according to a music signal and a speech signal;
[25] FIG. 8B is a reference diagram illustrating a long-term feature ZC_SP with respect to the zero crossing rate of FIG. 8A;
[26] FIG. 9 is a reference diagram illustrating a long-term feature SPP according to a music signal and a speech signal;
[27] FIG. 10 is a flowchart illustrating a method to classify an audio signal according to an embodiment of the present general inventive concept; and
[28] FIG. 11 is a block diagram of an apparatus to decode for an audio signal according to an exemplary embodiment of the present general inventive concept. Mode for Invention
[29] Reference will now be made in detail to the embodiments of the present general inventive concept, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present general inventive concept by referring to the figures.
[30] FIG. 2A is a block diagram of an apparatus to encode for an audio signal according to an embodiment of the present general inventive concept. Referring to FIG. 2A, the apparatus to encode for an audio signal includes an audio signal classifying apparatus 100, a speech coding unit 200, a music coding unit 300, and a bitstream multiplexer 400.
[31] The audio signal classifying apparatus 100 divides an input audio signal into frames based on the input time of the audio signal, and determines whether each of the frames is a speech signal or a music signal. The audio signal classifying apparatus 100 transmits, as additional information, classification information indicating whether a current frame is a speech signal or a music signal to the bitstream multiplexer 400. The detailed construction of the audio signal classifying apparatus 100 is illustrated in FIG. 3 and will be described later. Also, the audio signal classifying apparatus 100 may further include a time-to-frequency conversion unit (not shown) that converts an audio signal in the time domain into a signal in the frequency domain.
[32] The speech coding unit 200 encodes an audio signal corresponding to a frame that is classified into the speech signal by the audio signal classifying apparatus 100, and transmits the encoded audio signal to the bitstream multiplexer 400.
[33] In the current embodiment, encoding is performed by the speech coding unit 200 and the music coding unit 300, but an audio signal may be encoded by a time-domain coding unit and a frequency-domain coding unit. In this case, it is efficient to encode a speech signal by using a time-domain coding method, and encode a music signal by using a frequency-domain coding method. Code excited linear prediction (CELP) may be employed as the time-domain coding method, and transform coded excitation (TCX) and advanced audio codec (AAC) may be employed as the frequency-domain coding method.
[34] The bitstream multiplexer 400 receives the encoded audio signal from the speech coding unit 200 or the music coding unit 300 and the classification information from the audio signal classifying apparatus 100, and generates a bitstream using the received signal and the classification information. In particular, the classification information included in the generated bitstream can be used in a decoding mode to determine a method of efficiently reconstructing the audio signal.
[35] FIG. 3 is a block diagram of an audio signal classifying apparatus 100 according to an exemplary embodiment of the present invention. Referring to FIG. 3, the audio signal classifying apparatus 100 includes an audio signal division unit 110, a short- term feature generation unit 120, a long-term feature generation unit 130, a buffer 160 including a short-term feature buffer 161 and a long-term feature buffer 162, a long- term feature comparison unit 170, a classification threshold adjustment unit 180, and a classification unit 190.
[36] The audio signal division unit 110 divides an input audio signal into frames in the time domain and transmits the divided audio signal to the short-term feature generation unit 120.
[37] The short-term feature generation unit 120 performs short-term analysis with respect to the divided audio signal to generate a short-term feature. In the current embodiment, the short-term feature is the unique feature of each frame, the use of which can determine whether the current frame is in a music mode or a speech mode and which one of the time domain and the frequency domain is an efficient encoding domain for the current frame.
[38] The short-term feature may include a linear prediction-long-term prediction
(LP-LTP) gain, a spectrum tilt, a zero crossing rate, a spectrum autocorrelation, and the like.
[39] The short-term feature generation unit 120 may independently generate and output one short-term feature or a plurality of short-term features, or output the sum of a plurality of weighted short-term features as a representative short-term feature. The detailed structure of the short-term feature generation unit 120 is illustrated in FIG. 4 and will be described later.
[40] The long-term feature generation unit 130 generates a long-term feature using the short-term feature generated by the short-term feature generation unit 120 and features that are stored in the short-term feature buffer 161 and the long-term feature buffer 162. The long-term feature generation unit 130 includes a first long-term feature generation unit 140 and a second long-term feature generation unit 150.
[41] The first long-term feature generation unit 140 obtains information about the short- term features of 5 consecutive previous frames preceding the current frame from the short-term feature buffer 161 to calculate an average value and calculates the difference between the short-term feature of the current frame and the calculated average value, thereby generating a variation feature.
[42] When the short-term feature is an LP-LTP gain, the average value is an average of the LP-LTP gains of the previous frames preceding the current frame, and the variation feature describes how much the LP-LTP gain of the current frame deviates from that average over a predetermined term. As can be seen in FIG. 6B, the variation feature Signal to Noise Ratio Variation (SNR_VAR) is distributed over a wide area when the audio signal is a speech signal or in a speech mode, while it is concentrated over a small area when the audio signal is a music signal or in a music mode.
[43] The second long-term feature generation unit 150 generates a long-term feature as a moving average that takes account of the per-frame change in the variation feature generated by the first long-term feature generation unit 140 under a predetermined constraint. Here, the predetermined constraint is the condition and method for applying a weight to the variation feature of the previous frame preceding the current frame. The second long-term feature generation unit 150 distinguishes between the case where the variation feature of the current frame is greater than a predetermined threshold and the case where it is less than the predetermined threshold, and applies different weights to the variation feature of the previous frame and the variation feature of the current frame, thereby generating a long-term feature. Here, the predetermined threshold is a preset value for distinguishing between a speech signal and a music signal. The generation of the long-term feature will later be described in more detail.
[44] As mentioned above, the buffer 160 includes the short-term feature buffer 161 and the long-term feature buffer 162. The short-term feature buffer 161 stores a short-term feature generated by the short-term feature generation unit 120 for at least a predetermined period of time, and the long-term feature buffer 162 stores a long-term feature generated by the first long-term feature generation unit 140 and the second long-term feature generation unit 150 for at least a predetermined period of time.
[45] The long-term feature comparison unit 170 compares the long-term feature generated by the second long-term feature generation unit 150 with a predetermined threshold. Here, the predetermined threshold is a long-term feature value above which there is a high possibility that the current signal is a speech signal, and is determined in advance by statistical analysis. When a threshold SpThr for the long-term feature is set as illustrated in FIG. 9B and the long-term feature generated by the second long-term feature generation unit 150 is greater than the threshold SpThr, the possibility that the current frame is a music signal is less than 1%. In other words, when the long-term feature is greater than the threshold, the current frame can be classified as a speech signal.
[46] When the long-term feature is less than the threshold, the type of the current frame can be determined by adjusting a classification threshold and comparing the short-term feature with the adjusted classification threshold. The threshold may be adjusted based on the hit rate of classification; as illustrated in FIG. 9B, setting the threshold low lowers the hit rate of classification.
[47] The classification threshold adjustment unit 180 adaptively adjusts the classification threshold that is referred to for classifying the current frame when the long-term feature generated by the second long-term feature generation unit 150 is less than the threshold, i.e., when it is difficult to determine the type of the current frame with the long-term feature alone.
[48] The classification threshold adjustment unit 180 receives classification information of a previous frame from the classification unit 190, and adjusts the classification threshold adaptively according to whether the previous frame is classified into the speech signal or the music signal. The classification threshold is used to determine whether the short-term feature of a frame that is to be classified, i.e., the current frame, has a property of the speech signal or the music signal. The main technical idea of the current embodiment is that the classification threshold is adjusted according to whether a previous frame preceding the current frame is classified into the speech signal or the music signal. The adjustment of the classification threshold will later be described in detail.
[49] The classification unit 190 compares the short-term feature of the current frame with the classification threshold STF_THR adjusted by the classification threshold adjustment unit 180 in order to determine whether the current frame is a speech signal or a music signal.
[50] FIG. 4 is a detailed block diagram of the short-term feature generation unit 120 and the long-term feature generation unit 130 illustrated in FIG. 3. The short-term feature generation unit 120 includes an LP-LTP gain generation unit 121, a spectrum tilt generation unit 122, and a zero crossing rate (ZCR) generation unit 123. The long-term feature generation unit 130 includes an LP-LTP moving average calculation unit 141, a spectrum tilt moving average calculation unit 142, a zero crossing rate moving average calculation unit 143, a first variation feature comparison unit 151, a second variation feature comparison unit 152, a third variation feature comparison unit 153, an SNR_SP calculation unit 154, a TILT_SP calculation unit 155, a ZC_SP calculation unit 156, and an SPP generation unit 157.

[51] The LP-LTP gain generation unit 121 generates an LP-LTP gain of the current frame by short-term analysis with respect to each frame of the input audio signal.

[52] FIG. 5 is a detailed block diagram of the LP-LTP gain generation unit 121. Referring to FIG. 5, the LP-LTP gain generation unit 121 includes an LP analysis unit 121a, an open-loop pitch analysis unit 121b, an LTP contribution synthesis unit 121c, and a weighted SegSNR calculation unit 121d.

[53] The LP analysis unit 121a calculates PrdErr and r[0] by performing linear analysis with respect to the audio signal corresponding to the current frame, and calculates an LPC gain using the calculated values as follows:

[54]    LPC gain = -10. * log10(PrdErr / (r[0] + 0.0000001))    (1)

[55] where PrdErr is the prediction error obtained from the Levinson-Durbin recursion used to compute the LP filter coefficients, and r[0] is the first reflection coefficient.
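As an illustration of Equation (1), the following Python sketch obtains PrdErr from the Levinson-Durbin recursion on the frame's autocorrelation and converts it to an LPC gain in decibels. The LP order of 10 is an assumption of the sketch; the text does not fix it.

import numpy as np

def lpc_gain(frame, order=10):
    # Autocorrelation r[0..order] of the frame.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    err = r[0]                      # prediction error, refined at each step
    a = np.zeros(order + 1)
    a[0] = 1.0
    for i in range(1, order + 1):   # Levinson-Durbin recursion
        k = -np.dot(a[:i], r[i:0:-1]) / (err + 1e-12)
        a[1:i + 1] += k * a[i - 1::-1]
        err *= (1.0 - k * k)        # PrdErr after including coefficient i
    err = max(err, 1e-12)           # numerical guard for the logarithm
    return -10.0 * np.log10(err / (r[0] + 0.0000001))   # Equation (1)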
[56] The LP analysis unit 121a calculates a linear prediction coefficient (LPC) using autocorrelation with respect to the current frame. At this time, a short-term analysis filter is specified by the LPC and a signal passing through the specified filter is transmitted to the open-loop pitch analysis unit 121b.
[57] The open-loop pitch analysis unit 121b calculates a pitch correlation by performing long-term analysis on the audio signal filtered by the short-term analysis filter. The open-loop pitch analysis unit 121b calculates the open-loop pitch lag that maximizes the cross correlation between the audio signal corresponding to a previous frame stored in the buffer 160 and the audio signal corresponding to the current frame, and specifies a long-term analysis filter using the calculated lag. The open-loop pitch analysis unit 121b obtains a pitch using the correlation between the previous audio signal and the current audio signal obtained by the LP analysis unit 121a, and normalizes that correlation, thereby calculating a normalized pitch correlation. The normalized pitch correlation r_x can be calculated as follows:

[58]    r_x = Σ x_i * x_(i-T) / √( Σ x_i^2 * Σ x_(i-T)^2 )    (2)

[59] where T is an estimate of the open-loop pitch period and x_i is a weighted value of the input signal.
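The following sketch illustrates Equation (2) as a search over candidate lags that keeps the lag T maximizing the normalized correlation. The search range of 20 to 147 samples is an assumption of the sketch, not a value given in the text.

import numpy as np

def open_loop_pitch(x, lag_min=20, lag_max=147):
    # x is the weighted (LP-filtered) signal; return the lag with the
    # largest normalized correlation r_x and that correlation value.
    best_lag, best_rx = lag_min, -1.0
    for T in range(lag_min, min(lag_max, len(x) - 1) + 1):
        num = np.dot(x[T:], x[:-T])
        den = np.sqrt(np.dot(x[T:], x[T:]) * np.dot(x[:-T], x[:-T])) + 1e-12
        rx = num / den
        if rx > best_rx:
            best_lag, best_rx = T, rx
    return best_lag, best_rx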
[60] The LP-LTP synthesis unit 121c receives zero excitation as an input and performs LP-LTP synthesis.
[61] The weighted SegSNR calculation unit 121d calculates an LP-LTP gain of a reconstructed signal received from the LP-LTP synthesis unit 121c. The LP-LTP gain, which is a short-term feature of the current frame, is transmitted to the LP_LTP moving average calculation unit 141.
[62] The LP_LTP moving average calculation unit 141 calculates an average of LP-LTP gains of a predetermined number of previous frames preceding the current frame, which are stored in the short-term feature buffer 161.
[63] The first variation feature comparison unit 151 receives a difference SNR_VAR between the moving average calculated by the LP_LTP moving average calculation unit 141 and the LP-LTP gain of the current frame, and compares the received difference with a predetermined threshold SNR_THR.
[64] The SNR_SP calculation unit 154 calculates a long-term feature SNR_SP by an 'if conditional statement according to the comparison result obtained by the first variation feature comparison unit 151 , as follows:
165 ' If (SNR _ VA R > SNR _ THR)
SNR _SP = aλ * SNR _ SP + (1 - α, ) * SNR _ VA R else
SNR JSP = Dx (3).
[66] where an initial value of
SNR _ SP is O, is a real number between 0 and 1 and is a weight for SNR _ SP and SNR _VAR
, and
A
IS βx x (SNR _ THR I LT - LTP gain) in which
Figure imgf000011_0001
is a constant indicating the degiee of reduction [67] In Equation (3), α, is a constant that suppresses a mode change between the speech mode and the music mode, caused by noise, and the larger a\ allows smoother ieconstruction of an audio signal According to the 'if conditional statement expressed by Equation (3), the long-term featuie SNR_SP increases when SNR_VAR is greater than the thieshold SNR_THR and the long-teim feature SNR_SP is i educed fiom SNR_SP of a pievious frame by a predetermined value when SNR_VAR is less than the threshold SNRJTHR
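The Equation (3) update can be sketched generically, since Equations (5) and (7) below apply the same rule to TILT_SP and ZC_SP. The values of alpha, beta, and thr are placeholders for the statistically derived constants, which the text does not quote.

def update_speech_possibility(sp_prev, var, stf, thr, alpha=0.9, beta=0.5):
    # var is the variation feature and stf is the current frame's short-term
    # feature (the LP-LTP gain in the Equation (3) case).
    if var > thr:
        return alpha * sp_prev + (1.0 - alpha) * var
    return beta * (thr / (stf + 1e-12))   # the D term of the 'else' branch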
[68] The SNR_SP calculation unit 154 calculates the long-term feature SNR_SP by executing the 'if' conditional statement expressed by Equation (3) for each frame of the input audio signal. SNR_VAR is itself a kind of long-term feature, but it is transformed into SNR_SP, which has the distribution illustrated in FIG. 6D.
[69] FIGS. 6A through 6D are reference diagrams for explaining the distribution features of SNR_VAR, SNR_THR, and SNR_SP according to the current exemplary embodiment.

[70] FIG. 6A is a screen shot illustrating the variation feature SNR_VAR of the LP-LTP gain according to a music signal and a speech signal. It can be seen from FIG. 6A that SNR_VAR generated by the LP-LTP gain generation unit 121 has different distributions according to whether the input signal is a speech signal or a music signal.

[71] FIG. 6B is a reference diagram illustrating the statistical distribution feature of the frequency percent according to the variation feature SNR_VAR of the LP-LTP gain. In FIG. 6B, the vertical axis indicates the frequency percent, i.e., (frequency of SNR_VAR / total frequency) × 100%. An uttered speech signal is generally composed of voiced sound, unvoiced sound, and silence. The voiced sound has a large LP-LTP gain, while the unvoiced sound and silence have small LP-LTP gains. Thus, most speech signals, which switch between voiced and unvoiced sound, have a large SNR_VAR within a predetermined interval. Music signals, however, are continuous or exhibit small LP-LTP gain changes and thus have a smaller SNR_VAR than speech signals.

[72] FIG. 6C is a reference diagram illustrating the statistical distribution feature of the cumulative frequency percent according to the variation feature SNR_VAR of the LP-LTP gain. Since music signals are mostly distributed in an area having small SNR_VAR, the possibility of the presence of a music signal is very low when SNR_VAR is greater than a predetermined threshold, as can be seen in the cumulative curve. A speech signal has a gentler cumulative curve than a music signal. In this case, THR may be defined as P(music|S) - P(speech|S), and the SNR_VAR for the maximum THR may be defined as SNR_THR. Here, P(music|S) is the probability that the current audio signal is a music signal under a condition S, and P(speech|S) is the probability that the current audio signal is a speech signal under the condition S. In the current embodiment, SNR_THR is employed as the criterion for executing the conditional statement for obtaining SNR_SP, thereby improving the accuracy of distinguishing between a speech signal and a music signal.

[73] FIG. 6D is a reference diagram illustrating the long-term feature SNR_SP according to the LP-LTP gain. The SNR_SP calculation unit 154 generates the new long-term feature SNR_SP for SNR_VAR having the distribution illustrated in FIG. 6A by executing the conditional statement. It can also be seen from FIG. 6D that the SNR_SP values for a speech signal and a music signal, which are obtained by executing the conditional statement according to the threshold SNR_THR, are clearly distinguished from each other.
[74] The spectrum tilt generation unit 122 generates a spectrum tilt of the current frame using short-term analysis for each frame of the input audio signal. The spectrum tilt is the ratio of the energy of the low-band spectrum to the energy of the high-band spectrum and is calculated as follows:

[75]    spectrum tilt = E_l / E_h    (4)

[76] where E_h is the average energy in a high band and E_l is the average energy in a low band. The spectrum tilt moving average calculation unit 142 calculates an average of the spectrum tilts of a predetermined number of frames preceding the current frame, which are stored in the short-term feature buffer 161, or calculates an average of spectrum tilts including the spectrum tilt of the current frame generated by the spectrum tilt generation unit 122.
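A sketch of Equation (4), taking the band energies from the frame's power spectrum; splitting the spectrum at its midpoint is an assumption of the sketch, since the text does not specify the band edge.

import numpy as np

def spectrum_tilt(frame):
    spec = np.abs(np.fft.rfft(frame)) ** 2   # power spectrum of the frame
    mid = len(spec) // 2
    e_low = np.mean(spec[:mid])              # E_l: average low-band energy
    e_high = np.mean(spec[mid:]) + 1e-12     # E_h: average high-band energy
    return e_low / e_high                    # Equation (4)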
[77] The second variation feature comparison unit 152 receives a difference TILT_VAR between the average generated by the spectrum tilt moving average calculation unit 142 and the spectrum tilt of the current frame generated by the spectrum tilt generation unit 122, and compares the received difference with a predetermined threshold TILT_THR.
[78] The TILT_SP calculation unit 155 calculates a tilt speech possibility TILT_SP, which is a long-term feature, by executing an 'if' conditional statement expressed by Equation (5) according to the comparison result obtained by the second variation feature comparison unit 152, as follows:

[79]    if (TILT_VAR > TILT_THR)
            TILT_SP = α2 * TILT_SP + (1 - α2) * TILT_VAR
        else
            TILT_SP = D2    (5)

[80] where the initial value of TILT_SP is 0, α2 is a real number between 0 and 1 that weights TILT_SP against TILT_VAR, and D2 is β2 * (TILT_THR / spectrum tilt), in which β2 is a constant indicating the degree of reduction. A detailed description of what TILT_SP has in common with SNR_SP will not be repeated.
[81] FIG. 7A is a screen shot illustrating the variation feature TILT_VAR of the spectrum tilt according to a music signal and a speech signal. The variation feature TILT_VAR generated by the spectrum tilt generation unit 122 differs according to whether the input signal is a speech signal or a music signal.
[82] FIG. 7B is a reference diagram illustrating the long-term feature TILT_SP of the spectrum tilt. The TILT_SP calculation unit 155 generates a new long-term feature TILT_SP by executing the conditional statement with respect to TILT_VAR having the distribution illustrated in FIG. 7A. It can also be seen from FIG. 7B that the TILT_SP values for a speech signal and a music signal, which are obtained by executing the conditional statement according to the threshold TILT_THR, are clearly distinguished from each other.
[83] The ZCR generation unit 123 generates a zero crossing rate of the current frame by performing short-term analysis on each frame of the input audio signal. The zero crossing rate is the frequency of occurrence of sign changes in the input samples of the current frame and is calculated according to a conditional statement as follows:

[84]    if (S(n) * S(n-1) < 0) ZCR = ZCR + 1    (6)

[85] where S(n) indicates whether the audio signal at sample n of the current frame is a positive value or a negative value, and the initial value of ZCR is 0.
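Equation (6) amounts to counting sign changes between consecutive samples of the frame, as in the following sketch.

def zero_crossing_rate(frame):
    zcr = 0
    for prev, cur in zip(frame[:-1], frame[1:]):
        if prev * cur < 0:    # S(n) * S(n-1) < 0: the signal crossed zero
            zcr += 1
    return zcr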
[86] The ZCR moving average calculation unit 143 calculates an average of the zero crossing rates of a predetermined number of previous frames preceding the current frame, which are stored in the short-term feature buffer 161, or calculates an average of zero crossing rates including the zero crossing rate of the current frame, which is generated by the ZCR generation unit 123.

[87] The third variation feature comparison unit 153 receives a difference ZC_VAR between the average generated by the ZCR moving average calculation unit 143 and the zero crossing rate of the current frame generated by the ZCR generation unit 123, and compares the received difference with a predetermined threshold ZC_THR.

[88] The ZC_SP calculation unit 156 calculates ZC_SP, which is a long-term feature, by executing an 'if' conditional statement expressed by Equation (7) according to the comparison result obtained by the third variation feature comparison unit 153, as follows:
[89]    if (ZC_VAR > ZC_THR)
            ZC_SP = α3 * ZC_SP + (1 - α3) * ZC_VAR
        else
            ZC_SP = D3    (7)

[90] where the initial value of ZC_SP is 0, α3 is a real number between 0 and 1 that weights ZC_SP against ZC_VAR, and D3 is β3 * (ZC_THR / zero-crossing rate), in which β3 is a constant indicating the degree of reduction and zero-crossing rate is the zero crossing rate of the current frame. A detailed description of what ZC_SP has in common with SNR_SP will not be repeated.
[91] FIG. 8A is a screen shot illustrating the variation feature ZC_VAR of the zero crossing rate according to a music signal and a speech signal. ZC_VAR generated by the ZCR generation unit 123 differs according to whether the input signal is a speech signal or a music signal.

[92] FIG. 8B is a reference diagram illustrating the long-term feature ZC_SP of the zero crossing rate. The ZC_SP calculation unit 156 generates a new long-term feature value ZC_SP by executing the conditional statement with respect to ZC_VAR having the distribution illustrated in FIG. 8A. It can also be seen from FIG. 8B that the ZC_SP values for a speech signal and a music signal, which are obtained by executing the conditional statement according to the threshold ZC_THR, are clearly distinguished from each other.
[93] The SPP generation unit 157 generates a speech presence possibility (SPP) using a long-term feature calculated by each of the SNR_SP calculation unit 154, the TILT_SP calculation unit 155, and the ZC_SP calculation unit 156, as follows:
[94]    SPP = SNR_W * SNR_SP + TILT_W * TILT_SP + ZC_W * ZC_SP    (8)

[95] where SNR_W is a weight for SNR_SP, TILT_W is a weight for TILT_SP, and ZC_W is a weight for ZC_SP.
[96] Referring to FIGS. 6C, 7B, and 8B, SNR_W is calculated by multiplying P(music|S) - P(speech|S) = 0.46 (46%) according to SNR_THR by a predetermined normalization factor. Here, although there is no special restriction on the normalization factor, the SNR_SP value (= 7.5) at a 90% SNR_SP cumulative probability of a speech signal may be set as the normalization factor. Similarly, TILT_W is calculated using P(music|T) - P(speech|T) = 0.35 (35%) according to TILT_THR and a normalization factor for TILT_SP; the normalization factor for TILT_SP is the TILT_SP value (= 45) at a 90% TILT_SP cumulative probability of a speech signal. ZC_W can also be calculated using P(music|Z) - P(speech|Z) = 0.32 (32%) according to ZC_THR and a normalization factor (= 75) for ZC_SP.
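Reading the normalization of paragraph [96] as dividing each discrimination power by the quoted 90% cumulative value, so that the three long-term features contribute on comparable scales, Equation (8) can be sketched as follows. This reading is an interpretation; the text is ambiguous about whether the factor multiplies or divides.

SNR_W = 0.46 / 7.5     # P(music|S) - P(speech|S) over the SNR_SP factor
TILT_W = 0.35 / 45.0   # likewise for the spectrum tilt feature
ZC_W = 0.32 / 75.0     # likewise for the zero crossing rate feature

def speech_presence_possibility(snr_sp, tilt_sp, zc_sp):
    return SNR_W * snr_sp + TILT_W * tilt_sp + ZC_W * zc_sp   # Equation (8)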
[97] FIG. 9A is a reference diagram illustrating the distribution feature of the SPP generated by the SPP generation unit 157. The short-term features generated by the LP-LTP gain generation unit 121, the spectrum tilt generation unit 122, and the ZCR generation unit 123 are transformed into a new long-term feature SPP by the above-described process, and a speech signal and a music signal can be more clearly distinguished from each other based on the long-term feature SPP.
[98] FIG. 9B is a reference diagram illustrating a cumulative long-term feature according to the long-term feature SPP of FIG. 9A. A long-term feature threshold SpThr may be set to the SPP for a 99% cumulative distribution of a music signal. When the SPP of the current frame is greater than the threshold SpThr, the audio signal corresponding to the current frame may be determined to be a speech signal. However, when the SPP of the current frame is less than the threshold SpThr, a classification threshold is adjusted based on whether the previous frame is classified as a speech signal or a music signal, and the adjusted classification threshold is compared with the short-term feature of the current frame, thereby classifying the current frame as a speech signal or a music signal.
[99] As described above, the present invention discloses a method of distinguishing between a speech signal and a music signal included in an audio signal. Voice activity detection (VAD) has been widely used to distinguish a desired signal from the other signals included in an audio signal. However, VAD was designed mainly to process speech signals and is thus unsuitable for an environment in which speech, music, and noise are mixed. According to the present invention, it is possible to classify audio signals into speech signals and music signals, and the present invention can be generally applied to an encoding apparatus that encodes an audio signal according to whether it is a music signal or a speech signal, to a universal codec, and the like.
[100] FIG. 10 is a flowchart illustrating a method of classifying an audio signal according to an exemplary embodiment of the present general inventive concept.
[101] Referring to FIGS. 3 and 10, in operation 1100, the short-term feature generation unit 120 divides an input audio signal into frames and calculates an LP-LTP gain, a spectrum tilt, and a zero crossing rate by performing short-term analysis with respect to each of the frames. Although there is no special restriction on the type of short-term feature, a hit rate of 90% or higher can be achieved when the audio signal is classified in units of frames using these three types of short-term features. The calculation of the short-term features has already been described above and thus is omitted here.
[102] In operation 1200, the long-term feature generation unit 130 calculates long-term features SNR_SP, TILT_SP, and ZC_SP by performing long-term analysis with respect to the short-term features generated by the short-term feature generation unit 120, and applies weights to the long-term features, thereby calculating an SPP.
[103] In operations 1100 and 1200, the short-term features and long-term features of the current frame are calculated, as described above. Although not illustrated in FIG. 10, before performing operations 1100 and 1200, it is necessary to obtain information regarding the distributions of short-term features and long-term features from speech data and music data, and to store the obtained information in a database.
[104] In operation 1300, the long-term feature comparison unit 170 compares the SPP of the current frame calculated in operation 1200 with a preset long-term feature threshold SpThr. When the SPP is greater than SpThr, the current frame is determined to be a speech signal. When the SPP is less than SpThr, a classification threshold is adjusted and compared with a short-term feature, thereby determining the type of the current frame.
[105] In operation 1400, the classification threshold adjustment unit 180 receives classification information about a previous frame from the long-term feature comparison unit 170 or the long-term feature buffer 162, and determines whether the previous frame is classified into a speech signal or a music signal according to the received classification information.
[106] In operation 1410, when the previous frame is classified as a speech signal, the classification threshold adjustment unit 180 outputs a value obtained by dividing the classification threshold STF_THR, used for evaluating the short-term feature of the current frame, by a value Sx. Sx has the attribute of a cumulative probability of a speech signal and is intended to increase or reduce the classification threshold. Referring to FIG. 9A, the SPP for an Sx of 1 is selected as SpSx, and the cumulative probability with respect to each SPP is divided by the cumulative probability with respect to SpSx, thereby calculating a normalized Sx. When the SPP of the current frame is between SpSx and SpThr, the classification threshold STF_THR is reduced in operation 1410 and the possibility that the current frame is determined to be a speech signal is increased.
[107] In operation 1420, when the previous frame is determined to be a music signal, the classification threshold adjustment unit 180 outputs the product of the classification threshold STF_THR for evaluating the short-term feature of the current frame and a value Mx. Mx has the attribute of a cumulative probability of a music signal and is intended to increase or reduce the classification threshold. As illustrated in FIG. 9B, a music presence possibility (MPP) for an Mx of 1 may be set as MpMx, and the probability with respect to each MPP is divided by the probability with respect to MpMx, thereby calculating a normalized Mx. When the MPP is greater than MpMx, the classification threshold STF_THR is increased and the possibility that the current frame is determined to be a music signal is also increased.
[108] In operation 1430, the classification threshold adjustment unit 180 compares the short-term feature of the current frame with the classification threshold STF_THR that is adaptively adjusted in operation 1410 or operation 1420, and outputs the comparison result.
[109] In operation 1500, when it is determined in operation 1430 that the short-term feature of the current frame is less than the adjusted classification threshold STF_THR, the classification unit 190 determines the current frame as the music signal, and outputs the determination result as classification information.
[110] In operation 1600, when it is determined in operation 1430 that the short-term feature of the current frame is greater than the adjusted classification threshold STF_THR, the classification unit 190 determines the current frame as the speech signal, and outputs the determination result as classification information.
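Operations 1300 through 1600 can be summarized in a short sketch; sp_thr, stf_thr, sx, and mx stand in for the statistically derived constants described above.

def classify_frame(spp, stf, prev_is_speech, sp_thr, stf_thr, sx, mx):
    # Operation 1300: a sufficiently high SPP settles the matter at once.
    if spp > sp_thr:
        return True   # speech
    # Operations 1410/1420: bias the classification threshold toward the
    # class of the previous frame (divide for speech, multiply for music).
    if prev_is_speech:
        stf_thr = stf_thr / sx
    else:
        stf_thr = stf_thr * mx
    # Operations 1430 through 1600: compare the short-term feature with the
    # adjusted threshold; greater means speech, less means music.
    return stf > stf_thr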
[111] FIG. 11 is a block diagram of a decoding apparatus 2000 for an audio signal according to an exemplary embodiment of the present general inventive concept.
[112] Referring to FIG. 11, a bitstream receipt unit 2100 receives a bitstream including classification information for each frame of an audio signal. A classification information extraction unit 2200 extracts the classification information from the received bitstream. A decoding mode determination unit 2300 determines a decoding mode for the audio signal according to the extracted classification information, and transmits the bitstream to a music decoding unit 2400 or a speech decoding unit 2500.
[113] The music decoding unit 2400 decodes the received bitstream in the frequency domain, and the speech decoding unit 2500 decodes the received bitstream in the time domain. A mixing unit 2600 mixes the decoded signals in order to reconstruct the audio signal.
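Mirroring the encoder-side sketch given earlier, the per-frame dispatch in the decoding apparatus might look as follows; the leading mode byte and the decoder names remain illustrative assumptions, not the bitstream syntax of the embodiment.

def decode_frame(frame_bytes, decode_celp, decode_transform):
    # The first byte carries the classification information; the remainder
    # is the coded payload, routed to the matching decoder.
    is_speech = frame_bytes[0] == 1
    payload = frame_bytes[1:]
    return decode_celp(payload) if is_speech else decode_transform(payload)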
[114] The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium is any data storage device that can store data which can be thereafter read by a computer system.
[115] In addition to the above-described embodiments, embodiments of the present invention can also be implemented through computer-readable code/instructions in/on a medium, e.g., a computer-readable medium, to control at least one processing element to implement any above-described embodiment. The medium can correspond to any medium/media permitting the storing and/or transmission of the computer-readable code.
[116] The computer-readable code can be recorded/transferred on a medium in a variety of ways, with examples of the medium including recording media, such as magnetic storage media (e.g., ROM, floppy disks, hard disks, etc.) and optical recording media (e.g., CD-ROMs or DVDs), and transmission media such as carrier waves, as well as through the Internet, for example. Thus, the medium may further be a signal, such as a resultant signal or bitstream, according to embodiments of the present invention. The media may also be a distributed network, so that the computer-readable code is stored/transferred and executed in a distributed fashion. Still further, as only an example, the processing element could include a processor or a computer processor, and processing elements may be distributed and/or included in a single device.
[117] While aspects of the present invention have been particularly shown and described with reference to differing embodiments thereof, it should be understood that these exemplary embodiments should be considered in a descriptive sense only and not for purposes of limitation. Any narrowing or broadening of functionality or capability of an aspect in one embodiment should not be considered as a respective broadening or narrowing of similar features in a different embodiment; i.e., descriptions of features or aspects within each embodiment should typically be considered as available for other similar features or aspects in the remaining embodiments.
[118] Thus, although a few embodiments have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents.

Claims

[1] 1. A method of classifying an audio signal, comprising:
(a) analyzing the audio signal in units of frames, and generating a short-term feature and a long-term feature from the result of analyzing;
(b) adaptively adjusting a classification threshold for a current frame that is to be classified, according to the generated long-term feature; and
(c) classifying the current frame using the adjusted classification threshold.
[2] 2. The method of claim 1, further comprising comparing the long-term feature of the current frame with a predetermined threshold, wherein (b) comprises adaptively adjusting the classification threshold according to the comparison result.
[3] 3. The method of claim 1, wherein the generation of the long-term feature comprises generating the long-term feature using a difference between an average of short-term features of a predetermined number of previous frames preceding the current frame and the short-term feature of the current frame.
[4] 4. The method of claim 1, further comprising comparing the long-term feature of the current frame with a predetermined threshold, wherein (b) comprises adaptively adjusting the classification threshold according to the comparison result and the result of classifying a previous frame preceding the current frame.
[5] 5. The method of claim 4, wherein (b) comprises adjusting the classification threshold in such a way as to increase a possibility that the current frame and the previous frame are classified into the same type, when the comparison result reveals that it is difficult to classify the current frame using only the long-term feature of the current frame.
[6] 6. The method of claim 1, wherein (c) comprises dividing the audio signal into frames, and classifying each of the frames into a speech signal or a music signal.
[7] 7. The method of claim 1, wherein during (c), the current frame is classified by comparing the short-term feature of the current frame with the adjusted classification threshold.
[8] 8. The method of claim 3, wherein the generation of the long-term feature comprises: when the difference for the current frame is greater than a predetermined threshold, applying positive weights to the difference for the current frame and a difference for a previous frame preceding the current frame between an average of short-term features of a predetermined number of previous frames preceding the previous frame and the short-term feature of the previous frame, and summing the weight-applied differences so as to generate the long-term feature, and when the difference for the current frame is less than the predetermined threshold, applying a negative weight to the difference for the current frame and a positive weight to the difference for the previous frame, and summing the weight-applied differences or reducing a long-term feature of the previous frame so as to generate the long-term feature.
[9] 9. The method of claim 8, wherein during (c), the audio signal is divided into units of frames and each of the frames is classified into a speech signal or a music signal, and the predetermined threshold used to generate the long-term feature is the difference for which the difference between a possibility of the presence of the speech signal and a possibility of the presence of the music signal is a maximum.
[10] 10. The method of claim 1, wherein the long-term feature is at least one selected from a group consisting of a linear prediction-long-term prediction gain, a spectrum tilt, and a zero crossing rate.
[11] 11. A computer-readable recording medium having recorded thereon a computer program for implementing the method of any one of claims 1 through 10.
[ 12] 12. A method of encoding an audio signal, comprising:
(a) dividing an audio signal in units of frames and classifying the frames according to the method of claim 1;
(b) encoding the audio signal according to the result of classification; and
(c) generating a bitstream by performing bitstream processing on the encoded signal.
[13] 13. The method of claim 12, wherein the generated bitstream includes classification information for the audio signal.
[14] 14. The method of claim 12, wherein the encoding in (b) comprises performing encoding in the time domain when the frames are classified into speech signals, and performing encoding in the frequency domain when the frames are classified into music signals.
[15] 15. An apparatus for classifying an audio signal, comprising: a short-term feature generation unit to analyze the audio signal in units of frames and generate a short-term feature; a long-term feature generation unit to generate a long-term feature using the short-term feature; a classification threshold adjustment unit to adaptively adjust a classification threshold for a current frame that is to be classified, by using the generated long-term feature; and a classification unit to classify the current frame using the adjusted classification threshold.
[16] 16. The apparatus of claim 15, further comprising a long-term feature comparison unit to compare the long-term feature of the current frame with a predetermined threshold, wherein the classification unit classifies the current frame based on a long-term feature of a previous frame preceding the current frame and the comparison result received from the long-term feature comparison unit.
[17] 17. The apparatus of claim 15, wherein the long-term feature generation unit comprises: a first long-term feature generation unit to generate a first long-term feature using short-term features of a predetermined number of previous frames preceding the current frame; and a second long-term feature generation unit to generate a second long-term feature by using the first long-term feature generated by the first long-term feature generation unit and a first long-term feature of the previous frames, wherein the classification threshold adjustment unit adaptively adjusts the classification threshold for the current frame using the second long-term feature generated by the second long-term feature generation unit.
[18] 18. The apparatus of claim 15, wherein the short-term feature generation unit comprises at least one selected from a group consisting of a linear prediction-long-term prediction gain generation unit, a spectrum tilt generation unit, and a zero crossing rate generation unit.
[19] 19. An apparatus for encoding an audio signal, comprising: a short-term feature generation unit to analyze an audio signal in units of frames and generate a short-term feature; a long-term feature generation unit to generate a long-term feature using the short-term feature; a classification threshold adjustment unit to adaptively adjust a classification threshold for a current frame that is to be classified, using the generated long-term feature; a classification unit to classify the current frame using the adaptively adjusted classification threshold; an encoding unit to encode the classified audio signal in units of frames; and a multiplexer to perform bitstream processing on the encoded signal so as to generate a bitstream.
[20] 20. A method of decoding an audio signal, comprising: receiving a bitstream including classification information regarding each of the frames of an audio signal, where the classification information is adaptively determined using a long-term feature of the audio signal; determining a decoding mode for the audio signal based on the classification information; and decoding the received bitstream according to the determined decoding mode.
[21] 21. An apparatus for decoding an audio signal, comprising: a receipt unit to receive a bitstream including classification information for each of the frames of an audio signal, where the classification information is adaptively determined using a long-term feature of the audio signal; a decoding mode determination unit to determine a decoding mode for the received bitstream according to the classification information; and a decoding unit to decode the received bitstream according to the determined decoding mode.
PCT/KR2007/006811 2006-12-28 2007-12-26 Method, medium, and apparatus to classify for audio signal, and method, medium and apparatus to encode and/or decode for audio signal using the same WO2008082133A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP07860649A EP2102860A4 (en) 2006-12-28 2007-12-26 Method, medium, and apparatus to classify for audio signal, and method, medium and apparatus to encode and/or decode for audio signal using the same

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2006-0136823 2006-12-28
KR1020060136823A KR100883656B1 (en) 2006-12-28 2006-12-28 Method and apparatus for discriminating audio signal, and method and apparatus for encoding/decoding audio signal using it

Publications (1)

Publication Number Publication Date
WO2008082133A1 true WO2008082133A1 (en) 2008-07-10

Family

ID=39585193

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2007/006811 WO2008082133A1 (en) 2006-12-28 2007-12-26 Method, medium, and apparatus to classify for audio signal, and method, medium and apparatus to encode and/or decode for audio signal using the same

Country Status (4)

Country Link
US (1) US20080162121A1 (en)
EP (1) EP2102860A4 (en)
KR (1) KR100883656B1 (en)
WO (1) WO2008082133A1 (en)


Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE547898T1 (en) 2006-12-12 2012-03-15 Fraunhofer Ges Forschung ENCODER, DECODER AND METHOD FOR ENCODING AND DECODING DATA SEGMENTS TO REPRESENT A TIME DOMAIN DATA STREAM
RU2454736C2 (en) * 2007-10-15 2012-06-27 ЭлДжи ЭЛЕКТРОНИКС ИНК. Signal processing method and apparatus
JP2011518345A (en) * 2008-03-14 2011-06-23 ドルビー・ラボラトリーズ・ライセンシング・コーポレーション Multi-mode coding of speech-like and non-speech-like signals
KR20100006492A (en) * 2008-07-09 2010-01-19 삼성전자주식회사 Method and apparatus for deciding encoding mode
MX2011000364A (en) * 2008-07-11 2011-02-25 Ten Forschung Ev Fraunhofer Method and discriminator for classifying different segments of a signal.
EP2144231A1 (en) * 2008-07-11 2010-01-13 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Low bitrate audio encoding/decoding scheme with common preprocessing
EP2144230A1 (en) 2008-07-11 2010-01-13 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Low bitrate audio encoding/decoding scheme having cascaded switches
KR101381513B1 (en) * 2008-07-14 2014-04-07 광운대학교 산학협력단 Apparatus for encoding and decoding of integrated voice and music
KR101756834B1 (en) 2008-07-14 2017-07-12 삼성전자주식회사 Method and apparatus for encoding and decoding of speech and audio signal
KR101601906B1 (en) * 2008-07-18 2016-03-10 삼성전자주식회사 Apparatus and method for coding audio signal by swithcing transform scheme among frequency domain transform and time domain transform
US9037474B2 (en) * 2008-09-06 2015-05-19 Huawei Technologies Co., Ltd. Method for classifying audio signal into fast signal or slow signal
MX2011003824A (en) * 2008-10-08 2011-05-02 Fraunhofer Ges Forschung Multi-resolution switched audio encoding/decoding scheme.
CN101751926B (en) * 2008-12-10 2012-07-04 华为技术有限公司 Signal coding and decoding method and device, and coding and decoding system
US9269366B2 (en) * 2009-08-03 2016-02-23 Broadcom Corporation Hybrid instantaneous/differential pitch period coding
CN102982804B (en) * 2011-09-02 2017-05-03 杜比实验室特许公司 Method and system of voice frequency classification
CN103000172A (en) * 2011-09-09 2013-03-27 中兴通讯股份有限公司 Signal classification method and device
US9111531B2 (en) 2012-01-13 2015-08-18 Qualcomm Incorporated Multiple coding mode signal classification
US8712076B2 (en) 2012-02-08 2014-04-29 Dolby Laboratories Licensing Corporation Post-processing including median filtering of noise suppression gains
US9173025B2 (en) 2012-02-08 2015-10-27 Dolby Laboratories Licensing Corporation Combined suppression of noise, echo, and out-of-location signals
CN104078050A (en) 2013-03-26 2014-10-01 杜比实验室特许公司 Device and method for audio classification and audio processing
US9984706B2 (en) * 2013-08-01 2018-05-29 Verint Systems Ltd. Voice activity detection using a soft decision mechanism
CN106409310B (en) 2013-08-06 2019-11-19 华为技术有限公司 A kind of audio signal classification method and apparatus
CN106256001B (en) * 2014-02-24 2020-01-21 三星电子株式会社 Signal classification method and apparatus and audio encoding method and apparatus using the same
US9886963B2 (en) * 2015-04-05 2018-02-06 Qualcomm Incorporated Encoder selection
US10186276B2 (en) * 2015-09-25 2019-01-22 Qualcomm Incorporated Adaptive noise suppression for super wideband music
KR101702565B1 (en) * 2016-03-03 2017-02-03 삼성전자 주식회사 Apparatus and method for coding audio signal by swithcing transform scheme among frequency domain transform and time domain transform
CN111261143B (en) * 2018-12-03 2024-03-22 嘉楠明芯(北京)科技有限公司 Voice wakeup method and device and computer readable storage medium
US10728676B1 (en) * 2019-02-01 2020-07-28 Sonova Ag Systems and methods for accelerometer-based optimization of processing performed by a hearing device
US20220199074A1 (en) * 2019-04-18 2022-06-23 Dolby Laboratories Licensing Corporation A dialog detector


Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5233660A (en) * 1991-09-10 1993-08-03 At&T Bell Laboratories Method and apparatus for low-delay celp speech coding and decoding
TW271524B (en) * 1994-08-05 1996-03-01 Qualcomm Inc
US5751903A (en) * 1994-12-19 1998-05-12 Hughes Electronics Low rate multi-mode CELP codec that encodes line SPECTRAL frequencies utilizing an offset
US6134518A (en) * 1997-03-04 2000-10-17 International Business Machines Corporation Digital audio signal coding using a CELP coder and a transform coder
US6330533B2 (en) * 1998-08-24 2001-12-11 Conexant Systems, Inc. Speech encoder adaptively applying pitch preprocessing with warping of target signal
US6449590B1 (en) * 1998-08-24 2002-09-10 Conexant Systems, Inc. Speech encoder using warping in long term preprocessing
US6385573B1 (en) * 1998-08-24 2002-05-07 Conexant Systems, Inc. Adaptive tilt compensation for synthesized speech residual
US6260010B1 (en) * 1998-08-24 2001-07-10 Conexant Systems, Inc. Speech encoder using gain normalization that combines open and closed loop gains
US6397177B1 (en) * 1999-03-10 2002-05-28 Samsung Electronics, Co., Ltd. Speech-encoding rate decision apparatus and method in a variable rate
US7010480B2 (en) * 2000-09-15 2006-03-07 Mindspeed Technologies, Inc. Controlling a weighting filter based on the spectral content of a speech signal
CA2365203A1 (en) * 2001-12-14 2003-06-14 Voiceage Corporation A signal modification method for efficient coding of speech signals
US7657427B2 (en) * 2002-10-11 2010-02-02 Nokia Corporation Methods and devices for source controlled variable bit-rate wideband speech coding
KR100964402B1 (en) * 2006-12-14 2010-06-17 삼성전자주식회사 Method and Apparatus for determining encoding mode of audio signal, and method and appartus for encoding/decoding audio signal using it

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11175098A (en) * 1997-12-12 1999-07-02 Nec Corp Voice and music encoding system
JP2000267699A (en) * 1999-03-19 2000-09-29 Nippon Telegr & Teleph Corp <Ntt> Acoustic signal coding method and device therefor, program recording medium therefor, and acoustic signal decoding device
US20030101050A1 (en) * 2001-11-29 2003-05-29 Microsoft Corporation Real-time speech and music classifier
KR20030070178A (en) * 2002-02-21 2003-08-29 엘지전자 주식회사 Method and system for real-time music/speech discrimination in digital audio signals
KR20050046204A (en) * 2003-11-13 2005-05-18 한국전자통신연구원 An apparatus for coding of variable bit-rate wideband speech and audio signals, and a method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2102860A4 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9595270B2 (en) 2010-07-02 2017-03-14 Dolby International Ab Selective post filter
US9830923B2 (en) 2010-07-02 2017-11-28 Dolby International Ab Selective bass post filter
US9396736B2 (en) 2010-07-02 2016-07-19 Dolby International Ab Audio encoder and decoder with multiple coding modes
US9552824B2 (en) 2010-07-02 2017-01-24 Dolby International Ab Post filter
US9558753B2 (en) 2010-07-02 2017-01-31 Dolby International Ab Pitch filter for audio signals
US9558754B2 (en) 2010-07-02 2017-01-31 Dolby International Ab Audio encoder and decoder with pitch prediction
US9343077B2 (en) 2010-07-02 2016-05-17 Dolby International Ab Pitch filter for audio signals
US11996111B2 (en) 2010-07-02 2024-05-28 Dolby International Ab Post filter for audio signals
US9224403B2 (en) 2010-07-02 2015-12-29 Dolby International Ab Selective bass post filter
US9858940B2 (en) 2010-07-02 2018-01-02 Dolby International Ab Pitch filter for audio signals
US10236010B2 (en) 2010-07-02 2019-03-19 Dolby International Ab Pitch filter for audio signals
US10811024B2 (en) 2010-07-02 2020-10-20 Dolby International Ab Post filter for audio signals
US11183200B2 (en) 2010-07-02 2021-11-23 Dolby International Ab Post filter for audio signals
US11610595B2 (en) 2010-07-02 2023-03-21 Dolby International Ab Post filter for audio signals
US9711158B2 (en) 2011-01-25 2017-07-18 Nippon Telegraph And Telephone Corporation Encoding method, encoder, periodic feature amount determination method, periodic feature amount determination apparatus, program and recording medium

Also Published As

Publication number Publication date
EP2102860A4 (en) 2011-05-04
EP2102860A1 (en) 2009-09-23
KR20080061758A (en) 2008-07-03
KR100883656B1 (en) 2009-02-18
US20080162121A1 (en) 2008-07-03

Similar Documents

Publication Publication Date Title
EP2102860A1 (en) Method, medium, and apparatus to classify for audio signal, and method, medium and apparatus to encode and/or decode for audio signal using the same
US20080147414A1 (en) Method and apparatus to determine encoding mode of audio signal and method and apparatus to encode and/or decode audio signal using the encoding mode determination method and apparatus
EP1747442B1 (en) Selection of coding models for encoding an audio signal
US8990073B2 (en) Method and device for sound activity detection and sound signal classification
EP1747554B1 (en) Audio encoding with different coding frame lengths
US7693710B2 (en) Method and device for efficient frame erasure concealment in linear predictive based speech codecs
US8725499B2 (en) Systems, methods, and apparatus for signal change detection
EP2301011B1 (en) Method and discriminator for classifying different segments of an audio signal comprising speech and music segments
EP1982329B1 (en) Adaptive time and/or frequency-based encoding mode determination apparatus and method of determining encoding mode of the apparatus
US6564182B1 (en) Look-ahead pitch determination
Jayant et al. Speech coding with time-varying bit allocations to excitation and LPC parameters
Özaydın et al. Matrix quantization and mixed excitation based linear predictive speech coding at very low bit rates
Ojala Toll quality variable-rate speech codec
KR20070017379A (en) Selection of coding models for encoding an audio signal
Rämö et al. Segmental speech coding model for storage applications.
Cuperman et al. Adaptive window excitation coding in low-bit-rate CELP coders
ZA200609478B (en) Audio encoding with different coding frame lengths

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07860649

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2007860649

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE