WO2015018121A1 - Audio signal classification method and apparatus - Google Patents

Audio signal classification method and apparatus

Info

Publication number
WO2015018121A1
WO2015018121A1 (PCT/CN2013/084252)
Authority
WO
WIPO (PCT)
Prior art keywords
frame
audio frame
current audio
spectrum
linear prediction
Prior art date
Application number
PCT/CN2013/084252
Other languages
English (en)
French (fr)
Inventor
王喆
Original Assignee
华为技术有限公司
Priority to EP19189062.3A priority Critical patent/EP3667665B1/en
Priority to KR1020177034564A priority patent/KR101946513B1/ko
Priority to EP21213287.2A priority patent/EP4057284A3/en
Priority to KR1020167006075A priority patent/KR101805577B1/ko
Priority to EP17160982.9A priority patent/EP3324409B1/en
Priority to KR1020207002653A priority patent/KR102296680B1/ko
Priority to SG11201600880SA priority patent/SG11201600880SA/en
Priority to BR112016002409-5A priority patent/BR112016002409B1/pt
Priority to EP13891232.4A priority patent/EP3029673B1/en
Priority to MX2016001656A priority patent/MX353300B/es
Priority to JP2016532192A priority patent/JP6162900B2/ja
Priority to KR1020197003316A priority patent/KR102072780B1/ko
Priority to AU2013397685A priority patent/AU2013397685B2/en
Priority to ES13891232.4T priority patent/ES2629172T3/es
Application filed by 华为技术有限公司
Publication of WO2015018121A1 publication Critical patent/WO2015018121A1/zh
Priority to US15/017,075 priority patent/US10090003B2/en
Priority to HK16107115.7A priority patent/HK1219169A1/zh
Priority to AU2017228659A priority patent/AU2017228659B2/en
Priority to AU2018214113A priority patent/AU2018214113B2/en
Priority to US16/108,668 priority patent/US10529361B2/en
Priority to US16/723,584 priority patent/US11289113B2/en
Priority to US17/692,640 priority patent/US11756576B2/en
Priority to US18/360,675 priority patent/US20240029757A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/02: Speech or audio signal analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/06: Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
    • G10L19/12: Determination or coding of the excitation function, the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
    • G10L25/12: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/81: Detection of presence or absence of voice signals for discriminating voice from music
    • G10L2025/783: Detection of presence or absence of voice signals based on threshold decision

Definitions

  • Such codecs typically include an encoder based on a speech generation model (such as CELP) and a transform-based encoder (such as an MDCT-based encoder).
  • The encoder based on the speech generation model achieves good speech coding quality but relatively poor music coding quality, while the transform-based encoder achieves good music coding quality but poor speech coding quality. The hybrid codec therefore encodes speech signals with the speech-generation-model-based encoder and music signals with the transform-based encoder, obtaining an overall optimal coding effect.
  • A core technology here is the classification of audio signals, or, specific to this application, the choice of coding mode.
  • A hybrid codec needs accurate signal type information in order to make the optimal coding mode choice.
  • The audio signal classifier here can also be considered, roughly, a speech/music classifier.
  • Speech recognition rate and music recognition rate are important indicators of the performance of a speech/music classifier. Music signals in particular are often harder to recognize than speech because of the variety and complexity of their signal characteristics.
  • Recognition delay is also one of the most important indicators. Because speech/music features are ambiguous over short intervals, accurate recognition of speech/music usually requires a relatively long observation period. In general, within a segment of the same signal type, the longer the recognition delay, the more accurate the recognition.
  • the stability of classification is also an important attribute that affects the quality of hybrid encoder coding.
  • Hybrid encoders suffer a quality degradation when switching between the different component encoders. If the classifier switches types frequently within a signal of a single type, the impact on coding quality is considerable; the classifier's output classification results therefore need to be both accurate and smooth.
  • Computational complexity and storage overhead should also be as low as possible to meet commercial needs.
  • The ITU-T standard G.720.1 contains a speech/music classifier.
  • This classifier uses one main parameter, the spectral fluctuation variance var_flux, as the main basis for signal classification, and two different spectral kurtosis parameters p1 and p2 as an auxiliary basis.
  • The classification of the input signal according to var_flux is done in a FIFO var_flux buffer, based on local statistics of var_flux.
  • The specific process is outlined below. First, the spectral fluctuation flux is extracted for each input audio frame and buffered in a first buffer, where flux is calculated over the latest 4 frames including the current input frame (other calculation methods are also possible).
  • Then the variance of flux over the N most recent frames including the current input frame is calculated, yielding the var_flux of the current input frame, which is cached in a second buffer. Next, the number K of frames, among the M latest frames including the current input frame, whose var_flux is greater than a first threshold is counted. If the ratio of K to M is greater than a second threshold, the current input frame is determined to be a speech frame; otherwise it is a music frame.
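The var_flux-based decision described above can be sketched as follows. The buffer lengths, thresholds, and the exact flux definition here are illustrative assumptions, not values taken from G.720.1:

```python
from collections import deque

import numpy as np

# Hypothetical buffer lengths and thresholds, for illustration only.
FLUX_FRAMES = 4      # flux computed over the latest 4 frames
N_VAR = 10           # var_flux computed over the N most recent flux values
M_HIST = 40          # var_flux history length used for the K-of-M vote
THRESHOLD_1 = 0.15   # per-frame var_flux threshold (first threshold)
THRESHOLD_2 = 0.5    # K/M ratio threshold (second threshold)

spectra = deque(maxlen=FLUX_FRAMES)   # recent magnitude spectra
flux_buf = deque(maxlen=N_VAR)        # first buffer: flux values
var_flux_buf = deque(maxlen=M_HIST)   # second buffer: var_flux values

def classify_frame(spectrum):
    """Return 'speech' or 'music' for one input frame's magnitude spectrum."""
    spectra.append(np.asarray(spectrum, dtype=float))
    # Spectral fluctuation: mean absolute difference between adjacent
    # spectra over the latest frames (one of several possible definitions).
    if len(spectra) < 2:
        flux = 0.0
    else:
        diffs = [np.mean(np.abs(spectra[i] - spectra[i - 1]))
                 for i in range(1, len(spectra))]
        flux = float(np.mean(diffs))
    flux_buf.append(flux)

    # Local variance of flux gives var_flux for the current frame.
    var_flux = float(np.var(flux_buf))
    var_flux_buf.append(var_flux)

    # Count frames K whose var_flux exceeds the first threshold,
    # among the M most recent frames; decide by the ratio K/M.
    k = sum(1 for v in var_flux_buf if v > THRESHOLD_1)
    return 'speech' if k / len(var_flux_buf) > THRESHOLD_2 else 'music'
```

The module-level deques play the role of the two FIFO buffers; in a codec they would be per-classifier state.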
  • The auxiliary parameters p1 and p2 are mainly used for correction of the classification and are also calculated for each input audio frame.
  • When one of the auxiliary parameters satisfies its condition, the current input audio frame is directly determined to be a music frame.
  • A shortcoming of this speech/music classifier is that the absolute recognition rate for music still needs improvement.
  • Since the target application of the classifier has no scenario involving mixed signals, the recognition performance on mixed signals also has certain room for improvement.
  • Many existing speech/music classifiers are designed on the principle of pattern recognition. A classifier of this type usually extracts multiple feature parameters (from ten to several tens) for the input audio frame and feeds these parameters into a classifier based on a Gaussian mixture model, a neural network, or another classical classification method.
  • An object of the embodiments of the present invention is to provide an audio signal classification method and apparatus that reduce the complexity of signal classification while ensuring the classification and recognition rate of mixed audio signals.
  • an audio signal classification method including:
  • the current audio frame is classified into a speech frame or a music frame according to a statistic of part or all of the valid data of the spectrum fluctuation stored in the spectrum fluctuation memory.
  • determining whether to obtain the spectrum fluctuation of the current audio frame and storing the data in the spectrum fluctuation memory according to the sound activity of the current audio frame includes:
  • the spectral fluctuation of the current audio frame is stored in the spectrum fluctuation memory.
  • determining whether to obtain the spectrum fluctuation of the current audio frame and storing the data in the spectrum fluctuation memory according to the sound activity of the current audio frame includes: If the current audio frame is an active frame, and the current audio frame does not belong to an energy impact, the spectrum fluctuation of the current audio frame is stored in the spectrum fluctuation memory.
  • determining whether to obtain the spectrum fluctuation of the current audio frame and storing the data in the spectrum fluctuation memory according to the sound activity of the current audio frame includes:
  • the spectrum fluctuation of the audio frame is stored in the spectrum fluctuation memory.
  • Updating the spectrum fluctuations stored in the spectrum fluctuation memory according to whether the current audio frame is percussive music includes:
  • the value of the stored spectrum fluctuations in the spectrum fluctuation memory is modified.
  • Updating the spectrum fluctuations stored in the spectrum fluctuation memory according to the activity of the historical audio frame includes:
  • the data of the spectrum fluctuations stored in the spectrum fluctuation memory, other than the spectrum fluctuation of the current audio frame, is modified to invalid data;
  • the spectrum fluctuation of the current audio frame is stored in the spectrum fluctuation memory, and the consecutive three frames of historical frames before the current audio frame are not all active frames, the spectrum fluctuation of the current audio frame is corrected to the first value;
  • the spectrum fluctuation of the current audio frame is stored in the spectrum fluctuation memory, and if the historical classification result is a music signal and the spectrum fluctuation of the current audio frame is greater than the second value, the spectrum fluctuation of the current audio frame is corrected to the second value, where the second value is greater than the first value.
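A minimal sketch of the two corrections described above. The numeric first and second values are hypothetical, since the text gives no concrete values:

```python
FIRST_VALUE = 5.0    # hypothetical reset value for the first correction
SECOND_VALUE = 10.0  # hypothetical clip value; must exceed FIRST_VALUE

def update_flux_buffer(flux_buf, cur_flux, history_active, history_is_music):
    """Store the current frame's spectrum fluctuation with the two
    corrections described above (illustrative values only).

    flux_buf          -- list of stored spectrum-fluctuation values
    cur_flux          -- spectrum fluctuation of the current (active) frame
    history_active    -- activity flags of the three preceding frames
    history_is_music  -- True if the historical classification is music
    """
    # Correction 1: if the three consecutive historical frames before the
    # current frame are not all active, force the value to the first value.
    if not all(history_active):
        cur_flux = FIRST_VALUE
    # Correction 2: if history classifies as music and the fluctuation is
    # large, clip it to the second value (second value > first value).
    elif history_is_music and cur_flux > SECOND_VALUE:
        cur_flux = SECOND_VALUE
    flux_buf.append(cur_flux)
    return flux_buf
```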
  • Classifying the current audio frame into a voice frame or a music frame according to a statistic of part or all of the valid data of the spectrum fluctuation stored in the spectrum fluctuation memory includes:
  • the audio signal classification method further includes:
  • the spectral high-band kurtosis represents the kurtosis or energy sharpness of the spectrum of the current audio frame on the high frequency band
  • the spectral correlation indicates the stability of the signal harmonic structure of the current audio frame between adjacent frames
  • the linear prediction residual energy slope indicates the degree to which the linear prediction residual energy of the audio signal varies with the linear prediction order.
  • classifying the audio frames according to the statistics of part or all of the valid data of the spectrum fluctuation stored in the spectrum fluctuation memory includes:
  • the mean of the stored spectrum fluctuation valid data is less than a first threshold; or the mean of the spectral high-band kurtosis valid data is greater than a second threshold; or the mean of the spectral correlation valid data is greater than a third threshold; or the variance of the linear prediction residual energy tilt valid data is less than a fourth threshold.
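The four-condition decision can be sketched as follows; the frame is classified as music if any one condition holds. The threshold values are hypothetical, since the text does not specify them:

```python
import numpy as np

# Hypothetical thresholds; the text does not give numeric values.
T1, T2, T3, T4 = 8.0, 5.0, 0.7, 0.05

def classify(flux, high_band_kurtosis, correlation, epsp_tilt):
    """Classify from the valid data of the four stored features.

    Each argument is a sequence of the valid (stored) values of one
    feature; the frame is a music frame if any one condition holds.
    """
    if (np.mean(flux) < T1                       # low spectrum fluctuation
            or np.mean(high_band_kurtosis) > T2  # sharp high-band spectrum
            or np.mean(correlation) > T3         # stable harmonic structure
            or np.var(epsp_tilt) < T4):          # stable residual energy tilt
        return 'music'
    return 'speech'
```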
  • an apparatus for classifying an audio signal for classifying an input audio signal includes: a storage confirmation unit, configured to determine, according to the sound activity of the current audio frame, whether to obtain and store a spectrum fluctuation of a current audio frame, where the spectrum fluctuation represents an energy fluctuation of a spectrum of the audio signal;
  • a memory configured to store the spectrum fluctuation when the storage confirmation unit outputs a result that needs to be stored; and an updating unit, configured to update a spectrum fluctuation stored in the memory according to whether the voice frame is the activity of tapping the music or the historical audio frame;
  • a classifying unit configured to classify the current audio frame as a voice frame or a music frame according to a measure of part or all of the valid data of the spectrum fluctuation stored in the memory.
  • the storage confirmation unit is specifically configured to: when confirming that the current audio frame is an active frame, output the result that the spectrum fluctuation of the current audio frame needs to be stored.
  • the storage confirmation unit is specifically configured to: when the current audio frame is an active frame and the current audio frame does not belong to an energy impact, output the result that the spectrum fluctuation of the current audio frame needs to be stored.
  • the storage confirmation unit is specifically configured to: when confirming that the current audio frame is an active frame and that none of multiple consecutive frames including the current audio frame and its historical frames belongs to an energy impact, output the result that the spectrum fluctuation of the current audio frame needs to be stored.
  • the update unit is specifically configured to modify the values of the stored spectrum fluctuations in the spectrum fluctuation memory if the current audio frame belongs to percussive music.
  • the updating unit is specifically configured to: if the current audio frame is an active frame and the previous frame is an inactive frame, modify the data of the spectrum fluctuations stored in the memory, other than the spectrum fluctuation of the current audio frame, to invalid data; or
  • the spectrum fluctuation of the current audio frame is corrected to the first value
  • the spectral fluctuation of the current audio frame is corrected to a second value, wherein the second value is greater than the first value.
  • the classification unit includes:
  • a calculation unit configured to obtain an average value of part or all of the valid data of the spectrum fluctuation stored in the memory
  • a judging unit configured to compare the average value of the valid data of the spectrum fluctuation with a music classification condition, classify the current audio frame as a music frame when the average value satisfies the music classification condition, and otherwise classify the current audio frame as a speech frame.
  • the audio signal classification device further includes:
  • a parameter obtaining unit configured to obtain the spectral high-band kurtosis, spectral correlation, voicing parameter, and linear prediction residual energy tilt of the current audio frame; wherein the spectral high-band kurtosis represents the kurtosis or energy sharpness of the spectrum of the current audio frame on the high frequency band; the spectral correlation indicates the stability of the signal harmonic structure of the current audio frame between adjacent frames; the voicing parameter indicates the time-domain correlation of the current audio frame with the signal one pitch period earlier; and the linear prediction residual energy tilt indicates the degree to which the linear prediction residual energy of the audio signal varies as the linear prediction order increases;
  • the storage confirmation unit is further configured to: determine, according to the sound activity of the current audio frame, whether to store the spectral high-band kurtosis, the spectral correlation, and the linear prediction residual energy gradient in a memory;
  • the storage unit is further configured to store the spectral high-band kurtosis, the spectral correlation degree, and the linear prediction residual energy tilt when the storage confirmation unit outputs a result that needs to be stored;
  • the classifying unit is specifically configured to obtain the statistics of the valid data among the stored spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy tilt, respectively, and to classify the audio frame as a speech frame or a music frame according to the statistics of the valid data.
  • the classification unit includes:
  • a calculating unit configured to respectively obtain a mean value of the stored spectrum fluctuation effective data, an average value of the spectrum high-band kurtosis effective data, a mean value of the spectral correlation effective data, and a variance of the linear prediction residual energy inclination effective data;
  • a determining unit configured to classify the current audio frame as a music frame when one of the following conditions is satisfied, and otherwise classify the current audio frame as a speech frame: the mean of the spectrum fluctuation valid data is less than a first threshold; or the mean of the spectral high-band kurtosis valid data is greater than a second threshold; or the mean of the spectral correlation valid data is greater than a third threshold; or the variance of the linear prediction residual energy tilt valid data is less than a fourth threshold.
  • an audio signal classification method including:
  • the audio frames are classified according to a statistic of part of the data of the prediction residual energy tilt in the memory.
  • Before the linear prediction residual energy tilt is stored in the memory, the method further includes:
  • the statistic of the prediction residual energy gradient partial data is the variance of the prediction residual energy gradient partial data.
  • the classification of audio frames includes:
  • the audio signal classification method further includes:
  • the classifying the audio frame according to the statistic of the prediction residual energy slope partial data in the memory includes:
  • the statistic of the valid data refers to a data value obtained by performing an operation on the valid data stored in the memory.
  • obtaining the statistics of the valid data among the stored spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy tilt, respectively, and classifying the audio frame as a speech frame or a music frame according to the statistics of the valid data includes:
  • classifying the current audio frame as a music frame when one of the following conditions is satisfied, and as a speech frame otherwise: the mean of the stored spectrum fluctuation valid data is less than a first threshold; or the mean of the spectral high-band kurtosis valid data is greater than a second threshold; or the mean of the spectral correlation valid data is greater than a third threshold; or the variance of the linear prediction residual energy tilt valid data is less than a fourth threshold.
  • the audio signal classification method further includes:
  • the classifying the audio frame according to the statistic of the prediction residual energy slope partial data in the memory includes:
  • the statistic of the stored linear prediction residual energy tilt and the statistic of the number of spectral tones are obtained respectively, the statistic of the linear prediction residual energy tilt being its variance;
  • classifying the audio frame as a speech frame or a music frame according to the statistic of the linear prediction residual energy tilt, the statistic of the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band includes: when the current audio frame is an active frame, classifying the current audio frame as a music frame if one of the following conditions is met, and otherwise classifying it as a speech frame:
  • the variance of the linear prediction residual energy slope is less than a fifth threshold
  • the mean value of the number of spectral tones is greater than a sixth threshold
  • the ratio of the number of spectral tones on the low frequency band is less than the seventh threshold.
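A sketch of this active-frame decision; the fifth, sixth, and seventh threshold values are assumptions, since the text does not specify them:

```python
import numpy as np

# Hypothetical thresholds, for illustration only.
T5, T6, T7 = 0.03, 12.0, 0.5

def classify_by_tones(is_active, epsp_tilts, tone_counts, low_band_ratio):
    """Music/speech decision from the variance of the residual energy tilt,
    the mean spectral-tone count, and the low-band tone ratio."""
    if not is_active:
        return 'speech'          # only active frames are tested here
    if (np.var(epsp_tilts) < T5          # tilt variance below 5th threshold
            or np.mean(tone_counts) > T6 # mean tone count above 6th threshold
            or low_band_ratio < T7):     # low-band ratio below 7th threshold
        return 'music'
    return 'speech'
```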
  • obtaining the linear prediction residual energy tilt of the current audio frame includes:
  • epsP(i) represents the prediction residual energy of the i-th order linear prediction of the current audio frame
  • n is a positive integer representing the order of the linear prediction, which is less than or equal to the maximum order of the linear prediction.
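The formula itself is elided in this text. As an illustration only: epsP(i) can be obtained as the residual energy after each order of the Levinson-Durbin recursion, and the tilt measure below (a normalized correlation between energies of adjacent orders) is an assumption, not necessarily the patent's formula:

```python
import numpy as np

def prediction_residual_energies(frame, max_order):
    """Return epsP(1..max_order): the residual energy after linear
    prediction of each order, via the Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation r[0..max_order] of the frame.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(max_order + 1)])
    a = np.zeros(max_order + 1)
    a[0] = 1.0
    err = r[0]
    eps = []
    for i in range(1, max_order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err    # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)                   # residual energy shrinks
        eps.append(err)                        # epsP(i)
    return np.array(eps)

def epsp_tilt(eps):
    """One possible tilt measure (an assumption): normalized correlation
    between epsP(i) and epsP(i+1), reflecting how the residual energy
    changes as the prediction order increases."""
    return float(np.dot(eps[:-1], eps[1:]) / np.dot(eps[:-1], eps[:-1]))
```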
  • obtaining the number of spectral tones of the current audio frame and the ratio of the number of spectral tones on the low frequency band includes:
  • a signal classification apparatus for classifying an input audio signal, the apparatus comprising:
  • a framing unit for performing framing processing on the input audio signal
  • a parameter obtaining unit configured to obtain a linear prediction residual energy tilt of the current audio frame;
  • the linear prediction residual energy tilt represents a degree of change of the linear prediction residual energy of the audio signal as the linear prediction order increases;
  • a storage unit configured to store a linear prediction residual energy tilt
  • a classification unit configured to classify the audio frame according to a statistic of part of the data of the prediction residual energy tilt in the memory.
  • the signal classification device further includes:
  • a storage confirmation unit configured to determine, according to the sound activity of the current audio frame, whether to store the linear prediction residual energy tilt in a memory
  • the storage unit is specifically configured to: when the storage confirmation unit determines that storage is needed, store the linear prediction residual energy tilt in the memory.
  • the statistic of the prediction residual energy gradient partial data is the variance of the prediction residual energy gradient partial data.
  • the classifying unit is specifically configured to compare a variance of the prediction residual energy gradient partial data with a music classification threshold, and when the variance of the prediction residual energy gradient partial data is smaller than a music classification threshold, the current audio frame Classified as a music frame; otherwise the current audio frame is classified as a speech frame.
  • the parameter obtaining unit is further configured to: obtain the spectrum fluctuation, spectral high-band kurtosis, and spectral correlation of the current audio frame, and store them in the corresponding memory;
  • the classifying unit is specifically configured to: obtain the statistics of the valid data among the stored spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy tilt, respectively, and classify the audio frame as a speech frame or a music frame according to the statistics of the valid data; the statistic of the valid data refers to a data value obtained by performing an operation on the valid data stored in the memory.
  • the classification unit includes:
  • a calculating unit configured to respectively obtain a mean value of the stored spectrum fluctuation effective data, an average value of the spectrum high-band kurtosis effective data, a mean value of the spectral correlation effective data, and a variance of the linear prediction residual energy inclination effective data;
  • a determining unit configured to classify the current audio frame as a music frame when one of the following conditions is satisfied, and otherwise classify the current audio frame into a voice frame: the average value of the spectrum fluctuation effective data is less than a first threshold; or The mean value of the spectral high-band kurtosis effective data is greater than a second threshold; or the mean of the spectral correlation effective data is greater than a third threshold; or the variance of the linear prediction residual energy gradient effective data is less than a fourth threshold.
  • the parameter obtaining unit is further configured to: obtain the number of spectral tones of the current audio frame and the ratio of the number of spectral tones on the low frequency band, and store them in the memory;
  • the classifying unit is specifically configured to: obtain the statistic of the stored linear prediction residual energy tilt and the statistic of the number of spectral tones, respectively, and classify the audio frame according to the statistic of the linear prediction residual energy tilt, the statistic of the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band;
  • the statistic of the valid data refers to a data value obtained by performing an operation on the data stored in the memory.
  • With reference to the fifth possible implementation manner of the fourth aspect, in a sixth possible implementation manner:
  • a calculating unit configured to obtain the variance of the linear prediction residual energy tilt valid data and the mean of the stored numbers of spectral tones;
  • a determining unit configured to classify the current audio frame as a music frame when the current audio frame is an active frame and one of the following conditions is satisfied, and otherwise classify the current audio frame as a speech frame: the variance of the linear prediction residual energy tilt is less than a fifth threshold; or the mean of the number of spectral tones is greater than a sixth threshold; or the ratio of the number of spectral tones on the low frequency band is less than a seventh threshold.
  • the parameter obtaining unit calculates the linear prediction residual energy tilt of the current audio frame according to the following formula:
  • the parameter obtaining unit is configured to count the number of frequency bins in the 0-8 kHz band of the current audio frame whose frequency peaks are greater than a predetermined value, as the number of spectral tones; and to calculate the ratio of the number of frequency bins whose frequency peaks are greater than a predetermined value in the 0-4 kHz band of the current audio frame to that in the 0-8 kHz band, as the ratio of the number of spectral tones on the low frequency band.
  • The embodiments of the present invention classify the audio signal according to long-term statistics of the spectrum fluctuation, with fewer parameters, a higher recognition rate, and lower complexity; moreover, the spectrum fluctuation is adjusted in consideration of sound activity and percussive music, giving a higher recognition rate for music signals and making the method suitable for classification of mixed audio signals.
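A sketch of the tone counting just described. The peak threshold, the local-peak criterion, and the FFT-based spectrum are illustrative assumptions; only the band limits (0-8 kHz and 0-4 kHz) come from the text:

```python
import numpy as np

PEAK_THRESHOLD = 50.0   # hypothetical power threshold for a "tonal" bin

def spectral_tone_stats(frame, fs=16000):
    """Count tonal bins in the 0-8 kHz band and return that count together
    with the ratio of tonal bins in 0-4 kHz to those in 0-8 kHz."""
    spec = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)  # bin frequencies in Hz
    # A bin counts as a "tone" if it is a local peak above the threshold.
    is_peak = np.zeros(len(spec), dtype=bool)
    is_peak[1:-1] = ((spec[1:-1] > spec[:-2])
                     & (spec[1:-1] > spec[2:])
                     & (spec[1:-1] > PEAK_THRESHOLD))
    n_tones_8k = int(np.sum(is_peak & (freqs < 8000)))
    n_tones_4k = int(np.sum(is_peak & (freqs < 4000)))
    ratio_low = n_tones_4k / n_tones_8k if n_tones_8k else 0.0
    return n_tones_8k, ratio_low
```

For a single 1 kHz sinusoid, all tonal energy lies below 4 kHz, so the low-band ratio is 1.0.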
  • FIG. 1 is a schematic diagram of framing an audio signal
  • FIG. 2 is a schematic flowchart of an embodiment of an audio signal classification method according to the present invention
  • FIG. 3 is a schematic flowchart of an embodiment of obtaining spectrum fluctuation according to the present invention
  • FIG. 4 is a schematic flowchart of another embodiment of an audio signal classification method according to the present invention
  • FIG. 5 is a schematic flowchart of another embodiment of an audio signal classification method according to the present invention
  • FIG. 7 to FIG. 10 are specific classification flowcharts of the audio signal classification provided by the present invention
  • FIG. 11 is a flowchart of another embodiment of the audio signal classification method provided by the present invention.
  • FIG. 12 is a specific classification flowchart of audio signal classification provided by the present invention.
  • FIG. 13 is a schematic structural diagram of an embodiment of an apparatus for classifying an audio signal according to the present invention
  • FIG. 14 is a schematic structural diagram of an embodiment of a classification unit provided by the present invention
  • FIG. 15 is a schematic structural diagram of another embodiment of an apparatus for classifying an audio signal according to the present invention
  • FIG. 16 is a schematic structural diagram of another embodiment of an apparatus for classifying an audio signal according to the present invention
  • FIG. 17 is a schematic structural diagram of an embodiment of a classification unit provided by the present invention
  • FIG. 18 is a schematic structural diagram of another embodiment of an apparatus for classifying an audio signal according to the present invention
  • FIG. 19 is a schematic structural diagram of another embodiment of an apparatus for classifying an audio signal according to the present invention.
  • audio codecs and video codecs are widely used in various electronic devices, such as mobile phones, wireless devices, personal data assistants (PDAs), handheld or portable computers, GPS receivers/navigators, cameras, audio/video players, camcorders, video recorders, and surveillance equipment.
  • an audio encoder or decoder can be implemented directly by a digital circuit or a chip such as a DSP (digital signal processor), or by a processor driven by software code that executes the process in the software code.
  • an audio signal is first classified, and different types of audio signals are encoded by different coding modes, and then the encoded code stream is transmitted to the decoding end.
  • the audio signal is processed in a framing manner, and each frame signal represents an audio signal of a certain duration.
  • the currently input audio frame to be classified may be referred to as the current audio frame; any audio frame before the current audio frame may be referred to as a historical audio frame; in time order from the current audio frame backwards, the historical audio frames may in turn be referred to as the previous audio frame, the second previous audio frame, the third previous audio frame, ..., and the Nth previous audio frame, where N is greater than or equal to 4.
  • in one embodiment, the input audio signal is a wideband audio signal sampled at 16 kHz, and the audio signal is framed at 20 ms, that is, 320 time-domain samples per frame. Before the feature parameters are extracted, the input audio signal frame is first downsampled to a 12.8 kHz sampling rate, i.e. 256 samples per frame.
  • the input audio signal frames in the following text refer to the downsampled audio signal frames.
  • an embodiment of an audio signal classification method includes:
  • S101 Perform framing processing on the input audio signal, determine, according to the sound activity of the current audio frame, whether to obtain the spectrum fluctuation of the current audio frame and store it in the spectrum fluctuation memory, where the spectrum fluctuation represents the energy fluctuation of the spectrum of the audio signal;
  • the audio signal classification is generally performed in frames, and each audio signal frame extraction parameter is classified to determine whether the audio signal frame belongs to a speech frame or a music frame, and is encoded by using a corresponding coding mode.
  • the spectrum fluctuation of the current audio frame is obtained, and then according to the voice activity of the current audio frame, whether the spectrum fluctuation is stored in the spectrum fluctuation memory is determined;
  • in another embodiment, after the framing processing, it is determined according to the sound activity of the current audio frame whether the spectrum fluctuation is to be stored in the spectrum fluctuation memory, and the spectrum fluctuation is obtained and stored when storage is needed.
  • the spectral fluctuation flux represents the short-term or long-term energy fluctuation of the signal spectrum; it is the mean of the absolute values of the logarithmic energy differences between corresponding frequencies of the current audio frame and a historical frame on the medium and low frequency band spectrum, where the historical frame refers to any frame before the current audio frame.
  • one embodiment of obtaining spectral fluctuations includes the following steps:
  • the spectrum of the audio frame can be directly obtained.
  • the spectra, i.e. the energy spectra, of two subframes of the current audio frame are obtained, and the spectrum of the current audio frame is obtained as the average of the spectra of the two subframes;
  • the history frame refers to any frame of audio frames before the current audio frame; in one embodiment, it may be the third frame of audio frames before the current audio frame.
  • S1013 Calculate an average value of absolute values of logarithmic energy differences of corresponding frequencies of the current audio frame and the historical frame respectively in the middle and low frequency bands, as the spectrum fluctuation of the current audio frame.
  • the mean value of the absolute value of the difference between the logarithmic energy of all frequency points of the current audio frame on the medium and low frequency band spectrum and the logarithmic energy of the corresponding frequency point of the historical frame on the medium and low frequency band spectrum may be calculated; In one embodiment, the mean of the absolute value of the difference between the logarithmic energy of the spectral peak of the current audio frame on the mid-lowband spectrum and the logarithmic energy of the corresponding spectral peak of the historical frame on the mid-lowband spectrum may be calculated.
  • the medium and low frequency band spectrum covers, for example, 0 to fs/4 or 0 to fs/3.
  • in one embodiment, the input audio signal is framed at 20 ms; two 256-point FFTs are performed on the front and rear halves of each frame, with the two FFT windows overlapping by 50%, to obtain the spectra of the two subframes of the current audio frame; the FFT of the first subframe of the current audio frame requires the data of the second subframe of the previous frame.
  • in one embodiment, the spectral fluctuation flux of the current audio frame is the mean of the absolute values of the logarithmic energy differences between corresponding frequency bins of the current audio frame and the frame 60 ms earlier, on the medium and low frequency band spectrum; in another embodiment, an interval other than 60 ms may also be used. It can be calculated as:
  • flux = (1/N) · Σ_{i=0}^{N−1} |10·log10(C(i)) − 10·log10(C₋₃(i))|
  • where C(i) represents the spectrum of the current audio frame, C₋₃(i) represents the spectrum of the third historical frame before the current audio frame (that is, the frame 60 ms earlier when the frame length is 20 ms), and N is the number of frequency bins in the medium and low frequency band; in this document, X₋ₙ() denotes the parameter X of the nth historical frame before the current audio frame, and the subscript 0, for the current audio frame itself, may be omitted.
  • Log(.) represents the base 10 logarithm.
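As an illustrative sketch (not the patent's implementation), the flux definition above, the mean absolute difference of base-10 log energies between the current frame and a historical frame, could be computed as follows; the function name and the 1e-12 floor guarding log10(0) are assumptions:

```python
import numpy as np

def spectral_fluctuation(cur_spec, hist_spec):
    """Mean absolute difference of base-10 log energies between the
    current frame's energy spectrum and a historical frame's spectrum
    (e.g. the frame 60 ms earlier), over the medium/low-band bins
    passed in. The 1e-12 floor is an implementation assumption."""
    cur_db = 10.0 * np.log10(np.maximum(cur_spec, 1e-12))
    hist_db = 10.0 * np.log10(np.maximum(hist_spec, 1e-12))
    return float(np.mean(np.abs(cur_db - hist_db)))
```

Only the bins of the medium and low frequency band would be passed in; bin selection follows the text above.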
  • the spectral fluctuation flux of the current audio frame can also be obtained by the following method, that is, as the mean of the absolute values of the logarithmic energy differences between corresponding spectral peaks of the current audio frame and the frame 60 ms earlier, on the medium and low frequency band spectrum:
  • flux = (1/K) · Σ_{i=0}^{K−1} |10·log10(P(i)) − 10·log10(P₋₃(i))|
  • where P(i) represents the energy of the ith local peak of the spectrum of the current audio frame, P₋₃(i) represents the energy of the corresponding local peak in the third historical frame, and K is the number of local peaks considered.
  • the frequency at which the local peak is located is the frequency at which the energy on the spectrum is higher than the energy at the two adjacent frequencies.
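A minimal sketch of this local-peak definition (a bin whose energy exceeds both adjacent bins); the helper name is hypothetical:

```python
import numpy as np

def local_peak_bins(spec):
    """Bins i where C(i) > C(i-1) and C(i) > C(i+1), i.e. spectral
    energy above both neighbours; the first and last bins are excluded."""
    spec = np.asarray(spec, dtype=float)
    i = np.arange(1, len(spec) - 1)
    return i[(spec[i] > spec[i - 1]) & (spec[i] > spec[i + 1])]
```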
  • the spectral fluctuation of the audio frame is stored in the spectrum fluctuation memory; otherwise, it is not stored.
  • whether the spectral fluctuations are stored in the memory is determined based on the sound activity of the audio frame and whether the audio frame is an energy impact. If the audio activity parameter of the audio frame indicates that the audio frame is an active frame, and the parameter indicating whether the audio frame is an energy impact indicates that the audio frame is not an energy impact, the spectrum fluctuation of the audio frame is stored in the spectrum fluctuation memory; otherwise, the audio frame is not stored. In another embodiment, if the current audio frame is an active frame, and the plurality of consecutive frames including the current audio frame and its historical frame are not energy impact, the spectrum fluctuation of the audio frame is stored in the spectrum fluctuation memory; Otherwise it will not be stored. For example, if the current audio frame is an active frame, and the current audio frame, the previous frame audio frame, and the previous second frame audio frame are not energy impacts, the spectrum fluctuation of the audio frame is stored in the spectrum fluctuation memory; otherwise, it is not stored.
  • the voice activity flag vad_flag indicates whether the current input signal is an active foreground signal (speech, music, etc.) or a silent background signal of the foreground signal (such as background noise or mute), and is obtained by the voice activity detector VAD.
  • the sound impact flag attack_flag indicates whether the current audio frame belongs to an energy impact in music.
  • only in these cases is the spectrum fluctuation of the current audio frame stored; this reduces the false positive rate for inactive frames and improves the recognition rate of the audio classification.
  • attack_flag is set to 1, which means that the current audio frame is an energy impact in music:
  • the meaning of the above formula is: when the several historical frames before the current audio frame are mainly music frames, if the frame energy of the current audio frame jumps significantly relative to the first historical frame and relative to the average energy of the audio frames over a recent period of time, and the time envelope of the current audio frame is significantly larger than the average envelope of the audio frames over that period, the current audio frame is considered to be an energy impact in music.
  • the logarithmic frame energy etot is represented by the logarithmic total sub-band energy of the input audio frame, where lb(j) and hb(j) represent the low and high frequency boundaries, respectively, of the jth sub-band in the input audio frame spectrum, and C(i) represents the spectrum of the input audio frame.
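The logarithmic frame energy just described, 10·log10 of the total sub-band energy, can be sketched as follows; the `(lb, hb)` bin-pair input format and the 1e-12 floor are assumptions of this sketch:

```python
import numpy as np

def log_frame_energy(spec, band_edges):
    """Logarithmic total sub-band energy: sum the energy spectrum over
    every sub-band [lb(j), hb(j)] and take 10*log10 of the total.
    band_edges is a hypothetical list of (lb, hb) bin index pairs."""
    spec = np.asarray(spec, dtype=float)
    total = sum(float(np.sum(spec[lb:hb + 1])) for lb, hb in band_edges)
    return 10.0 * np.log10(max(total, 1e-12))  # floor guards log10(0)
```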
  • the long-term moving average of the time-domain maximum logarithmic sample amplitude of the current audio frame mov_log_max_spl is only updated in the active sound frame:
  • the spectrum fluctuation flux of the current audio frame is buffered in a flux history buffer; in this embodiment, the length of the flux history buffer is 60 (60 frames). The sound activity of the current audio frame and whether the audio frames are energy impacts are determined; when the current audio frame is a foreground signal frame and neither the current audio frame nor the previous two frames contains an energy impact belonging to music, the spectrum fluctuation flux of the current audio frame is stored in the memory. Before the flux of the current audio frame is buffered, it is checked whether the following condition is satisfied:
  • vad_flag = 1 and attack_flag = 0 and attack_flag₋₁ = 0 and attack_flag₋₂ = 0
  • if it is satisfied, the flux is buffered; otherwise it is not.
  • vad_flag indicates whether the current input signal is an active foreground signal or a silent background signal of the foreground signal
  • attack_flag indicates whether the current audio frame belongs to an energy impact in music
  • the meaning of the above formula is: The current audio frame is an active frame, and the current audio frame, the previous frame audio frame, and the previous second frame audio frame are not energy impacts.
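The buffering condition above can be sketched as a small predicate; the function name and the tuple argument layout are assumptions, not the patent's code:

```python
def should_buffer_flux(vad_flag, attack_flags):
    """Buffer the current frame's flux only when the frame is an active
    foreground frame (vad_flag == 1) and neither the current frame nor
    its two predecessors is flagged as a music energy impact.
    attack_flags is assumed to be (attack_flag, attack_flag_-1, attack_flag_-2)."""
    return vad_flag == 1 and not any(attack_flags)
```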
  • S102 Update the spectrum fluctuations stored in the spectrum fluctuation memory according to whether the audio frame is percussive music or according to the activity of historical audio frames;
  • in one embodiment, if the current audio frame belongs to percussive music, the effective spectrum fluctuation values in the spectrum fluctuation memory are modified to a value less than or equal to a music threshold, where an audio frame is classified as a music frame when its spectrum fluctuation is less than the music threshold.
  • in one embodiment, the effective spectrum fluctuation values are reset to 5. That is, when the percussive sound flag percus_flag is set to 1, all valid buffered data in the flux history buffer are reset to 5.
  • the effective buffer data is equivalent to the effective spectrum fluctuation value.
  • the spectrum fluctuation value of the music frame is low, and the spectrum fluctuation value of the speech frame is high.
  • modifying the effective spectrum fluctuation value to a value less than or equal to the music threshold can improve the probability that the audio frame is classified into a music frame, thereby improving the accuracy of the audio signal classification.
  • the spectrum fluctuations in the memory are updated according to the activity of the historical frames of the current audio frame. Specifically, in an embodiment, if it is determined that the spectrum fluctuation of the current audio frame is to be stored in the spectrum fluctuation memory and the previous audio frame is an inactive frame, all spectrum fluctuation data stored in the spectrum fluctuation memory other than the spectrum fluctuation of the current audio frame are changed to invalid data.
  • the audio frame of the previous frame is an inactive frame and the current audio frame is an active frame
  • the voice activity of the current audio frame differs from that of the historical frames, so invalidating the spectrum fluctuations of the historical frames reduces the influence of the historical frames on the audio classification, thereby improving the accuracy of the classification of the audio signal.
  • the spectrum fluctuation of the current audio frame is corrected to the first value.
  • the first value may be a speech threshold, wherein the audio is classified as a speech frame when the spectral fluctuation of the audio frame is greater than the speech threshold.
  • the spectrum fluctuation of the current audio frame is corrected to a second value, where the second value is greater than the first value.
  • mode_mov represents the long-term moving average of the historical final classification result in the signal classification
  • mode_mov > 0.9 indicates that the signal is in a music segment; the flux is then limited according to the historical classification results of the audio signal to reduce the probability that the flux exhibits a speech feature, in order to improve the stability of the classification decision.
  • the spectral fluctuation of the current audio frame can be modified to a speech (music) threshold or a value close to the speech (music) threshold.
  • the signal before the current signal is a voice (music) signal
  • the spectrum fluctuation of the current audio frame may be modified to a speech (music) threshold or a value close to the speech (music) threshold to improve the judgment classification. Stability.
  • the spectrum fluctuation may be limited, that is, the spectrum fluctuation of the current audio frame may be modified to be no more than a threshold to reduce the probability that the spectrum fluctuation is determined as a speech feature.
  • percus_flag indicates whether a percussive sound is present in the audio frame.
  • percus_flag is set to 1 to indicate that a percussive sound is detected, and to 0 to indicate that no percussive sound is detected.
  • the current signal ie, some of the most recent signal frames including the current audio frame and some of its historical frames
  • the current signal does not have significant voiced features
  • the current signal is considered to be percussive music; otherwise, if no sub-frame of the current signal has obvious voiced features, the current signal is likewise considered percussive music.
  • the percussive sound flag percus_flag is obtained by the following steps:
  • the log frame energy etot of the input audio frame is obtained, which is represented by the logarithm total subband energy of the input audio frame:
  • lb(j), hb(j) represent the low and high frequency boundaries, respectively, of the jth sub-band of the input frame spectrum; C(i) represents the spectrum of the input audio frame.
  • Percus_flag is set to 1 when the following conditions are met, otherwise 0 is set.
  • where voicing denotes the normalized open-loop pitch correlation of the two subframes of the current audio frame and of the first historical frame; the voiced parameter voicing is obtained by linear prediction analysis and represents the time-domain correlation between the current audio frame and the signal one pitch period earlier, with a value between 0 and 1; mode_mov represents the long-term moving average of the historical final classification results in the signal classification; log_max_spl₋₂ and mov_log_max_spl₋₂ represent the time-domain maximum logarithmic sample amplitude of the second historical frame and its long-term moving average; lp_speech is updated in each active voice frame.
  • the voiced parameter voicing, that is, the normalized open-loop pitch correlation, indicates the time-domain correlation between the current audio frame and the signal one pitch period earlier; it can be obtained from the open-loop pitch search of ACELP, and its value lies between 0 and 1.
  • it is not described in detail in the present invention because it belongs to the prior art.
  • a voicing value is calculated for each of the two subframes of the current audio frame, and the voicing parameter of the current audio frame is obtained from them.
  • the voicing parameter of the current audio frame is also cached in a voicing history buffer.
  • the voicing history buffer has a length of 10.
  • mode_mov = 0.95 · mode_mov₋₁ + 0.05 · mode, where mode is the classification result of the current input audio frame, a binary value: "0" for the speech category and "1" for the music category.
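The long-term moving average above is a simple exponential moving average; a sketch (the function name is an assumption):

```python
def update_mode_mov(mode_mov_prev, mode):
    """Exponential moving average of the final classification result:
    mode_mov = 0.95 * previous mode_mov + 0.05 * mode, where mode is
    the current frame's binary decision (0 = speech, 1 = music)."""
    return 0.95 * mode_mov_prev + 0.05 * mode
```

With the 0.95/0.05 weights, a long run of music frames (mode = 1) drives mode_mov toward 1, which is why mode_mov > 0.9 is read as "the signal is in a music segment".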
  • S103 classify the current audio frame into a voice frame or a music frame according to a statistic of part or all of data of the spectrum fluctuation stored in the spectrum fluctuation memory.
  • the statistic of the effective data of the spectrum fluctuation satisfies the voice classification condition
  • the current audio frame is classified into a voice frame
  • the statistic of the effective data of the spectrum fluctuation satisfies the music classification condition
  • the current audio frame is classified into music frame.
  • the statistics here are the values obtained by performing statistical operations on valid spectrum fluctuations (ie, valid data) stored in the spectrum fluctuation memory.
  • the statistical operation may be an average value or a variance.
  • the statistics in the examples below have similar meanings.
  • step S103 includes:
  • when the statistics of the valid spectrum fluctuation data satisfy the music classification condition, the current audio frame is classified into a music frame; otherwise, the current audio frame is classified into a speech frame.
  • the spectrum fluctuation value of the music frame is small, and the spectrum fluctuation value of the speech frame is large. Therefore, the current audio frame can be classified according to the spectrum fluctuation.
  • of course, other classification methods can also be used to classify the current audio frame. For example: the number of valid spectrum fluctuation data stored in the spectrum fluctuation memory is counted; according to this number, the memory is divided into intervals from the near end to the far end, starting at the storage location of the current frame's spectrum fluctuation, where the near end is the end storing the current frame's spectrum fluctuation and the far end is the end storing the historical frames' spectrum fluctuations; the audio frame is first classified according to the spectrum fluctuation statistics in the shortest interval; if the parameter statistics in that interval are sufficient to distinguish the type of the audio frame, the classification process ends, otherwise the classification continues in the shortest of the remaining longer intervals, and so on.
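A minimal sketch of this near-to-far tiered decision; the window sizes, thresholds, and fallback rule are hypothetical tuning choices, not the patent's values:

```python
import numpy as np

def classify_by_intervals(flux_buffer, window_sizes, thresholds):
    """Tiered decision over the flux history buffer (newest values at
    the end). The mean flux over the shortest window is tested first;
    if it is decisively below the music threshold or above the speech
    threshold for that window, classification stops, otherwise a longer
    window is tried. window_sizes and thresholds are assumed tables."""
    m = float(np.mean(flux_buffer))  # fallback statistic
    for n, (music_thr, speech_thr) in zip(window_sizes, thresholds):
        m = float(np.mean(flux_buffer[-n:]))  # near end = newest frames
        if m < music_thr:
            return "music"
        if m > speech_thr:
            return "speech"
    # no window was decisive: compare the widest mean to the midpoint
    return "music" if m < (music_thr + speech_thr) / 2.0 else "speech"
```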
  • in each interval, the current audio frame is classified as a speech frame or a music frame according to the classification threshold corresponding to that interval: when the statistics of the valid spectrum fluctuation data satisfy the speech classification condition, the current audio frame is classified into a speech frame; when the statistics of the valid spectrum fluctuation data satisfy the music classification condition, the current audio frame is classified into a music frame.
  • the speech signal is encoded using a speech generation model based encoder (e.g., CELP), and the music signal is encoded using a transform based encoder (e.g., an MDCT based encoder).
  • after step S102, the method further includes:
  • S104 Obtain a spectral high-band kurtosis, a spectral correlation, and a linear prediction residual energy gradient of the current audio frame, and store the spectral high-band kurtosis, spectral correlation, and linear prediction residual energy gradient in a memory.
  • the spectral high-band kurtosis indicates the kurtosis or energy sharpness of the current audio frame spectrum on the high frequency band;
  • the spectral correlation indicates the stability of the signal harmonic structure between adjacent frames;
  • the linear prediction residual energy tilt indicates the degree to which the linear prediction residual energy of the input audio signal varies as the linear prediction order increases;
  • the method before storing the parameters, further includes: determining, according to the sound activity of the current audio frame, whether to store the spectral high-band kurtosis, the spectral correlation, and the linear prediction residual energy gradient in the memory; If the current audio frame is an active frame, the above parameters are stored; otherwise, it is not stored.
  • the spectral high-band kurtosis represents the kurtosis or energy sharpness of the current audio frame spectrum over the high frequency band; in one embodiment, the spectral high-band kurtosis ph is calculated by the following formula:
  • where vl(i) and vr(i) respectively represent the nearest spectral local valley value v(n) on the low-frequency side and the high-frequency side of the ith frequency bin; a spectral local valley v(n) is a bin whose energy is lower than that of both adjacent bins, i.e. C(i) < C(i − 1) and C(i) < C(i + 1).
  • the spectral high-band kurtosis ph of the current audio frame is also buffered in a ph history buffer; in this embodiment, the length of the ph history buffer is 60.
  • lb(n), hb(n) respectively represent the endpoint positions of the nth spectral valley interval (i.e. the region between two adjacent valleys), that is, the positions of the two spectral valleys bounding the interval.
  • cor_map_sum = Σ_i cor(inv[lb(n) ≤ i, hb(n) ≥ i]), i.e. for each frequency bin i, the correlation value cor of the valley interval n containing i (the interval satisfying lb(n) ≤ i ≤ hb(n)) is accumulated.
  • the linear prediction residual energy tilt epsP_tilt represents the degree to which the linear prediction residual energy of the input audio signal varies with increasing linear prediction order. It can be calculated by the following formula:
  • step S103 can be replaced by the following steps:
  • S105 Obtain statistics of the valid data among the stored spectrum fluctuations, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy tilt, and classify the audio frame as a speech frame or a music frame according to the statistics of the valid data; the statistic of the valid data refers to a data value obtained by performing an operation, such as taking the mean or the variance, on the valid data stored in the memory.
  • the step includes:
  • the mean of the stored valid spectrum fluctuation data is less than a first threshold; or the mean of the valid spectral high-band kurtosis data is greater than a second threshold; or the mean of the valid spectral correlation data is greater than a third threshold; or the variance of the valid linear prediction residual energy tilt data is less than a fourth threshold.
  • generally, the spectrum fluctuation value of a music frame is small while that of a speech frame is large; the spectral high-band kurtosis of a music frame is large while that of a speech frame is small; the spectral correlation of a music frame is large while that of a speech frame is small; and the linear prediction residual energy tilt of a music frame varies little while that of a speech frame varies greatly. Therefore, the current audio frame can be classified according to the statistics of the above parameters. Of course, other classification methods can also be used to classify the current audio frame.
  • in each interval, the current audio frame is classified according to the classification threshold corresponding to that interval; the current audio frame is classified into a music frame when one of the following conditions is satisfied, and otherwise into a speech frame: the mean of the valid spectrum fluctuation data is less than the first threshold; or the mean of the valid spectral high-band kurtosis data is greater than the second threshold; or the mean of the valid spectral correlation data is greater than the third threshold; or the variance of the valid linear prediction residual energy tilt data is less than the fourth threshold.
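The four-way "music if any condition holds" rule can be sketched as below; the threshold values are placeholders, not the patent's tuning:

```python
import numpy as np

def classify_frame(flux, ph, cor, epsp_tilt,
                   thresholds=(5.0, 10.0, 0.5, 0.02)):
    """Classify as music if ANY long-term statistic crosses its
    threshold: mean(flux) < T1, mean(ph) > T2, mean(cor) > T3, or
    var(epsP_tilt) < T4; otherwise classify as speech. The inputs are
    the valid buffered values of each feature; T1..T4 are placeholders."""
    t1, t2, t3, t4 = thresholds
    if (np.mean(flux) < t1 or np.mean(ph) > t2
            or np.mean(cor) > t3 or np.var(epsp_tilt) < t4):
        return "music"
    return "speech"
```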
  • the speech signal is encoded using an encoder based on a speech generation model (e.g., CELP), and the music signal is encoded using a transform-based encoder (e.g., an MDCT-based encoder).
  • the audio signals are classified according to long-term statistics of the spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy tilt, so fewer parameters are needed, the recognition rate is higher, and the complexity is lower; at the same time, the spectrum fluctuation is adjusted in consideration of sound activity and percussive music, and corrected according to the signal environment of the current audio frame, which improves the classification recognition rate and is suitable for the classification of mixed audio signals.
  • another embodiment of the audio signal classification method includes:
  • S501 Perform framing processing on the input audio signal; audio signal classification is generally performed frame by frame, and parameters are extracted from each audio signal frame for classification, to determine whether the audio signal frame belongs to a speech frame or a music frame and to encode it with the corresponding coding mode.
  • S502 Obtain the linear prediction residual energy tilt of the current audio frame; the linear prediction residual energy tilt indicates the degree to which the linear prediction residual energy of the audio signal varies as the linear prediction order increases; in one embodiment, the linear prediction residual energy tilt epsP_tilt can be calculated by the following formula:
  • epsP_tilt = Σ_i epsP(i) · epsP(i + 1) / Σ_i epsP(i) · epsP(i)
  • where epsP(i) represents the prediction residual energy of the ith-order linear prediction.
  • the linear prediction residual energy slope can be stored in memory.
  • the memory can be a FIFO buffer having a length of 60 storage units (i.e. 60 linear prediction residual energy tilts can be stored).
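A sketch of the tilt computation and the FIFO memory described above; the exact range of prediction orders covered by the sums is an assumption of this sketch:

```python
import numpy as np
from collections import deque

def epsp_tilt(epsP):
    """Linear prediction residual energy tilt:
    sum_i epsP(i) * epsP(i+1) / sum_i epsP(i) * epsP(i), where epsP(i)
    is the residual energy of the i-th order linear prediction."""
    e = np.asarray(epsP, dtype=float)
    return float(np.dot(e[:-1], e[1:]) / np.dot(e[:-1], e[:-1]))

# FIFO memory of the 60 most recent tilt values, as described above;
# deque(maxlen=60) silently drops the oldest entry when full.
tilt_buffer = deque(maxlen=60)
```

The classification step would then compute the variance over `tilt_buffer` (or the valid part of it) rather than over a single frame.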
  • the method before storing the linear prediction residual energy slope, the method further includes: determining, according to the sound activity of the current audio frame, whether to store the linear prediction residual energy tilt in the memory; if the current audio frame is active Frame, then store the linear prediction residual energy slope; otherwise it is not stored.
  • S504 Classify the audio frame according to statistics of part of the data of the prediction residual energy tilt in the memory.
  • step S504 includes:
  • the linear prediction residual energy gradient value of the music frame changes little, while the linear prediction residual energy gradient value of the speech frame changes greatly. Therefore, the current audio frame can be classified according to the statistic of the linear prediction residual energy tilt.
  • of course, other classification methods combining other parameters can also be used to classify the current audio frame.
  • in another embodiment, before step S504, the method further includes: obtaining the spectrum fluctuation, spectral high-band kurtosis, and spectral correlation of the current audio frame, and storing them in the corresponding memories.
  • Step S504 is specifically as follows: respectively obtaining stored spectrum fluctuations, spectral high-band kurtosis, spectral correlation, and linear prediction residuals a statistic of valid data in the energy gradient, classifying the audio frame into a speech frame or a music frame according to the statistic of the valid data; the statistic of the valid data is obtained by performing an operation operation on the valid data stored in the memory Data value.
  • statistics of valid data in the stored spectrum fluctuation, the spectral high-band kurtosis, the spectral correlation, and the linear prediction residual energy gradient are respectively obtained, and the audio frame is classified according to the statistics of the valid data as Voice frames or music frames include:
  • the mean of the stored valid spectrum fluctuation data is less than a first threshold; or the mean of the valid spectral high-band kurtosis data is greater than a second threshold; or the mean of the valid spectral correlation data is greater than a third threshold; or the variance of the valid linear prediction residual energy tilt data is less than a fourth threshold.
  • generally, the spectrum fluctuation value of a music frame is small while that of a speech frame is large; the spectral high-band kurtosis of a music frame is large while that of a speech frame is small; the spectral correlation of a music frame is large while that of a speech frame is small; and the linear prediction residual energy tilt value of a music frame changes little while that of a speech frame changes greatly. Therefore, the current audio frame can be classified according to the statistics of the above parameters.
• the method, before step S504, further includes: obtaining the number of spectral tones of the current audio frame and the ratio of the number of spectral tones on the low frequency band, and storing them in the corresponding memories.
  • Step S504 is specifically as follows:
• respectively obtaining the statistic of the stored linear prediction residual energy gradient and the statistic of the number of spectral tones includes: obtaining the variance of the stored linear prediction residual energy gradient; and obtaining the mean of the stored number of spectral tones.
• classifying the audio frame into a voice frame or a music frame according to the statistic of the linear prediction residual energy gradient, the statistic of the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band includes:
• when the current audio frame is an active frame and one of the following conditions is met, classifying the current audio frame into a music frame, and otherwise classifying the current audio frame into a voice frame:
  • the variance of the linear prediction residual energy gradient is less than a fifth threshold; or the mean of the number of spectral tones is greater than a sixth threshold; or
  • the ratio of the number of spectral tones on the low frequency band is less than the seventh threshold.
• obtaining the number of spectral tones of the current audio frame and the ratio of the number of spectral tones on the low frequency band includes:
  • the predetermined value is 50.
• The number of spectral tones Ntonal indicates the number of frequency points whose peaks in the 0–8 kHz band of the current audio frame are greater than a predetermined value. In one embodiment, it can be obtained as follows: for the current audio frame, count the number of frequency points whose peak p2v_map(i) is greater than 50 in the 0–8 kHz band; this count is Ntonal, where p2v_map(i) represents the kurtosis of the i-th frequency point of the spectrum, and its calculation can refer to the description of the above embodiment.
• The ratio ratio_Ntonal_lf of the number of spectral tones on the low frequency band represents the ratio of the number of low-band spectral tones to the total number of spectral tones. In one embodiment, it can be obtained as follows: for the current audio frame, count the number of frequency points whose p2v_map(i) is greater than 50 in the 0–4 kHz band, denoted Ntonal_lf. ratio_Ntonal_lf is the ratio of Ntonal_lf to Ntonal, that is, Ntonal_lf/Ntonal, where p2v_map(i) represents the kurtosis of the i-th frequency point of the spectrum, and its calculation can refer to the description of the above embodiment.
• In another embodiment, the mean of the stored Ntonal values and the mean of the stored Ntonal_lf values are obtained respectively, and the ratio of the mean of Ntonal_lf to the mean of Ntonal is calculated as the ratio of the number of spectral tones on the low frequency band.
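The computation of Ntonal, Ntonal_lf, and their ratio can be sketched as follows. The 50 threshold and the 8 kHz / 4 kHz bands follow the description above; the function name and the representation of p2v_map as a flat list paired with per-bin frequencies are illustrative assumptions:

```python
def tone_counts(p2v_map, freqs, thresh=50.0):
    """Count spectral tones (0-8 kHz) and the low-band (0-4 kHz) tone ratio.

    p2v_map[i] is the kurtosis (peak sharpness) of the i-th frequency point
    and freqs[i] its frequency in Hz.
    """
    ntonal = sum(1 for p, f in zip(p2v_map, freqs) if f < 8000 and p > thresh)
    ntonal_lf = sum(1 for p, f in zip(p2v_map, freqs) if f < 4000 and p > thresh)
    ratio = ntonal_lf / ntonal if ntonal else 0.0  # ratio_Ntonal_lf
    return ntonal, ntonal_lf, ratio
```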
• The audio signal is classified according to the long-term statistics of the linear prediction residual energy gradient, taking into account both classification robustness and classification recognition speed; the classification parameters are fewer but the results are more accurate, with low complexity and low memory overhead.
  • another embodiment of the audio signal classification method includes:
  • S601 Perform framing processing on the input audio signal
  • S602 Obtain spectrum fluctuation of the current audio frame, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy tilt;
• The spectrum fluctuation flux represents the short-term or long-term energy fluctuation of the signal spectrum, and is the mean of the absolute values of the logarithmic energy differences between corresponding frequencies of the current audio frame and a historical frame on the medium-low frequency band spectrum, where a historical frame refers to any frame preceding the current audio frame.
  • the spectral high-band kurtosis ph represents the kurtosis or energy sharpness of the current audio frame spectrum over the high frequency band.
• The spectral correlation cor_map_sum indicates the stability of the signal harmonic structure between adjacent frames.
• The linear prediction residual energy tilt epsP_tilt measures the extent to which the linear prediction residual energy of the input audio signal changes as the linear prediction order increases.
  • the specific calculation method of these parameters refers to the previous embodiment.
• In addition, a voiced parameter can be obtained; the voiced parameter voicing represents the time domain correlation of the current audio frame with the signal one pitch period earlier.
• The voiced parameter voicing is obtained by linear prediction analysis; it represents the time domain correlation between the current audio frame and the signal one pitch period earlier, and its value is between 0 and 1.
• This is not described in detail in the present invention because it belongs to the prior art.
• In this embodiment, a voicing is calculated for each of the two subframes of the current audio frame, from which the voicing parameter of the current audio frame is obtained.
  • the voicing parameter of the current audio frame is also cached in a voicing history buffer.
  • the voicing history buffer has a length of 10.
• S603: store the spectrum fluctuation, the spectral high-band kurtosis, the spectral correlation, and the linear prediction residual energy gradient in the corresponding memories respectively;
• In one embodiment, the spectrum fluctuation of the current audio frame is stored in the spectrum fluctuation memory. In another embodiment, if the current audio frame is an active frame, and none of multiple consecutive frames including the current audio frame and its history frames is an energy impact, the spectrum fluctuation of the audio frame is stored in the spectrum fluctuation memory; otherwise, it is not stored. For example, if the current audio frame is an active frame, and neither the previous frame nor the second previous frame of the current audio frame is an energy impact, the spectrum fluctuation of the audio frame is stored in the spectrum fluctuation memory; otherwise, it is not stored.
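A minimal sketch of this storage decision, assuming the energy-impact status of the current frame and its two most recent history frames is available as booleans (newest last); the function name and argument layout are illustrative:

```python
def should_store_flux(is_active, impact_flags):
    """Decide whether to store the current frame's spectrum fluctuation.

    impact_flags: booleans, newest last, marking whether each of the
    current frame and its recent history frames was an energy impact.
    """
    if not is_active:
        return False  # inactive frames are never stored
    # Store only when none of the considered consecutive frames (here the
    # current frame and its two most recent history frames) is an impact.
    return not any(impact_flags[-3:])
```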
• S604: respectively obtain statistics of the valid data in the stored spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy gradient, and classify the audio frame into a voice frame or a music frame according to the statistics of the valid data; a statistic of the valid data refers to a data value obtained by performing a computation on the valid data stored in the memory, and the computation may include operations such as taking the mean or the variance.
• In one embodiment, if the current audio frame belongs to percussive music, the method may further include:
• modifying the effective spectrum fluctuation values in the spectrum fluctuation memory to a value less than or equal to a music threshold, wherein an audio frame is classified as a music frame when its spectrum fluctuation is less than the music threshold.
• In one embodiment, the effective spectrum fluctuation values in the spectrum fluctuation memory are reset to 5.
  • the method may further include:
  • the spectrum fluctuations in the memory are updated according to the activity of the history frame of the current audio frame.
• If the spectrum fluctuation of the current audio frame has been stored in the spectrum fluctuation memory and the previous audio frame is an inactive frame, the data of the spectrum fluctuations stored in the memory other than the spectrum fluctuation of the current audio frame is modified to invalid data.
• If the current audio frame is an active frame and the three consecutive frames before the current audio frame are not all active frames, the spectrum fluctuation of the current audio frame is corrected to a first value.
• The first value may be a speech threshold, wherein an audio frame is classified as a speech frame when its spectrum fluctuation is greater than the speech threshold.
• If the current audio frame is an active frame, the historical classification result is a music signal, and the spectrum fluctuation of the current audio frame is greater than a second value, the spectrum fluctuation of the current audio frame is corrected to the second value, wherein the second value is greater than the first value.
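The three update rules can be sketched as follows, under the assumption that the memory is a list with the newest value last and that invalid data is marked with a sentinel; the names and the sentinel convention are illustrative:

```python
INVALID = -1.0  # illustrative sentinel marking invalid data in the memory

def update_flux_memory(buf, cur_active, prev_active, last3_all_active,
                       history_is_music, cur_flux, first_value, second_value):
    """Apply the spectrum-fluctuation update rules of this embodiment.

    buf holds stored spectrum fluctuations, newest last (the current
    frame's value has just been appended).
    """
    if cur_active and not prev_active:
        # Rule 1: keep only the current frame's fluctuation as valid data.
        for i in range(len(buf) - 1):
            buf[i] = INVALID
    elif cur_active and not last3_all_active:
        # Rule 2: correct the current fluctuation to the first value.
        buf[-1] = first_value
    elif cur_active and history_is_music and cur_flux > second_value:
        # Rule 3: cap the current fluctuation at the second value.
        buf[-1] = second_value
    return buf
```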
• For the calculation of the long-term smoothing result of the active frames and of the historical signal classification result, refer to the foregoing embodiment.
• In one embodiment, step S604 includes: respectively obtaining the mean of the stored spectrum fluctuation effective data, the mean of the spectral high-band kurtosis effective data, the mean of the spectral correlation effective data, and the variance of the linear prediction residual energy tilt effective data, and classifying the current audio frame into a music frame when one of the following conditions is satisfied, and otherwise classifying the current audio frame into a voice frame:
• the mean of the spectrum fluctuation effective data is less than a first threshold; or the mean of the spectral high-band kurtosis effective data is greater than a second threshold; or the mean of the spectral correlation effective data is greater than a third threshold; or the variance of the linear prediction residual energy tilt effective data is less than a fourth threshold.
• The spectrum fluctuation value of a music frame is small while that of a speech frame is large; the spectral high-band kurtosis of a music frame is large while that of a speech frame is small; the spectral correlation value of a music frame is large while that of a speech frame is small; the linear prediction residual energy gradient of a music frame varies little while that of a speech frame varies greatly. Therefore, the current audio frame can be classified according to the statistics of the above parameters. Of course, other classification methods can also be used to classify the current audio frame.
• In one embodiment, the memory is divided into at least two intervals of different lengths from the near end to the far end, where the near end is the end storing the spectrum fluctuation of the current frame and the far end is the end storing the spectrum fluctuations of the historical frames. For each interval, the mean of the spectrum fluctuation effective data, the mean of the spectral high-band kurtosis effective data, the mean of the spectral correlation effective data, and the variance of the linear prediction residual energy tilt effective data are obtained. The audio frame is classified according to the statistics of the valid data of these parameters in the shortest interval; if the parameter statistics in that interval are sufficient to distinguish the type of the audio frame, the classification process ends; otherwise the classification continues in the shortest of the remaining longer intervals, and so on.
• In the classification process of each interval, the current audio frame is classified according to the classification thresholds corresponding to that interval, the current audio frame being classified into a music frame when one of the following conditions is satisfied, and otherwise into a voice frame: the mean of the spectrum fluctuation effective data is less than the first threshold; or the mean of the spectral high-band kurtosis effective data is greater than the second threshold; or the mean of the spectral correlation effective data is greater than the third threshold; or the variance of the linear prediction residual energy tilt effective data is less than the fourth threshold.
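An illustrative sketch of this interval-by-interval judgment, assuming newest-last history buffers and per-interval threshold tuples. The decisiveness rule here (stop as soon as a music condition fires in an interval, otherwise fall through to speech) is one possible reading of the embodiment, and all threshold values are placeholders:

```python
import statistics

def classify_by_intervals(flux, ph, cor, tilt, intervals, thresholds):
    """Classify over near-end intervals of increasing length.

    intervals: ascending list of interval lengths.
    thresholds[n]: (flux_thr, ph_thr, cor_thr, tilt_thr) for interval length n.
    """
    for n in intervals:
        flux_thr, ph_thr, cor_thr, tilt_thr = thresholds[n]
        if (statistics.mean(flux[-n:]) < flux_thr
                or statistics.mean(ph[-n:]) > ph_thr
                or statistics.mean(cor[-n:]) > cor_thr
                or statistics.variance(tilt[-n:]) < tilt_thr):
            return "music"   # this interval's statistics were decisive
    return "speech"          # no interval satisfied a music condition
```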
  • the speech signal is encoded using an encoder based on a speech generation model (e.g., CELP), and the music signal is encoded using a transform-based encoder (e.g., an MDCT-based encoder).
• Classification is based on the long-term statistics of spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy gradient, taking into account both classification robustness and classification recognition speed; the classification parameters are fewer but the results are more accurate, with a higher recognition rate and lower complexity.
• In one embodiment, the spectrum fluctuation flux, the spectral high-band kurtosis ph, the spectral correlation cor_map_sum, and the linear prediction residual energy tilt epsP_tilt are stored in the corresponding memories, and different judgment processes are used for classification according to the number of valid data of the stored spectrum fluctuations. If the voice activity flag is set to 1, that is, the current audio frame is an active sound frame, the number N of valid data of the stored spectrum fluctuations is checked.
• The mean values of the near-end N data in the ph history buffer and the cor_map_sum history buffer are obtained and recorded as phN and cor_map_sumN respectively, and the variance of the near-end N data in the epsP_tilt history buffer is recorded as epsP_tiltN.
• The number of data greater than 0.9 among the near-end 6 data in the voicing history buffer is also obtained and recorded as voicing_cnt6.
  • the classification result of the previous audio frame is used as the classification type of the current audio frame.
• The above embodiment is a specific classification process based on the long-term statistics of spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy gradient. Those skilled in the art can understand that another process may be used for the classification.
  • the classification process in this embodiment can be applied to the corresponding steps in the foregoing embodiment, for example, as the specific classification method of step 103 of Fig. 2, step 105 of Fig. 4, or step 604 of Fig. 6.
  • another embodiment of an audio signal classification method includes:
• S1102: obtain the linear prediction residual energy tilt of the current audio frame, the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band;
• The linear prediction residual energy tilt epsP_tilt represents the degree to which the linear prediction residual energy of the input audio signal changes with the linear prediction order; the number of spectral tones Ntonal represents the number of frequency points whose peaks in the 0–8 kHz band of the current audio frame are greater than a predetermined value; the ratio of the number of spectral tones on the low frequency band ratio_Ntonal_lf represents the ratio of the number of low-band spectral tones to the total number of spectral tones.
  • the specific calculation is referred to the description of the foregoing embodiment.
• S1103: respectively store the linear prediction residual energy tilt epsP_tilt, the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band into the corresponding memories; the linear prediction residual energy tilt epsP_tilt and the number of spectral tones of the current audio frame are cached into their respective history buffers.
• In one embodiment, the lengths of both buffers are also 60.
• In one embodiment, before storing the parameters, the method further includes: determining, according to the voice activity of the current audio frame, whether to store the linear prediction residual energy gradient, the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band in the memory, and storing them when it is determined that storage is required. If the current audio frame is an active frame, the above parameters are stored; otherwise, they are not stored.
• S1104: respectively obtain the statistic of the stored linear prediction residual energy gradient and the statistic of the number of spectral tones; a statistic refers to a data value obtained by performing a computation on the data stored in the memory, and the computation may include operations such as taking the mean or the variance.
• respectively obtaining the statistic of the stored linear prediction residual energy gradient and the statistic of the number of spectral tones includes: obtaining the variance of the stored linear prediction residual energy gradient; and obtaining the mean of the stored number of spectral tones.
  • S1105 classify the audio frame into a voice frame or a music frame according to a statistic of the linear prediction residual energy gradient, a statistic of the number of spectral tones, and a ratio of the number of spectral tones on the low frequency band;
• The step includes: when the current audio frame is an active frame and one of the following conditions is satisfied, classifying the current audio frame into a music frame, and otherwise classifying the current audio frame into a voice frame: the variance of the linear prediction residual energy gradient is less than a fifth threshold; or the mean of the number of spectral tones is greater than a sixth threshold; or the ratio of the number of spectral tones on the low frequency band is less than a seventh threshold.
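This second decision rule can be sketched as below; the function name and the three threshold values are placeholders for illustration, not values from the embodiment:

```python
import statistics

def classify_frame_v2(is_active, tilt_buf, ntonal_buf, ratio_lf,
                      thr5=0.002, thr6=12.0, thr7=0.35):
    """Classify from residual-energy-tilt variance, mean tone count,
    and the low-band tone ratio; only active frames can be music."""
    if is_active and (
        statistics.variance(tilt_buf) < thr5   # music: tilt varies little
        or statistics.mean(ntonal_buf) > thr6  # music: many spectral tones
        or ratio_lf < thr7                     # music: tones not concentrated in the low band
    ):
        return "music"
    return "speech"
```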
• The linear prediction residual energy gradient of a music frame varies little while that of a speech frame varies greatly; the number of spectral tones of a music frame is large while that of a speech frame is small; the ratio of the number of spectral tones on the low frequency band of a music frame is low while that of a speech frame is high (the energy of a speech frame is mainly concentrated on the low frequency band). Therefore, the current audio frame can be classified according to the statistics of the above parameters. Of course, other classification methods can also be used to classify the current audio frame.
  • the speech signal is encoded using an encoder based on a speech generation model (e.g., CELP), and the music signal is encoded using a transform-based encoder (e.g., an MDCT-based encoder).
• The audio signal is classified according to the long-term statistics of the linear prediction residual energy gradient, the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band; the parameters are few, the recognition rate is high, and the complexity is low.
• In one embodiment, the linear prediction residual energy tilt epsP_tilt, the number of spectral tones Ntonal, and the number of low-band spectral tones Ntonal_lf are stored in the corresponding buffers. The variance of all data in the epsP_tilt history buffer is obtained and recorded as epsP_tilt60; the mean of all data in the Ntonal history buffer is obtained and recorded as Ntonal60; the mean of all data in the Ntonal_lf history buffer is obtained, and the ratio of this mean to Ntonal60 is calculated and recorded as ratio_Ntonal_lf60.
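Computing these three long-term statistics over the history buffers can be sketched as follows (shown here on short buffers; the embodiment uses buffers of nominal length 60):

```python
import statistics

def long_term_stats(tilt_buf, ntonal_buf, ntonal_lf_buf):
    """Compute epsP_tilt60, Ntonal60, and ratio_Ntonal_lf60 over the
    full history buffers."""
    epsP_tilt60 = statistics.variance(tilt_buf)       # variance of all tilt data
    ntonal60 = statistics.mean(ntonal_buf)            # mean of all tone counts
    ratio_ntonal_lf60 = statistics.mean(ntonal_lf_buf) / ntonal60
    return epsP_tilt60, ntonal60, ratio_ntonal_lf60
```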
  • the current audio frame is classified according to the following rules:
• The above embodiment is a specific classification process based on the statistic of the linear prediction residual energy gradient, the statistic of the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band. Those skilled in the art can understand that another process may be used for the classification.
  • the classification process in this embodiment can be applied to the corresponding steps in the foregoing embodiment, for example, as the specific classification method of step 504 of Fig. 5 or step 1105 of Fig. 11.
• The invention provides an audio coding mode selection method with low complexity and low memory overhead that takes into account both the robustness of classification and the recognition speed of classification.
  • the present invention also provides an audio signal classification device, which may be located in a terminal device, or a network device. The audio signal classification device can perform the steps of the above method embodiments.
  • an embodiment of an apparatus for classifying an audio signal according to the present invention is used for classifying an input audio signal, which includes:
  • a storage confirmation unit 1301, configured to determine, according to the sound activity of the current audio frame, whether to obtain and store a spectrum fluctuation of a current audio frame, where the spectrum fluctuation represents an energy fluctuation of a spectrum of the audio signal;
• the memory 1302 is configured to store the spectrum fluctuation when the storage confirmation unit outputs a result that storage is required;
• the updating unit 1303 is configured to update the spectrum fluctuations stored in the memory according to whether the audio frame is percussive music or according to the activity of historical audio frames;
• the classifying unit 1304 is configured to classify the current audio frame into a speech frame or a music frame according to statistics of part or all of the valid data of the spectrum fluctuations stored in the memory.
• When the statistic of the effective data of the spectrum fluctuation satisfies the voice classification condition, the current audio frame is classified into a voice frame; when the statistic of the effective data of the spectrum fluctuation satisfies the music classification condition, the current audio frame is classified into a music frame.
  • the storage confirmation unit is specifically configured to: when confirming that the current audio frame is an active frame, outputting a result of storing spectrum fluctuations of the current audio frame.
• In another embodiment, the storage confirmation unit is specifically configured to: when the current audio frame is an active frame and the current audio frame does not belong to an energy impact, output the result that the spectrum fluctuation of the current audio frame needs to be stored.
• In another embodiment, the storage confirmation unit is specifically configured to: when the current audio frame is an active frame and none of multiple consecutive frames including the current audio frame and its historical frames is an energy impact, output the result that the spectrum fluctuation of the current audio frame needs to be stored.
• In one embodiment, the updating unit is specifically configured to: if the current audio frame belongs to percussive music, modify the values of the spectrum fluctuations stored in the spectrum fluctuation memory.
• In another embodiment, the updating unit is specifically configured to: if the current audio frame is an active frame and the previous audio frame is an inactive frame, modify the data of the spectrum fluctuations stored in the memory other than the spectrum fluctuation of the current audio frame to invalid data; or, if the current audio frame is an active frame and the three consecutive frames before the current audio frame are not all active frames, correct the spectrum fluctuation of the current audio frame to a first value; or, if the current audio frame is an active frame, the historical classification result is a music signal, and the spectrum fluctuation of the current audio frame is greater than a second value, correct the spectrum fluctuation of the current audio frame to the second value, where the second value is greater than the first value.
• In one embodiment, the classifying unit 1304 includes: a calculating unit 1401, configured to obtain the mean of part or all of the valid data of the spectrum fluctuations stored in the memory;
• the determining unit 1402 is configured to compare the mean of the valid data of the spectrum fluctuations with a music classification condition, and when the mean of the valid data of the spectrum fluctuations satisfies the music classification condition, classify the current audio frame as a music frame; otherwise the current audio frame is classified as a speech frame.
• The audio signal is classified according to the long-term statistics of the spectrum fluctuations; the parameters are few, the recognition rate is high, and the complexity is low. Moreover, the spectrum fluctuations are adjusted by considering the sound activity and the factor of percussive music, so the method has a higher recognition rate for music signals and is suitable for mixed audio signal classification.
  • the audio signal classification device further includes:
• a parameter obtaining unit configured to obtain the spectral high-band kurtosis, the spectral correlation, and the linear prediction residual energy gradient of the current audio frame; where the spectral high-band kurtosis indicates the kurtosis or energy sharpness of the spectrum of the current audio frame on the high frequency band; the spectral correlation indicates the stability of the signal harmonic structure of the current audio frame between adjacent frames; and the linear prediction residual energy gradient indicates the degree to which the linear prediction residual energy of the audio signal changes as the linear prediction order increases;
• the storage confirmation unit is further configured to: determine, according to the sound activity of the current audio frame, whether to store the spectral high-band kurtosis, the spectral correlation, and the linear prediction residual energy gradient;
• the storage unit is further configured to store the spectral high-band kurtosis, the spectral correlation, and the linear prediction residual energy tilt when the storage confirmation unit outputs a result that storage is required;
• the classification unit is specifically configured to respectively obtain statistics of the valid data in the stored spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy gradient, the audio frame being classified into a speech frame or a music frame according to the statistics of the valid data.
• When the statistic of the effective data of the spectrum fluctuation satisfies the voice classification condition, the current audio frame is classified into a voice frame; when the statistic of the effective data of the spectrum fluctuation satisfies the music classification condition, the current audio frame is classified into a music frame.
  • the classifying unit specifically includes:
• a calculating unit configured to respectively obtain the mean of the stored spectrum fluctuation effective data, the mean of the spectral high-band kurtosis effective data, the mean of the spectral correlation effective data, and the variance of the linear prediction residual energy tilt effective data;
  • a determining unit configured to classify the current audio frame as a music frame when one of the following conditions is satisfied, and otherwise classify the current audio frame into a voice frame: the average value of the spectrum fluctuation effective data is less than a first threshold; or The mean value of the spectral high-band kurtosis effective data is greater than a second threshold; or the mean of the spectral correlation effective data is greater than a third threshold; or the variance of the linear prediction residual energy gradient effective data is less than a fourth threshold.
• The audio signals are classified according to the long-term statistics of spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy gradient; the parameters are few, the recognition rate is high, and the complexity is low. At the same time, the spectrum fluctuations are adjusted in consideration of the sound activity and the factor of percussive music, and are corrected according to the signal environment of the current audio frame, which improves the classification recognition rate; the method is thus suitable for the classification of mixed audio signals.
• Referring to FIG. 15, another embodiment of an apparatus for classifying an audio signal according to the present invention is used for classifying an input audio signal, and includes:
  • a framing unit 1501 configured to perform framing processing on the input audio signal
  • the parameter obtaining unit 1502 is configured to obtain a linear prediction residual energy gradient of the current audio frame, where the linear prediction residual energy gradient indicates a degree of change of the linear prediction residual energy of the audio signal as the linear prediction order increases. ;
  • a storage unit 1503 configured to store a linear prediction residual energy tilt
• the classifying unit 1504 is configured to classify the audio frame according to a statistic of part of the data of the prediction residual energy tilt in the memory. Referring to FIG. 16, the audio signal classification apparatus further includes:
  • a storage confirmation unit 1505 configured to determine, according to the sound activity of the current audio frame, whether to store the linear prediction residual energy tilt in a memory
• the storage unit 1503 is specifically configured to store the linear prediction residual energy tilt in the memory when the storage confirmation unit confirms that storage is required.
• In one embodiment, the statistic of the partial data of the prediction residual energy tilt is the variance of that partial data;
• the classifying unit is specifically configured to compare the variance of the partial data of the prediction residual energy tilt with a music classification threshold, and when the variance is less than the music classification threshold, classify the current audio frame as a music frame; otherwise the current audio frame is classified as a speech frame.
  • the parameter obtaining unit is further configured to: obtain a spectrum fluctuation of the current audio frame, a spectral high-band kurtosis, and a spectral correlation, and store the data in a corresponding memory;
• the classification unit is specifically configured to: respectively obtain statistics of the valid data in the stored spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy gradient, and classify the audio frame into a speech frame or a music frame according to the statistics of the valid data; a statistic of the valid data refers to a data value obtained by performing a computation on the valid data stored in the memory.
  • the classification unit 1504 includes:
• the calculating unit 1701 is configured to respectively obtain the mean of the stored spectrum fluctuation effective data, the mean of the spectral high-band kurtosis effective data, the mean of the spectral correlation effective data, and the variance of the linear prediction residual energy tilt effective data;
  • the determining unit 1702 is configured to classify the current audio frame into a music frame when one of the following conditions is satisfied, and otherwise classify the current audio frame into a voice frame: the average value of the spectrum fluctuation effective data is less than a first threshold; Or the mean value of the spectral high-band kurtosis effective data is greater than the second threshold; or the mean of the spectral correlation effective data is greater than the third threshold; or the variance of the linear prediction residual energy gradient effective data is less than the fourth threshold.
• In one embodiment, the parameter obtaining unit is further configured to: obtain the number of spectral tones of the current audio frame and the ratio of the number of spectral tones on the low frequency band, and store them in a memory;
• the classification unit is specifically configured to: respectively obtain the statistic of the stored linear prediction residual energy gradient and the statistic of the number of spectral tones, and classify the audio frame into a voice frame or a music frame according to the statistic of the linear prediction residual energy gradient, the statistic of the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band;
• a statistic refers to a data value obtained by performing a computation on the data stored in the memory.
  • specifically, the classification unit includes:
  • a calculating unit configured to obtain the variance of the linear prediction residual energy gradient effective data and the mean of the stored numbers of spectral tones;
  • a determining unit configured to classify the current audio frame as a music frame when the current audio frame is an active frame and one of the following conditions is satisfied, and otherwise classify the current audio frame as a speech frame: the variance of the linear prediction residual energy tilt is less than a fifth threshold; or the mean of the number of spectral tones is greater than a sixth threshold; or the ratio of the number of spectral tones on the low frequency band is less than a seventh threshold.
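The tonality-based decision rule above can be sketched similarly. Again, the function name and threshold defaults are hypothetical illustrations, not values from this document.

```python
def classify_by_tonality(is_active, var_tilt, mean_ntonal, ratio_lf,
                         th5=0.05, th6=20.0, th7=0.2):
    """Music when the frame is active AND any one condition holds.

    In this sketch every other case (including inactive frames) falls
    through to speech; th5..th7 are the fifth to seventh thresholds.
    """
    if is_active and (var_tilt < th5        # stable residual-energy tilt
                      or mean_ntonal > th6  # many tonal peaks
                      or ratio_lf < th7):   # tones not concentrated low
        return "music"
    return "speech"
```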
  • the parameter obtaining unit calculates the linear prediction residual energy tilt of the current audio frame according to the following formula:
    epsP_tilt = Σ_{i=1}^{n} epsP(i)·epsP(i+1) / Σ_{i=1}^{n} epsP(i)·epsP(i)
  • where epsP(i) represents the prediction residual energy of the i-th order linear prediction of the current audio frame;
  • n is a positive integer representing the order of the linear prediction, which is less than or equal to the maximum order of the linear prediction.
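The tilt formula can be sketched in code as follows. This is an illustrative sketch: the function name and the 0-based list layout for `epsP` are assumptions, not part of this document.

```python
def eps_p_tilt(epsP, n):
    """Linear prediction residual energy tilt.

    Ratio of the cross term sum(epsP(i)*epsP(i+1)) to the energy term
    sum(epsP(i)^2) for i = 1..n.  epsP is 0-based here, so epsP[0]
    holds epsP(1) and the list must contain at least n + 1 values.
    """
    num = sum(epsP[i] * epsP[i + 1] for i in range(n))
    den = sum(epsP[i] * epsP[i] for i in range(n))
    return num / den
```

A residual energy that stays flat across orders gives a tilt of 1, while a residual that decays as the prediction order grows gives a tilt below 1; the variance of this tilt over time is what the classifier actually thresholds.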
  • the parameter obtaining unit is configured to count, as the number of spectral tones, the number of frequency bins of the current audio frame in the 0 to 8 kHz band whose peak values are greater than a predetermined value;
  • the parameter obtaining unit is configured to calculate the ratio of the number of frequency bins in the 0 to 4 kHz band whose peak values are greater than the predetermined value to the number of frequency bins in the 0 to 8 kHz band whose peak values are greater than the predetermined value, as the ratio of the number of spectral tones on the low frequency band.
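The two counting steps can be sketched together. The dictionary input mapping a bin's centre frequency to its local peak energy is an assumed representation for illustration, not the document's data layout.

```python
def tonal_counts(peak_energy, thresh):
    """Count tonal peaks and their low-band share.

    peak_energy maps frequency-bin centre (Hz) -> local peak energy.
    Returns (Ntonal, ratio_Ntonal_lf): the number of bins in 0-8 kHz
    whose peak exceeds `thresh`, and the share of those lying in 0-4 kHz.
    """
    n_total = sum(1 for f, p in peak_energy.items()
                  if 0 <= f < 8000 and p > thresh)
    n_low = sum(1 for f, p in peak_energy.items()
                if 0 <= f < 4000 and p > thresh)
    ratio = n_low / n_total if n_total else 0.0
    return n_total, ratio
```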
  • the audio signal is classified according to the long-term statistic of the linear prediction residual energy gradient, which takes both classification robustness and classification recognition speed into account; fewer classification parameters are used, yet the result is more accurate, with low complexity and low memory overhead.
  • Another embodiment of a classification device for an audio signal according to the present invention is for inputting an audio signal Classified, which includes:
  • a framing unit configured to perform framing processing on the input audio signal
  • a parameter obtaining unit configured to obtain the spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy tilt of the current audio frame; where the spectrum fluctuation represents the energy fluctuation of the spectrum of the audio signal;
  • the spectral high-band kurtosis indicates the kurtosis or energy sharpness of the spectrum of the current audio frame on the high frequency band;
  • the spectral correlation indicates the stability of the signal harmonic structure of the current audio frame between adjacent frames;
  • the linear prediction residual energy tilt represents the degree to which the linear prediction residual energy of the audio signal varies as the linear prediction order increases;
  • a storage unit for storing spectral fluctuations, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy tilt;
  • a classification unit configured to respectively obtain statistics of the effective data in the stored spectrum fluctuations, spectral high-band kurtosis, spectral correlations, and linear prediction residual energy gradients, and classify the audio frame as a speech frame or a music frame according to the statistics of the effective data; where the statistic of the effective data refers to a data value obtained after an arithmetic operation on the effective data stored in the memory, and the operation may include taking a mean, a variance, and the like.
  • the apparatus for classifying the audio signal may further include:
  • a storage confirmation unit configured to determine, according to the sound activity of the current audio frame, whether to store spectrum fluctuations of the current audio frame, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy gradient;
  • the storage unit is specifically configured to store the spectrum fluctuation, the spectral high-band kurtosis, the spectral correlation, and the linear prediction residual energy tilt when the storage confirmation unit outputs a result indicating that they need to be stored.
  • the storage confirmation unit determines, according to the sound activity of the current audio frame, whether to store the spectrum fluctuation in the spectrum fluctuation memory: if the current audio frame is an active frame, the storage confirmation unit outputs a result indicating that the above parameters need to be stored; otherwise, it outputs a result indicating that they do not need to be stored. In another embodiment, the storage confirmation unit determines whether to store the spectrum fluctuation in the memory based on the sound activity of the audio frame and on whether the audio frame is an energy impact.
  • In one embodiment, if the current audio frame is an active frame, the spectrum fluctuation of the current audio frame is stored in the spectrum fluctuation memory. In another embodiment, if the current audio frame is an active frame, and multiple consecutive frames including the current audio frame and its historical frames are not energy impacts, the spectrum fluctuation of the audio frame is stored in the spectrum fluctuation memory; otherwise, it is not stored. For example, if the current audio frame is an active frame, and the current audio frame, its previous frame, and its second historical frame are all not energy impacts, the spectrum fluctuation of the audio frame is stored in the spectrum fluctuation memory; otherwise, it is not stored.
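The storage decision in the example above amounts to a simple predicate. A minimal sketch, with assumed flag encodings (an active frame means `vad_flag != 0`, an energy impact means the corresponding `attack_flag == 1`); the function name is hypothetical.

```python
def should_store_flux(vad_flag, attack_flag, attack_flag_1, attack_flag_2):
    """Store the current frame's spectrum fluctuation only if the frame
    is active and neither it nor its two previous frames is an energy
    impact (attack_flag_1 / attack_flag_2 are the previous two frames)."""
    return (vad_flag != 0 and attack_flag != 1
            and attack_flag_1 != 1 and attack_flag_2 != 1)
```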
  • the classification unit comprises:
  • a calculating unit configured to respectively obtain a mean value of the stored spectrum fluctuation effective data, an average value of the spectrum high-band kurtosis effective data, a mean value of the spectral correlation effective data, and a variance of the linear prediction residual energy inclination effective data;
  • a determining unit configured to classify the current audio frame as a music frame when one of the following conditions is satisfied, and otherwise classify the current audio frame into a voice frame: the average value of the spectrum fluctuation effective data is less than a first threshold; or The mean value of the spectral high-band kurtosis effective data is greater than a second threshold; or the mean of the spectral correlation effective data is greater than a third threshold; or the variance of the linear prediction residual energy gradient effective data is less than a fourth threshold.
  • the apparatus for classifying the audio signal may further include:
  • an updating unit configured to update the spectrum fluctuations stored in the memory according to whether the audio frame belongs to percussive music or according to the activity of historical audio frames.
  • the updating unit is specifically configured to modify the values of the spectrum fluctuations stored in the spectrum fluctuation memory if the current audio frame belongs to percussive music.
  • the updating unit is specifically configured to: if the current audio frame is an active frame and the previous audio frame is an inactive frame, modify the data of the spectrum fluctuations stored in the memory, other than the spectrum fluctuation of the current audio frame, into invalid data; or, if the current audio frame is an active frame and the three consecutive frames before the current audio frame are not all active frames, correct the spectrum fluctuation of the current audio frame to a first value; or, if the current audio frame is an active frame, the historical classification result is a music signal, and the spectrum fluctuation of the current audio frame is greater than a second value, correct the spectrum fluctuation of the current audio frame to the second value, where the second value is greater than the first value.
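The three update rules can be sketched as one function operating on the fluctuation buffer. The sentinel value -1.0 for invalid data and all argument names are assumptions made for illustration; the text only requires that the second value be greater than the first.

```python
def update_flux(buf, is_active, prev_inactive, last3_all_active,
                history_is_music, first_value=16.0, second_value=20.0):
    """Apply the update rules to the spectrum-fluctuation buffer.

    buf[-1] is the current frame's just-stored fluctuation; earlier
    entries belong to historical frames.  -1.0 marks invalid data.
    """
    if not is_active:
        return buf
    if prev_inactive:
        # previous frame inactive: keep only the current frame's value
        for i in range(len(buf) - 1):
            buf[i] = -1.0
    elif not last3_all_active:
        # classifier is (re)initialising: pin to the first value
        buf[-1] = first_value
    elif history_is_music and buf[-1] > second_value:
        # history says music: cap speech-like large fluctuations
        buf[-1] = second_value
    return buf
```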
  • classification is based on long-term statistics of the spectrum fluctuation, spectral high-band kurtosis, spectral correlation, and linear prediction residual energy gradient, while taking into account both classification robustness and classification recognition speed; fewer classification parameters are used, yet the results are more accurate, with a higher recognition rate and lower complexity.
  • Another embodiment of the apparatus for classifying an audio signal according to the present invention is for classifying an input audio signal, which includes:
  • a framing unit for performing framing processing on the input audio signal
  • a parameter obtaining unit configured to obtain the linear prediction residual energy tilt, the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band of the current audio frame; where the linear prediction residual energy tilt epsP_tilt represents the degree to which the linear prediction residual energy of the input audio signal varies as the linear prediction order increases;
  • the number of spectral tones Ntonal represents the number of frequency bins in the 0-8 kHz band of the current audio frame whose peak values are greater than a predetermined value;
  • the ratio of the number of spectral tones on the low frequency band, ratio_Ntonal_lf, represents the ratio of the number of low-band tones to the total number of spectral tones.
  • for the specific calculation, refer to the description of the foregoing embodiments.
  • a storage unit configured to store a linear prediction residual energy gradient, a number of spectral tones, and a ratio of the number of spectral tones on the low frequency band
  • a classification unit configured to separately obtain a statistic of the stored linear prediction residual energy gradients and a statistic of the stored numbers of spectral tones, and classify the audio frame as a speech frame or a music frame according to the statistic of the linear prediction residual energy gradient, the statistic of the number of spectral tones, and the ratio of the number of spectral tones on the low frequency band;
  • the statistic refers to a data value obtained after an arithmetic operation on the data stored in the memory.
  • the classification unit includes:
  • a calculating unit configured to obtain the variance of the linear prediction residual energy gradient effective data and the mean of the stored numbers of spectral tones;
  • a determining unit configured to classify the current audio frame as a music frame when the current audio frame is an active frame and one of the following conditions is satisfied, and otherwise classify the current audio frame as a speech frame: the variance of the linear prediction residual energy tilt is less than a fifth threshold; or the mean of the number of spectral tones is greater than a sixth threshold; or the ratio of the number of spectral tones on the low frequency band is less than a seventh threshold.
  • the parameter obtaining unit calculates the linear prediction residual energy tilt of the current audio frame according to the following formula:
    epsP_tilt = Σ_{i=1}^{n} epsP(i)·epsP(i+1) / Σ_{i=1}^{n} epsP(i)·epsP(i)
  • where epsP(i) represents the prediction residual energy of the i-th order linear prediction of the current audio frame;
  • n is a positive integer representing the order of the linear prediction, which is less than or equal to the maximum order of the linear prediction.
  • the parameter obtaining unit is configured to count, as the number of spectral tones, the number of frequency bins of the current audio frame in the 0 to 8 kHz band whose peak values are greater than a predetermined value;
  • the parameter obtaining unit is configured to calculate the ratio of the number of frequency bins in the 0 to 4 kHz band whose peak values are greater than the predetermined value to the number of frequency bins in the 0 to 8 kHz band whose peak values are greater than the predetermined value, as the ratio of the number of spectral tones on the low frequency band.
  • the audio signal is classified according to the long-term statistics of the linear prediction residual energy gradient and of the number of spectral tones, together with the ratio of the number of spectral tones on the low frequency band; few parameters are needed, the recognition rate is high, and the complexity is low.
  • the above-mentioned audio signal classification device can be connected to different encoders, and different signals are encoded by different encoders.
  • the classification device of the audio signal is connected to two encoders respectively: the speech signal is encoded by an encoder based on a speech generation model (such as CELP), and the music signal is encoded by a transform-based encoder (such as an MDCT-based encoder).
  • the present invention also provides an audio signal classification device, which may be located in a terminal device, or a network device.
  • the audio signal classification device can be implemented by a hardware circuit or by software in conjunction with hardware.
  • the audio signal classification means is called by a processor to implement classification of the audio signal.
  • the audio signal classification device can perform various methods and processes in the above method embodiments. Specific modules and functions of the audio signal classification device can be referred to the related description of the above device embodiments.
  • An example of the device 1900 of Figure 19 is an encoder. The device 1900 includes a processor 1910 and a memory 1920.
  • The memory 1920 can include random access memory, flash memory, read-only memory, programmable read-only memory, nonvolatile memory, registers, and the like.
  • The processor 1910 can be a central processing unit (CPU).
  • The memory 1920 is for storing executable instructions.
  • The processor 1910 can execute the executable instructions stored in the memory 1920, for:
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
  • the division of the units is only a logical function division; in actual implementation, there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical, mechanical or otherwise.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.


Abstract

A method for classifying an audio signal, including: determining, according to the sound activity of a current audio frame, whether to obtain the spectrum fluctuation of the current audio frame and store it in a spectrum fluctuation memory (101); updating the spectrum fluctuations stored in the spectrum fluctuation memory according to whether the audio frame is percussive music or according to the activity of historical audio frames (102); and classifying the current audio frame as a speech frame or a music frame according to statistics of part or all of the valid data of the spectrum fluctuations stored in the spectrum fluctuation memory (103). An audio signal classification apparatus is also provided.

Description

Audio signal classification method and apparatus
本申请要求于 2013年 8月 6日提交中国专利局、申请号为 201310339218.5 , 发明名称为 "一种音频信号分类方法和装置" 的中国专利申请优先权, 上述专 利的全部内容通过引用结合在本申请中。 技术领域 本发明涉及数字信号处理技术领域, 尤其是一种音频信号分类方法和装置。 背景技术 为了降低视频信号存储或者传输过程中占用的资源, 音频信号在发送端进 行压缩处理后传输到接收端, 接收端通过解压缩处理恢复音频信号。 在音频处理应用中, 音频信号分类是一种应用广泛而重要的技术。 例如, 在音频编解码应用中, 目前比较流行的编解码器是一种混合编解码。 这种编解 码器通常包含了一个基于语音产生模型的编码器(如 CELP )和一个基于变换的 编码器(如基于 MDCT 的编码器)。 在中低码率下, 基于语音产生模型的编码器 可以获得较好的语音编码质量, 但对音乐的编码质量比较差, 而基于变换的编 码器能够获得较好的音乐编码质量, 对语音的编码质量又比较差。 因此, 混合 编解码器通过对语音信号采用基于语音产生模型的编码器进行编码, 对音乐信 号采用基于变换的编码器进行编码, 从而获得整体最佳的编码效果。 这里, 一 个核心的技术就是音频信号分类, 或具体到这个应用, 就是编码模式选择。 混合编解码器需要获得准确的信号类型信息, 才能获得最优的编码模式选 择。 这里的音频信号分类器也可以被大致认为是一种语音 /音乐分类器。 语音识 别率和音乐识别率是衡量语音 /音乐分类器性能的重要指标。 尤其对于音乐信 号, 由于其信号特征的多样 /复杂性,对音乐信号的识别通常较语音困难。此外, 识别延时也是非常重要的指标之一。 由于语音 /音乐特征在短时上的模糊性, 通 常需要在一段相对长的时间区间内才能够较准确的识别出语音 /音乐来。 一般来 说, 在同一类信号中段时, 识别延时越长, 识别越准确。 但在两类信号的过渡 段时, 识别延时越长, 识别准确率反而降低。 这在输入是混合信号 (如有背景 音乐的语音) 的情况下尤为严重。 因此, 同时兼具高识别率和低识别延时是一 个高性能语音 /音乐识别器的必要属性。 此外, 分类的稳定性也是影响到混合编 码器编码质量的重要属性。 一般来说, 混合编码器在不同类型编码器之间切换 时会产生质量下降。 如果分类器在同一类信号中发生频繁的类型切换, 对编码 质量的影响是比较大的, 这就要求分类器的输出分类结果要准确而平滑。 另夕卜, 在一些应用中, 如通信系统中的分类算法, 也要求其计算复杂度和存储开销要 尽可能的低, 以满足商业需求。
ITU-T标准 G.720.1包含有一个语音/音乐分类器。这个分类器以一个主参数, 即频谱波动方差 var_flux, 做为信号分类的主要依据, 并结合两个不同的频谱峰度参数 p1、p2, 做为辅助依据。根据 var_flux对输入信号的分类, 是通过在一个 FIFO的 var_flux buffer中, 根据 var_flux的局部统计量来完成的。具体过程概述如下。首先对每一输入音频帧提取频谱波动 flux, 并缓存在一个第一 buffer中, 这里的 flux是在包括当前输入帧在内的最新的 4帧中计算的, 也可以有其它的计算方法。然后, 计算包括当前输入帧在内的 N个最新帧的 flux的方差, 得到当前输入帧的 var_flux, 并缓存在第二 buffer中。然后, 统计第二 buffer中包括当前输入帧在内的 M个最新帧的 var_flux中大于第一门限值的帧的个数 K。如果 K与 M的比值大于一个第二门限值, 则判断当前输入帧为语音帧, 否则为音乐帧。辅助参数 p1、p2主要用于对分类的修正, 也是对每一输入音频帧计算的。当 p1和/或 p2大于某第三门限和/或第四门限时, 则直接判断当前输入音频帧为音乐帧。这个语音/音乐分类器的缺点是, 一方面对音乐的绝对识别率仍然有待提高, 另一方面, 由于该分类器的目标应用没有针对混合信号的应用场景, 所以对混合信号的识别性能也还有一定的提升空间。现有的语音/音乐分类器有很多都是基于模式识别原理设计的。这类分类器通常都是对输入音频帧提取多个特征参数 (十几到几十不等), 并将这些参数馈入一个或者基于高斯混合模型, 或者基于神经网络, 或者基于其它经典分类方法的分类器来进行分类的。这类分类器虽然有较高的理论基础, 但通常具有较高的计算或存储复杂度, 实现成本较高。 发明内容 本发明实施例的目的在于提供一种音频信号分类方法和装置, 在保证混合音频信号分类识别率的情况下, 降低信号分类的复杂度。 第一方面, 提供了一种音频信号分类方法, 包括:
根据当前音频帧的声音活动性, 确定是否获得当前音频帧的频谱波动并存 储于频谱波动存储器中, 其中, 所述频谱波动表示音频信号的频谱的能量波动; 根据音频帧是否为敲击音乐或历史音频帧的活动性, 更新频谱波动存储器 中存储的频谱波动;
根据频谱波动存储器中存储的频谱波动的部分或全部有效数据的统计量, 将所述当前音频帧分类为语音帧或者音乐帧。
在第一种可能的实现方式中, 根据当前音频帧的声音活动性, 确定是否获 得当前音频帧的频谱波动并存储于频谱波动存储器中包括:
若当前音频帧为活动帧, 则将当前音频帧的频谱波动存储于频谱波动存储 器中。
在第二种可能的实现方式中, 根据当前音频帧的声音活动性, 确定是否获 得当前音频帧的频谱波动并存储于频谱波动存储器中包括: 若当前音频帧为活动帧, 且当前音频帧不属于能量冲击, 则将当前音频帧 的频谱波动存储于频谱波动存储器中。
在第三种可能的实现方式中, 根据当前音频帧的声音活动性, 确定是否获 得当前音频帧的频谱波动并存储于频谱波动存储器中包括:
若当前音频帧为活动帧, 且包含当前音频帧与其历史帧在内的多个连续帧 都不属于能量冲击, 则将音频帧的频谱波动存储于频谱波动存储器中。
结合第一方面或第一方面的第一种可能的实现方式或第一方面的第二种可 能的实现方式或第一方面的第三种可能的实现方式, 在第四种可能的实现方式 中, 根据所述当前音频帧是否为敲击音乐, 更新频谱波动存储器中存储的频谱 波动包括:
若当前音频帧属于敲击音乐, 则修改频谱波动存储器中已存储的频谱波动 的值。
结合第一方面或第一方面的第一种可能的实现方式或第一方面的第二种可 能的实现方式或第一方面的第三种可能的实现方式, 在第五种可能的实现方式 中, 根据所述历史音频帧的活动性, 更新频谱波动存储器中存储的频谱波动包 括:
如果确定当前音频帧的频谱波动存储于频谱波动存储器中, 且前一帧音频 帧为非活动帧, 则将频谱波动存储器中已存储的除当前音频帧的频谱波动之外 的其他频谱波动的数据修改为无效数据;
如果确定当前音频帧的频谱波动存储于频谱波动存储器中, 且当前音频帧 之前连续三帧历史帧不全都为活动帧, 则将当前音频帧的频谱波动修正为第一 值;
如果确定当前音频帧的频谱波动存储于频谱波动存储器中, 且历史分类结 果为音乐信号且当前音频帧的频谱波动大于第二值, 则将当前音频帧的频谱波 动修正为第二值, 其中, 第二值大于第一值。
结合第一方面或第一方面的第一种可能的实现方式或第一方面的第二种可 能的实现方式或第一方面的第三种可能的实现方式或第一方面的第四种可能的 实现方式或第一方面的第五种可能的实现方式, 在第六种可能的实现方式中, 根据频谱波动存储器中存储的频谱波动的部分或全部有效数据的统计量, 将所 述当前音频帧分类为语音帧或者音乐帧包括:
获得频谱波动存储器中存储的频谱波动的部分或全部有效数据的均值; 当所获得的频谱波动的有效数据的均值满足音乐分类条件时, 将所述当前 音频帧分类为音乐帧; 否则将所述当前音频帧分类为语音帧。
结合第一方面或第一方面的第一种可能的实现方式或第一方面的第二种可 能的实现方式或第一方面的第三种可能的实现方式或第一方面的第四种可能的 实现方式或第一方面的第五种可能的实现方式, 在第七种可能的实现方式中, 该音频信号分类方法还包括:
获得当前音频帧的频谱高频带峰度、 频谱相关度和线性预测残差能量倾斜 度; 其中, 频谱高频带峰度表示当前音频帧的频谱在高频带上的峰度或能量锐 度; 频谱相关度表示当前音频帧的信号谐波结构在相邻帧间的稳定度; 线性预 测残差能量倾斜度表示音频信号的线性预测残差能量随线性预测阶数的升高而 变化的程度;
根据所述当前音频帧的声音活动性, 确定是否将所述频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度存储于存储器中;
其中, 所述根据频谱波动存储器中存储的频谱波动的部分或全部数据的统 计量, 对所述音频帧进行分类包括:
分别获得存储的频谱波动有效数据的均值, 频谱高频带峰度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有效数据的方差; 当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方差小于第四阈值。
第二方面, 提供了一种音频信号的分类装置, 用于对输入的音频信号进行 分类, 包括: 存储确认单元, 用于根据所述当前音频帧的声音活动性, 确定是否获得并 存储当前音频帧的频谱波动, 其中, 所述频谱波动表示音频信号的频谱的能量 波动;
存储器, 用于在存储确认单元输出需要存储的结果时存储所述频谱波动; 更新单元, 用于根据语音帧是否为敲击音乐或历史音频帧的活动性, 更新 存储器中存储的频谱波动;
分类单元, 用于根据存储器中存储的频谱波动的部分或全部有效数据的统 计量, 将所述当前音频帧分类为语音帧或者音乐帧。
在第一种可能的实现方式中, 所述存储确认单元具体用于: 确认当前音频 帧为活动帧时, 输出需要存储当前音频帧的频谱波动的结果。
在第二种可能的实现方式中, 所述存储确认单元具体用于: 确认当前音频 帧为活动帧, 且当前音频帧不属于能量冲击时, 输出需要存储当前音频帧的频 谱波动的结果。
在第三种可能的实现方式中, 所述存储确认单元具体用于: 确认当前音频 帧为活动帧, 且包含当前音频帧与其历史帧在内的多个连续帧都不属于能量冲 击时, 输出需要存储当前音频帧的频谱波动的结果。
结合第二方面或第二方面的第一种可能的实现方式或第二方面的第二种可 能的实现方式或第二方面的第三种可能的实现方式, 在第四种可能的实现方式 中, 所述更新单元具体用于若当前音频帧属于敲击音乐, 则修改频谱波动存储 器中已存储的频谱波动的值。
结合第二方面或第二方面的第一种可能的实现方式或第二方面的第二种可 能的实现方式或第二方面的第三种可能的实现方式, 在第五种可能的实现方式 中, 所述更新单元具体用于: 如果当前音频帧为活动帧, 且前一帧音频帧为非 活动帧时, 则将存储器中已存储的除当前音频帧的频谱波动之外的其他频谱波 动的数据修改为无效数据; 或
如果当前音频帧为活动帧, 且当前音频帧之前连续三帧不全都为活动帧时, 则将当前音频帧的频谱波动修正为第一值; 或
如果当前音频帧为活动帧, 且历史分类结果为音乐信号且当前音频帧的频 谱波动大于第二值, 则将当前音频帧的频谱波动修正为第二值, 其中, 第二值 大于第一值。
结合第二方面或第二方面的第一种可能的实现方式或第二方面的第二种可 能的实现方式或第二方面的第三种可能的实现方式或第二方面的第四种可能的 实现方式或第二方面的第五种可能的实现方式, 在第六种可能的实现方式中, 所述分类单元包括:
计算单元, 用于获得存储器中存储的频谱波动的部分或全部有效数据的均 值;
判断单元, 用于将所述频谱波动的有效数据的均值与音乐分类条件做比较, 当所述频谱波动的有效数据的均值满足音乐分类条件时, 将所述当前音频帧分 类为音乐帧; 否则将所述当前音频帧分类为语音帧。
结合第二方面或第二方面的第一种可能的实现方式或第二方面的第二种可 能的实现方式或第二方面的第三种可能的实现方式或第二方面的第四种可能的 实现方式或第二方面的第五种可能的实现方式, 在第七种可能的实现方式中, 该音频信号分类装置还包括:
参数获得单元, 用于获得当前音频帧的频谱高频带峰度、 频谱相关度、 浊 音度参数和线性预测残差能量倾斜度; 其中, 频谱高频带峰度表示当前音频帧 的频谱在高频带上的峰度或能量锐度; 频谱相关度表示当前音频帧的信号谐波 结构在相邻帧间的稳定度; 浊音度参数表示当前音频帧与一个基音周期之前的 信号的时域相关度; 线性预测残差能量倾斜度表示音频信号的线性预测残差能 量随线性预测阶数的升高而变化的程度;
所述存储确认单元还用于, 根据所述当前音频帧的声音活动性, 确定是否 将所述频谱高频带峰度、 频谱相关度和线性预测残差能量倾斜度存储于存储器 中;
所述存储单元还用于, 当存储确认单元输出需要存储的结果时存储所述频 谱高频带峰度、 频谱相关度和线性预测残差能量倾斜度;
所述分类单元具体用于, 分别获得存储的频谱波动、 频谱高频带峰度、 频 谱相关度和线性预测残差能量倾斜度中有效数据的统计量, 根据所述有效数据 的统计量将所述音频帧分类为语音帧或者音乐帧。
结合第二方面的第七种可能的实现方式, 在第八种可能的实现方式中, 所 述分类单元包括:
计算单元, 用于分别获得存储的频谱波动有效数据的均值, 频谱高频带峰 度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有 效数据的方差;
判断单元, 用于当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一 阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关 度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方 差小于第四阈值。
第三方面, 提供了一种音频信号分类方法, 包括:
将输入音频信号进行分帧处理; 获得当前音频帧的线性预测残差能量倾斜度; 所述线性预测残差能量倾斜 度表示音频信号的线性预测残差能量随线性预测阶数的升高而变化的程度; 将线性预测残差能量倾斜度存储到存储器中;
根据存储器中预测残差能量倾斜度部分数据的统计量, 对所述音频帧进行 分类。
在第一种可能的实现方式中, 将线性预测残差能量倾斜度存储到存储器中 之前还包括:
根据所述当前音频帧的声音活动性, 确定是否将所述线性预测残差能量倾斜度存储于存储器中; 并在确定需要存储时将所述线性预测残差能量倾斜度存储于存储器中。
结合第三方面的或第三方面的第一种可能的实现方式, 在第二种可能的实 现方式中, 预测残差能量倾斜度部分数据的统计量为预测残差能量倾斜度部分 数据的方差; 所述根据存储器中预测残差能量倾斜度部分数据的统计量, 对所 述音频帧进行分类包括:
将预测残差能量倾斜度部分数据的方差与音乐分类阈值相比较, 当所述预 测残差能量倾斜度部分数据的方差小于音乐分类阈值时, 将所述当前音频帧分 类为音乐帧; 否则将所述当前音频帧分类为语音帧。
结合第三方面的或第三方面的第一种可能的实现方式, 在第三种可能的实 现方式中, 该音频信号分类方法还包括:
获得当前音频帧的频谱波动、 频谱高频带峰度和频谱相关度, 并存储于对 应的存储器中;
其中, 所述根据存储器中预测残差能量倾斜度部分数据的统计量, 对所述 音频帧进行分类包括:
分别获得存储的频谱波动、 频谱高频带峰度、 频谱相关度和线性预测残差 能量倾斜度中有效数据的统计量, 根据所述有效数据的统计量将所述音频帧分 类为语音帧或者音乐帧; 所述有效数据的统计量指对存储器中存储的有效数据 运算操作后获得的数据值。
结合第三方面的第三种可能的实现方式, 在第四种可能的实现方式中, 分 别获得存储的频谱波动、 频谱高频带峰度、 频谱相关度和线性预测残差能量倾 斜度中有效数据的统计量, 根据所述有效数据的统计量将所述音频帧分类为语 音帧或者音乐帧包括:
分别获得存储的频谱波动有效数据的均值, 频谱高频带峰度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有效数据的方差; 当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方差小于第四阈值。
结合第三方面的或第三方面的第一种可能的实现方式, 在第五种可能的实 现方式中, 该音频信号分类方法还包括:
获得当前音频帧的频谱音调个数和频谱音调个数在低频带上的比率, 并存 储于对应的存储器;
其中, 所述根据存储器中预测残差能量倾斜度部分数据的统计量, 对所述 音频帧进行分类包括:
分别获得存储的线性预测残差能量倾斜度的统计量、 频谱音调个数的统计 量;
根据所述线性预测残差能量倾斜度的统计量、 频谱音调个数的统计量和频 谱音调个数在低频带上的比率, 将所述音频帧分类为语音帧或者音乐帧; 所述 统计量指对存储器中存储的数据运算操作后获得的数据值。
结合第三方面的第五种可能的实现方式, 在第六种可能的实现方式中, 分 别获得存储的线性预测残差能量倾斜度的统计量、 频谱音调个数的统计量包括: 获得存储的线性预测残差能量倾斜度的方差;
获得存储的频谱音调个数的均值;
根据所述线性预测残差能量倾斜度的统计量、 频谱音调个数的统计量和频 谱音调个数在低频带上的比率, 将所述音频帧分类为语音帧或者音乐帧包括: 当当前音频帧为活动帧, 且满足下列条件之一, 则将所述当前音频帧分类 为音乐帧, 否则将所述当前音频帧分类为语音帧:
线性预测残差能量倾斜度的方差小于第五阈值; 或
频谱音调个数的均值大于第六阈值; 或
频谱音调个数在低频带上的比率小于第七阈值。
结合第三方面或第三方面的第一种可能的实现方式或第三方面的第二种可 能的实现方式或第三方面的第三种可能的实现方式或第三方面的第四种可能的 实现方式或第三方面的第五种可能的实现方式或第三方面的第六种可能的实现 方式, 在第七种可能的实现方式中, 获得当前音频帧的线性预测残差能量倾斜 度包括:
根据下列公式计算当前音频帧的线性预测残差能量倾斜度:
epsP_tilt = Σ_{i=1}^{n} epsP(i)·epsP(i+1) / Σ_{i=1}^{n} epsP(i)·epsP(i)
其中, epsP(i)表示当前音频帧第 i阶线性预测的预测残差能量; n为正整数, 表示线性预测的阶数, 其小于等于线性预测的最大阶数。
结合第三方面的第五种可能的实现方式或第三方面的第六种可能的实现方 式, 在第八种可能的实现方式中, 获得当前音频帧的频谱音调个数和频谱音调 个数在低频带上的比率包括:
统计当前音频帧在 0 ~ 8kHz频带上频点峰值大于预定值的频点数量作为频 谱音调个数;
计算当前音频帧在 0 ~ 4kHz频带上频点峰值大于预定值的频点数量与 0 ~ 8kHz频带上频点峰值大于预定值的频点数量的比值, 作为频谱音调个数在低频 带上的比率。
第四方面, 提供一种信号分类装置, 用于对输入的音频信号进行分类, 其 包括:
分帧单元, 用于对输入音频信号进行分帧处理;
参数获得单元, 用于获得当前音频帧的线性预测残差能量倾斜度; 所述线 性预测残差能量倾斜度表示音频信号的线性预测残差能量随线性预测阶数的升 高而变化的程度;
存储单元, 用于存储线性预测残差能量倾斜度;
分类单元, 用于根据存储器中预测残差能量倾斜度部分数据的统计量, 对 所述音频帧进行分类。
在第一种可能的实现方式中, 信号分类装置还包括:
存储确认单元, 用于根据所述当前音频帧的声音活动性, 确定是否将所述 线性预测残差能量倾斜度存储于存储器中;
所述存储单元具体用于, 当存储确认单元确定需要存储时, 将所述线性预测残差能量倾斜度存储于存储器中。
结合第四方面的或第四方面的第一种可能的实现方式, 在第二种可能的实 现方式中, 预测残差能量倾斜度部分数据的统计量为预测残差能量倾斜度部分 数据的方差;
所述分类单元具体用于将预测残差能量倾斜度部分数据的方差与音乐分类 阈值相比较, 当所述预测残差能量倾斜度部分数据的方差小于音乐分类阈值时, 将所述当前音频帧分类为音乐帧; 否则将所述当前音频帧分类为语音帧。
结合第四方面的或第四方面的第一种可能的实现方式, 在第三种可能的实现方式中, 参数获得单元还用于: 获得当前音频帧的频谱波动、频谱高频带峰度和频谱相关度, 并存储于对应的存储器中;
所述分类单元具体用于: 分别获得存储的频谱波动、 频谱高频带峰度、 频 谱相关度和线性预测残差能量倾斜度中有效数据的统计量, 根据所述有效数据 的统计量将所述音频帧分类为语音帧或者音乐帧; 所述有效数据的统计量指对 存储器中存储的有效数据运算操作后获得的数据值。
第四方面的第三种可能的实现方式, 在第四种可能的实现方式中, 所述分 类单元包括:
计算单元, 用于分别获得存储的频谱波动有效数据的均值, 频谱高频带峰 度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有 效数据的方差;
判断单元, 用于当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一 阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关 度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方 差小于第四阈值。
结合第四方面的或第四方面的第一种可能的实现方式, 在第五种可能的实 现方式中, 所述参数获得单元还用于: 获得当前音频帧的频谱音调个数和频谱 音调个数在低频带上的比率, 并存储于存储器;
所述分类单元具体用于: 分别获得存储的线性预测残差能量倾斜度的统计 量、 频谱音调个数的统计量; 根据所述线性预测残差能量倾斜度的统计量、 频 谱音调个数的统计量和频谱音调个数在低频带上的比率, 将所述音频帧分类为 语音帧或者音乐帧; 所述有效数据的统计量指对存储器中存储的数据运算操作 后获得的数据值。
第四方面的第五种可能的实现方式, 在第六种可能的实现方式中, 所述分 类单元包括:
计算单元, 用于获得线性预测残差能量倾斜度有效数据的方差和存储的频 谱音调个数的均值;
判断单元, 用于当当前音频帧为活动帧, 且满足下列条件之一, 则将所述 当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 线性预测残 差能量倾斜度的方差小于第五阈值; 或频谱音调个数的均值大于第六阈值; 或 频谱音调个数在低频带上的比率小于第七阈值。
结合第四方面或第四方面的第一种可能的实现方式或第四方面的第二种可 能的实现方式或第四方面的第三种可能的实现方式或第四方面的第四种可能的 实现方式或第四方面的第五种可能的实现方式或第四方面的第六种可能的实现 方式, 在第七种可能的实现方式中, 所述参数获得单元根据下列公式计算当前 音频帧的线性预测残差能量倾斜度:
epsP_tilt = Σ_{i=1}^{n} epsP(i)·epsP(i+1) / Σ_{i=1}^{n} epsP(i)·epsP(i)
其中, epsP(i)表示当前音频帧第 i阶线性预测的预测残差能量; n为正整数, 表示线性预测的阶数, 其小于等于线性预测的最大阶数。 结合第四方面的第五种可能的实现方式或第四方面的第六种可能的实现方式, 在第八种可能的实现方式中, 所述参数获得单元用于统计当前音频帧在 0 ~ 8kHz频带上频点峰值大于预定值的频点数量作为频谱音调个数; 所述参数获得单元用于计算当前音频帧在 0 ~ 4kHz频带上频点峰值大于预定值的频点数量与 0 ~ 8kHz频带上频点峰值大于预定值的频点数量的比值, 作为频谱音调个数在低频带上的比率。 本发明实施例根据频谱波动的长时统计量对音频信号进行分类, 参数较少, 识别率较高且复杂度较低; 同时考虑声音活动性和敲击音乐的因素对频谱波动进行调整, 对音乐信号识别率更高, 适合混合音频信号分类。
附图说明 为了更清楚地说明本发明实施例或现有技术中的技术方案, 下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍, 显而易见地, 下面描述中的附图仅仅是本发明的一些实施例, 对于本领域普通技术人员来讲, 在不付出创造性劳动的前提下, 还可以根据这些附图获得其他的附图。
图 1为对音频信号分帧的示意图;
图 2为本发明提供的音频信号分类方法的一个实施例的流程示意图;
图 3为本发明提供的获得频谱波动的一个实施例的流程示意图;
图 4为本发明提供的音频信号分类方法的另一个实施例的流程示意图;
图 5为本发明提供的音频信号分类方法的另一个实施例的流程示意图;
图 6为本发明提供的音频信号分类方法的另一个实施例的流程示意图;
图 7至图 10为本发明提供的音频信号分类的一种具体分类流程图;
图 11为本发明提供的音频信号分类方法的另一个实施例的流程示意图;
图 12为本发明提供的音频信号分类的一种具体分类流程图;
图 13为本发明提供的音频信号的分类装置一个实施例的结构示意图;
图 14为本发明提供的分类单元一个实施例的结构示意图;
图 15为本发明提供的音频信号的分类装置另一个实施例的结构示意图;
图 16为本发明提供的音频信号的分类装置另一个实施例的结构示意图;
图 17为本发明提供的分类单元一个实施例的结构示意图;
图 18为本发明提供的音频信号的分类装置另一个实施例的结构示意图;
图 19为本发明提供的音频信号的分类装置另一个实施例的结构示意图。 具体实施方式
下面将结合本发明实施例中的附图, 对本发明实施例中的技术方案进行清 楚、 完整地描述, 显然, 所描述的实施例仅仅是本发明一部分实施例, 而不是 全部的实施例。 基于本发明中的实施例, 本领域普通技术人员在没有作出创造 性劳动前提下所获得的所有其他实施例, 都属于本发明保护的范围。
数字信号处理领域, 音频编解码器、视频编解码器广泛应用于各种电子设备中, 例如: 移动电话, 无线装置, 个人数据助理 (PDA), 手持式或便携式计算机, GPS接收机/导航器, 照相机, 音频/视频播放器, 摄像机, 录像机, 监控设备等。通常, 这类电子设备中包括音频编码器或音频解码器, 音频编码器或者解码器可以直接由数字电路或芯片例如 DSP (digital signal processor) 实现, 或者由软件代码驱动处理器执行软件代码中的流程而实现。一种音频编码器中, 首先对音频信号进行分类, 对不同类型的音频信号采用不同的编码模式进行编码后, 再将编码后码流传输给解码端。 一般的, 音频信号在处理时采用分帧的方式, 每一帧信号代表一定时长的音频信号。参考图 1, 当前输入的需要分类的音频帧可以称为当前音频帧; 当前音频帧之前的任意一帧音频帧可以称为历史音频帧; 按照从当前音频帧到历史音频帧的时序顺序, 历史音频帧可以依次成为前一音频帧, 前第二帧音频帧, 前第三帧音频帧, 前第 N帧音频帧, N大于等于四。
本实施例中, 输入音频信号为 16kHz采样的宽带音频信号, 输入音频信号 以 20ms为一帧进行分帧, 即每帧 320个时域样点。 在提取特征参数前, 输入音 频信号帧首先经降采样为 12.8kHz采样率, 即 256采样点每帧。后文中的输入音 频信号帧均指降采样后的音频信号帧。
参考图 2, —种音频信号分类方法的一个实施例包括:
S101 : 将输入音频信号进行分帧处理, 根据当前音频帧的声音活动性, 确 定是否获得当前音频帧的频谱波动并存储于频谱波动存储器中, 其中, 频谱波 动表示音频信号的频谱的能量波动;
音频信号分类一般按帧进行, 对每个音频信号帧提取参数进行分类, 以确 定该音频信号帧属于语音帧还是音乐帧, 以采用对应的编码模式进行编码。 一 个实施例中, 可以在音频信号进行分帧处理后, 获得当前音频帧的频谱波动, 再根据当前音频帧的声音活动性, 确定是否将该频谱波动存储于频谱波动存储 器中; 另一个实施例中, 可以在音频信号进行分帧处理后, 根据当前音频帧的 声音活动性, 确定是否将该频谱波动存储于频谱波动存储器中, 在需要存储时 再获得该频谱波动并存储。
频谱波动 flux表示信号频谱的短时或长时能量波动, 为当前音频帧与历史帧在中低频带频谱上对应频率的对数能量差的绝对值的均值; 其中历史帧指当前音频帧之前的任意一帧。一个实施例中, 频谱波动为当前音频帧与其历史帧在中低频带频谱上对应频率的对数能量差的绝对值的均值。另一个实施例中, 频谱波动为当前音频帧与其历史帧在中低频带频谱上对应频谱峰值的对数能量差的绝对值的均值。
参考图 3 , 获得频谱波动的一个实施例包括如下步骤:
S1011: 获得当前音频帧的频谱;
一个实施例中, 可以直接获得音频帧的频谱; 另一个实施例中, 获得当前音频帧任意两个子帧的频谱, 即能量谱; 利用两个子帧的频谱的平均值得到当前音频帧的频谱;
S1012: 获得当前音频帧历史帧的频谱;
其中历史帧指当前音频帧之前的任意一帧音频帧; 一个实施例中可以为当前音频帧之前的第三帧音频帧。 S1013: 计算当前音频帧与历史帧分别在中低频带频谱上对应频率的对数能量差的绝对值的均值, 作为当前音频帧的频谱波动。
一个实施例中, 可以计算当前音频帧在中低频带频谱上所有频点的对数能 量与历史帧在中低频带频谱上对应频点的对数能量之间差值的绝对值的均值; 另一个实施例中, 可以计算当前音频帧在中低频带频谱上频谱峰值的对数 能量与历史帧在中低频带频谱上对应频谱峰值的对数能量之间差值的绝对值的 均值。
中低频带频谱, 例如 0〜fs/4,或者 0〜fs/3的频谱范围。
以输入音频信号为 16kHz采样的宽带音频信号, 输入音频信号以 20ms为一帧为例, 对每 20ms当前音频帧分别做前后两个 256点的 FFT, 两个 FFT窗 50%重叠, 得到当前音频帧两个子帧的频谱 (能量谱), 分别记做 C₀(i)、C₁(i), i=0,1...127, 其中 Cₓ(i)表示第 x个子帧的频谱。当前音频帧第 1子帧的 FFT需要用到前一帧第 2子帧的数据。
Cₓ(i) = rel²(i) + img²(i)
其中, rel(i)和 img(i)分别表示第 i频点 FFT系数的实部和虚部。当前音频帧的频谱 C(i)则由两个子帧的频谱平均得到。
C(i) = (C₀(i) + C₁(i)) / 2
一个实施例中, 当前音频帧的频谱波动 flux为当前音频帧与其 60ms前的帧在中低频带频谱上对应频率的对数能量差的绝对值的均值, 在另一实施例中也可为不同于 60ms的间隔。
flux = (1/43)·Σ_{i=0}^{42} |10·log(C(i)) − 10·log(C₋₃(i))|
其中 C₋₃(i)表示当前音频帧之前的第三历史帧 (即在本实施例中当帧长为 20ms时, 当前音频帧 60ms以前的历史帧) 的频谱。在本文中类似 X₋ₙ的形式, 均表示当前音频帧的第 n历史帧的参数 X, 当前音频帧可省略下角标 0。log(.)表示以 10为底的对数。在另一个实施例中, 当前音频帧的频谱波动 flux也可由下述方法得到, 即, 为当前音频帧与其 60ms前的帧在中低频带频谱上对应频谱峰值的对数能量差的绝对值的均值,
flux = (1/K)·Σ_{i=0}^{K−1} |10·log(P(i)) − 10·log(P₋₃(i))|
其中 P(i)表示当前音频帧的频谱的第 i个局部峰值能量, 局部峰值所在的频点即为频谱上能量高于高低两相邻频点上能量的频点, K表示中低频带频谱上局部峰值的个数。 其中, 根据当前音频帧的声音活动性, 确定是否将该频谱波动存储于频谱波动存储器中, 可以用多种方式实现:
一个实施例中, 若音频帧的声音活动性参数表示音频帧为活动帧, 则将音 频帧的频谱波动存储于频谱波动存储器中; 否则不存储。
另一个实施例中, 根据音频帧的声音活动性和音频帧是否为能量冲击, 确 定是否将所述频谱波动存储于存储器中。 若音频帧的声音活动性参数表示音频 帧为活动帧, 且表示音频帧是否为能量冲击的参数表示音频帧不属于能量冲击, 则将音频帧的频谱波动存储于频谱波动存储器中; 否则不存储; 另一个实施例 中, 若当前音频帧为活动帧, 且包含当前音频帧与其历史帧在内的多个连续帧 都不属于能量冲击, 则将音频帧的频谱波动存储于频谱波动存储器中; 否则不 存储。 例如, 当前音频帧为活动帧, 且当前音频帧、 前一帧音频帧和前第二帧 音频帧都不属于能量冲击, 则将音频帧的频谱波动存储于频谱波动存储器中; 否则不存储。
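The spectrum-fluctuation (flux) computation defined by the formulas above can be sketched in code. This is an illustrative sketch only, assuming strictly positive spectral energy values; the function name is not from this document.

```python
import math

def spectral_flux(cur_spec, hist_spec):
    """Mean absolute difference of the 10*log10 energies of corresponding
    low/mid-band bins between the current frame and a historical frame
    (the embodiment above uses bins 0..42 and the frame 60 ms earlier)."""
    n = len(cur_spec)
    return sum(abs(10 * math.log10(c) - 10 * math.log10(h))
               for c, h in zip(cur_spec, hist_spec)) / n
```

The peak-based variant replaces the per-bin energies with the energies of matched local spectral peaks, but has the same mean-absolute-log-difference shape.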
声音活动性标识 vad_flag表示当前输入信号是活动的前景信号 (语音, 音乐等) 还是前景信号静默的背景信号 (如背景噪声, 静音等), 由声音活动性检测器 VAD获得。vad_flag=1表示输入信号帧为活动帧, 即前景信号帧, 反之 vad_flag=0表示背景信号帧。由于 VAD不属本发明的发明内容, VAD的具体算法在此不再赘述。
声音冲击标识 attack_flag表示当前音频帧是否属于音乐中的一个能量冲击。当当前音频帧之前的若干历史帧以音乐帧为主时, 若当前音频帧的帧能量较其之前第一历史帧有较大跃升, 且较其之前一段时间内音频帧的平均能量有较大跃升, 且当前音频帧的时域包络较其之前一段时间内音频帧的平均包络也有较大跃升时, 则认为当前音频帧属于音乐中的能量冲击。
根据所述当前音频帧的声音活动性, 当当前音频帧为活动帧时, 才存储当 前音频帧的频谱波动; 能够降低非活动帧的误判率, 提高音频分类的识别率。
当如下条件满足时, attack_flag置 1, 即表示当前音频帧为一个音乐中的能量冲击:
etot − etot₋₁ > 6
etot − lp_speech > 5
mode_mov > 0.9
log_max_spl − mov_log_max_spl > 5
其中, etot表示当前音频帧的对数帧能量; etot₋₁表示前一音频帧的对数帧能量; lp_speech表示对数帧能量 etot的长时滑动平均; log_max_spl和 mov_log_max_spl分别表示当前音频帧的时域最大对数样点幅度及其长时滑动平均; mode_mov表示信号分类中历史最终分类结果的长时滑动平均。
以上公式的含义是, 当当前音频帧之前的若干历史帧以音乐帧为主时, 若当前音频帧的帧能量较其之前第一历史帧有较大跃升, 且较其之前一段时间内音频帧的平均能量有较大跃升, 且当前音频帧的时域包络较其之前一段时间内音频帧的平均包络也有较大跃升时, 则认为当前音频帧属于音乐中的能量冲击。
对数帧能量 etot, 由输入音频帧的对数总子带能量表示:
etot = 10·log₁₀( Σⱼ Σ_{i=lb(j)}^{hb(j)} C(i) )
其中, hb(j)、lb(j)分别表示输入音频帧频谱中第 j子带的高低频边界; C(i)表示输入音频帧的频谱。 当前音频帧的时域最大对数样点幅度的长时滑动平均 mov_log_max_spl只在活动声音帧中更新:
mov_log_max_spl = 0.95·mov_log_max_spl₋₁ + 0.05·log_max_spl, 当 log_max_spl > mov_log_max_spl₋₁时
mov_log_max_spl = 0.995·mov_log_max_spl₋₁ + 0.005·log_max_spl, 当 log_max_spl ≤ mov_log_max_spl₋₁时
一个实施例中, 当前音频帧的频谱波动 flux被缓存在一个 FIFO的 flux历史 buffer中, 本实施例中 flux历史 buffer的长度为 60 (60帧)。判断当前音频帧的声音活动性和音频帧是否为能量冲击, 当当前音频帧为前景信号帧且当前音频帧及其之前的两帧均未出现属于音乐的能量冲击, 则将当前音频帧的频谱波动 flux存储于存储器中。 在缓存当前音频帧的 flux之前, 检查是否满足如下条件:
vad_flag ≠ 0
attack_flag ≠ 1
attack_flag₋₁ ≠ 1
attack_flag₋₂ ≠ 1
若满足, 则缓存, 否则不缓存。 其中, vad_flag表示当前输入信号是活动的前景信号还是前景信号静默的背景信号, vad_flag=0表示背景信号帧; attack_flag表示当前音频帧是否属于音乐中的一个能量冲击, attack_flag=1表示当前音频帧为一个音乐中的能量冲击。 上述公式的含义为: 当前音频帧为活动帧, 且当前音频帧、前一帧音频帧和前第二帧音频帧均不属于能量冲击。 S102: 根据音频帧是否为敲击音乐或历史音频帧的活动性, 更新频谱波动存储器中存储的频谱波动;
一个实施例中, 若表示音频帧是否属于敲击音乐的参数表示当前音频帧属于敲击音乐, 则修改频谱波动存储器中存储的频谱波动的值, 将频谱波动存储器中有效的频谱波动值修改为小于等于音乐阈值的一个值, 其中当音频帧的频谱波动小于该音乐阈值时该音频帧被分类为音乐帧。一个实施例中, 将有效的频谱波动值重置为 5。即当敲击声响标识 percus_flag 被置为 1 时, flux 历史 buffer 中所有的有效缓冲数据均被重置为 5。这里, 有效缓冲数据等价于有效频谱波动值。一般的, 音乐帧的频谱波动值较低, 而语音帧的频谱波动值较高。当音频帧属于敲击音乐时, 将有效的频谱波动值修改为小于等于音乐阈值的一个值, 则能提高该音频帧被分类为音乐帧的概率, 从而提高音频信号分类的准确率。
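percus_flag 置 1 时对 flux 历史 buffer 的重置操作可示意如下 (以 -1 表示无效数据, 函数名为示意性假设):

```python
def reset_flux_on_percussive(flux_buf, percus_flag, music_value=5):
    """检测到敲击音乐时, 将 buffer 中所有有效数据重置为 music_value (本实施例为 5)。"""
    if percus_flag == 1:
        return [music_value if v != -1 else v for v in flux_buf]
    return list(flux_buf)
```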
另一个实施例中, 根据当前音频帧的历史帧的活动性, 更新存储器中的频谱波动。具体的, 一个实施例中, 如果确定当前音频帧的频谱波动存储于频谱波动存储器中, 且前一帧音频帧为非活动帧, 则将频谱波动存储器中已存储的除当前音频帧的频谱波动之外的其他频谱波动的数据修改为无效数据。前一帧音频帧为非活动帧而当前音频帧为活动帧时, 当前音频帧与历史帧的语音活动性不同, 将历史帧的频谱波动无效化, 则能降低历史帧对音频分类的影响, 从而提高音频信号分类的准确率。
另一个实施例中, 如果确定当前音频帧的频谱波动存储于频谱波动存储器 中, 且当前音频帧之前连续三帧不全都为活动帧, 则将当前音频帧的频谱波动 修正为第一值。 第一值可以为语音阈值, 其中当音频帧的频谱波动大于该语音 阈值时该音频被分类为语音帧。 另一个实施例中, 如果确定当前音频帧的频谱 波动存储于频谱波动存储器中, 且历史帧的分类结果为音乐帧且当前音频帧的 频谱波动大于第二值, 则将当前音频帧的频谱波动修正为第二值, 其中, 第二 值大于第一值。
如果当前音频帧的 flux 被缓存, 且前一帧音频帧为非活动帧 (vad_flag=0), 则除被新缓存入 flux 历史 buffer 的当前音频帧 flux 以外, 其余 flux 历史 buffer 中的数据全部重置为 -1 (等价于将这些数据无效化)。

如果 flux 被缓存入 flux 历史 buffer, 且当前音频帧之前连续三帧不全都为活动帧 (vad_flag=1), 则将刚缓存入 flux 历史 buffer 的当前音频帧 flux 修正为 16, 即检查是否满足如下条件:

vad_flag_-1 = 1
vad_flag_-2 = 1
vad_flag_-3 = 1

若不满足, 则将刚缓存入 flux 历史 buffer 的当前音频帧 flux 修正为 16; 如果当前音频帧之前连续三帧都为活动帧 (vad_flag=1), 则检查是否满足如下条件:

mode_mov > 0.9
flux > 20

若满足, 则将刚缓存入 flux 历史 buffer 的当前音频帧 flux 修正为 20, 否则不做操作。
其中, mode_mov 表示信号分类中历史最终分类结果的长时滑动平均; mode_mov>0.9 表示信号处于音乐信号中, 根据音频信号的历史分类结果将 flux 进行限制, 以降低 flux 出现语音特征的概率, 目的是提高判断分类的稳定性。
当当前音频帧之前连续三帧历史帧都为非活动帧, 当前音频帧为活动帧时, 或当前音频帧之前连续三帧不全都为活动帧, 当前音频帧为活动帧时, 此时处 于分类的初始化阶段。 在一个实施例中为了使分类结果倾向于语音(音乐), 可 以将当前音频帧的频谱波动修改为语音(音乐) 阈值或接近于语音(音乐) 阈 值的数值。 在另一个实施例中, 如果当前信号之前的信号是语音(音乐)信号, 则可以将当前音频帧的频谱波动修改为语音(音乐) 阈值或接近于语音(音乐) 阈值的数值以提高判断分类的稳定性。 在另一个实施例中, 为了使分类结果倾 向于音乐, 可以对频谱波动进行限制, 即可以修改当前音频帧的频谱波动使其 不大于一个阈值, 以降低频谱波动判定为语音特征的概率。
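上述对 flux 历史 buffer 的三种修正可以用如下 Python 代码示意 (变量名与数据组织方式均为示意性假设, 并非本发明的规范实现; -1 表示无效数据):

```python
def update_flux_history(flux_buf, vad_prev, vad_hist3, mode_mov):
    """flux_buf 末尾为刚缓存的当前帧 flux; vad_prev 为前一帧 vad_flag;
    vad_hist3 为当前帧之前连续三帧的 vad_flag。"""
    buf = list(flux_buf)
    if vad_prev == 0:
        # 前一帧为非活动帧: 除当前帧 flux 外全部无效化
        buf = [-1] * (len(buf) - 1) + [buf[-1]]
    elif not all(v == 1 for v in vad_hist3):
        # 之前连续三帧不全为活动帧: 当前帧 flux 修正为 16
        buf[-1] = 16
    elif mode_mov > 0.9 and buf[-1] > 20:
        # 历史分类以音乐为主: 限制当前帧 flux 不超过 20
        buf[-1] = 20
    return buf
```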
敲击声响标识 percus_flag 表示音频帧中是否有敲击声响存在。percus_flag 置 1 表示检测到敲击声响, 置 0 则表示没有检测到敲击声响。
当当前信号 (即包括当前音频帧和其若干历史帧在内的若干最新的信号帧) 在短时和长时均出现较尖锐的能量突起, 且当前信号不具有明显的浊音特征时, 若当前音频帧之前的若干历史帧以音乐帧为主, 则认为当前信号是一个敲击音乐; 否则, 进一步的, 若当前信号的每个子帧均不具有明显的浊音特征, 且第二历史帧的时域最大对数样点幅度较其长时滑动平均有较大跃升时, 则也认为当前信号是一个敲击音乐。
敲击声响标识 percus_flag 通过如下步骤获得:
首先获得输入音频帧的对数帧能量 etot, 由输入音频帧的对数总子带能量表示:

etot = 10·log10( Σ_j Σ_{i=lb(j)}^{hb(j)} C²(i) )

其中, hb(j)、lb(j) 分别表示输入帧频谱第 j 子带的高低频边界, C(i) 表示输入音频帧的频谱。
当满足如下两组条件中的任一组时, percus_flag 置 1, 否则置 0。
etot_-2 − etot_-3 > 6
etot_-2 − etot_-1 > 0
etot_-2 − etot > 3
etot_-1 − etot > 0
etot_-2 − lp_speech > 3
0.5·voicing_-1(1) + 0.25·voicing(0) + 0.25·voicing(1) < 0.75
mode_mov > 0.9

或者满足:

etot_-2 − etot_-3 > 6
etot_-2 − etot_-1 > 0
etot_-2 − etot > 3
etot_-1 − etot > 0
etot_-2 − lp_speech > 3
0.5·voicing_-1(1) + 0.25·voicing(0) + 0.25·voicing(1) < 0.75
voicing_-1(0) < 0.8
voicing_-1(1) < 0.8
voicing(0) < 0.8
log_max_spl_-2 − mov_log_max_spl_-2 > 10

其中, etot 表示当前音频帧的对数帧能量; lp_speech 表示对数帧能量 etot 的长时滑动平均; voicing(0)、voicing_-1(0)、voicing_-1(1) 分别表示当前输入音频帧第一子帧和第一历史帧的第一、第二子帧的归一化开环基音相关度; 浊音度参数 voicing 是通过线性预测分析得到的, 代表当前音频帧与一个基音周期之前的信号的时域相关度, 取值在 0~1 之间; mode_mov 表示信号分类中历史最终分类结果的长时滑动平均; log_max_spl_-2 和 mov_log_max_spl_-2 分别表示第二历史帧的时域最大对数样点幅度及其长时滑动平均。lp_speech 在每一活动声音帧 (即 vad_flag=1 的帧) 中进行更新, 其更新方法为:

lp_speech = 0.99·lp_speech_-1 + 0.01·etot

以上两组条件的含义为: 当当前信号 (即包括当前音频帧和其若干历史帧在内的若干最新的信号帧) 在短时和长时均出现较尖锐的能量突起, 且当前信号不具有明显的浊音特征时, 若当前音频帧之前的若干历史帧以音乐帧为主, 则认为当前信号是一个敲击音乐; 否则, 进一步的, 若当前信号的每个子帧均不具有明显的浊音特征, 且第二历史帧的时域最大对数样点幅度较其长时滑动平均有较大跃升时, 则也认为当前信号是一个敲击音乐。
浊音度参数 voicing, 即归一化开环基音相关度, 表示当前音频帧与一个基 音周期之前的信号的时域相关度, 可以由 ACELP的开环基音搜索中获得, 取值 在 0~ 1之间。 由于属现有技术, 本发明不做详述。 本实施例中当前音频帧的两 个子帧各计算一个 voicing, 求平均得到当前音频帧的 voicing参数。 当前音频 帧的 voicing参数也被緩存在一个 voicing历史 buffer中,本实施例中 voicing 历史 buffer的长度为 10。
mode_mov 在每一活动声音帧且在该帧之前已出现连续 30 帧以上的声音活动帧时进行更新, 更新方法为:

mode_mov = 0.95·mode_mov_-1 + 0.05·mode

其中 mode 为当前输入音频帧的分类结果, 二元取值, "0" 表示语音类别, "1" 表示音乐类别。
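mode_mov 的更新即一阶滑动平均, 可示意如下 (函数名为示意性假设):

```python
def update_mode_mov(mode_mov_prev, mode):
    """mode 为当前帧分类结果: 0 表示语音类别, 1 表示音乐类别。"""
    return 0.95 * mode_mov_prev + 0.05 * mode
```

例如连续输入音乐帧 (mode=1) 时, mode_mov 将渐近趋向 1。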
S103: 根据频谱波动存储器中存储的频谱波动的部分或全部数据的统计量, 将该当前音频帧分类为语音帧或者音乐帧。当频谱波动的有效数据的统计量满足语音分类条件时, 将所述当前音频帧分类为语音帧; 当频谱波动的有效数据的统计量满足音乐分类条件时, 将所述当前音频帧分类为音乐帧。此处的统计量为对频谱波动存储器中存储的有效的频谱波动 (即有效数据) 做统计操作得到的值, 例如统计操作可以为求平均值或者求方差。下面实施例中的统计量具有类似的含义。
一个实施例中, 步骤 S103 包括:
获得频谱波动存储器中存储的频谱波动的部分或全部有效数据的均值; 当所获得的频谱波动的有效数据的均值满足音乐分类条件时, 将所述当前 音频帧分类为音乐帧; 否则将所述当前音频帧分类为语音帧。
例如, 当所获得的频谱波动的有效数据的均值小于音乐分类阈值时, 将所 述当前音频帧分类为音乐帧; 否则将所述当前音频帧分类为语音帧。
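该均值判决可以用如下 Python 代码示意 (阈值取值与函数名均为示意性假设, -1 表示无效数据):

```python
def classify_by_flux(flux_buf, music_threshold=10.0):
    """对 flux 历史 buffer 中的有效数据求均值, 小于音乐分类阈值时判为音乐帧。
    返回 1 表示音乐帧, 0 表示语音帧。"""
    valid = [v for v in flux_buf if v != -1]
    mean = sum(valid) / len(valid)
    return 1 if mean < music_threshold else 0
```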
一般的, 音乐帧的频谱波动值较小, 而语音帧的频谱波动值较大。因此可以根据频谱波动对当前音频帧进行分类。当然还可以采用其他分类方法对该当前音频帧进行信号分类。例如, 统计频谱波动存储器中存储的频谱波动的有效数据的数量; 根据该有效数据的数量, 将频谱波动存储器由近端到远端划分出至少两个不同长度的区间, 获得每个区间对应的频谱波动的有效数据的统计量; 其中, 所述区间的起点为当前帧频谱波动存储位置, 近端为存储有当前帧频谱波动的一端, 远端为存储有历史帧频谱波动的一端; 根据较短区间内的频谱波动统计量对所述音频帧进行分类, 若此区间内的参数统计量足够区分出所述音频帧的类型则分类过程结束, 否则在其余较长区间中最短的区间内继续分类过程, 并以此类推。在每个区间的分类过程中, 根据每一个区间对应的分类阈值, 对所述当前音频帧进行分类, 将所述当前音频帧分类为语音帧或者音乐帧, 当频谱波动的有效数据的统计量满足语音分类条件时, 将所述当前音频帧分类为语音帧; 当频谱波动的有效数据的统计量满足音乐分类条件时, 将所述当前音频帧分类为音乐帧。
在信号分类后, 可以对不同的信号采用不同的编码模式进行编码。 例如, 语音信号采用基于语音产生模型的编码器(如 CELP )进行编码, 对音乐信号采 用基于变换的编码器(如基于 MDCT的编码器)进行编码。 上述实施例, 由于根据频谱波动的长时统计量对音频信号进行分类, 参数 较少, 识别率较高且复杂度较低; 同时考虑声音活动性和敲击音乐的因素对频 谱波动进行调整, 对音乐信号识别率更高, 适合混合音频信号分类。
参考图 4, 另一个实施例中, 在步骤 S102之后还包括:
S104: 获得当前音频帧的频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度, 将所述频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度存储于存储器中; 频谱高频带峰度表示当前音频帧频谱在高频带上的峰度或能量锐度; 频谱相关度表示信号谐波结构在相邻帧间的稳定度; 线性预测残差能量倾斜度表示输入音频信号的线性预测残差能量随线性预测阶数的升高而变化的程度;
可选的, 在存储这些参数之前, 还包括: 根据所述当前音频帧的声音活动 性, 确定是否将频谱高频带峰度、 频谱相关度和线性预测残差能量倾斜度存储 于存储器中; 如果当前音频帧为活动帧, 则存储上述参数; 否则不存储。
频谱高频带峰度表示当前音频帧频谱在高频带上的峰度或能量锐度; 一个 实施例中, 通过下列公式计算频谱高频带峰度 ph:
ph = Σ_{i=64}^{126} p2v_map(i)

其中 p2v_map(i) 表示频谱第 i 个频点的峰度, 峰度 p2v_map(i) 由下式得到:

p2v_map(i) = 2·peak(i) − vl(i) − vr(i), 当 peak(i) ≠ 0
p2v_map(i) = 0, 当 peak(i) = 0

其中 peak(i) = C(i), 如果第 i 频点是频谱的局部峰值, 否则 peak(i) = 0; vl(i) 和 vr(i) 分别表示第 i 个频点的低频侧和高频侧与之最临近的频谱局部谷值 v(n):

peak(i) = C(i), 当 C(i) > C(i−1) 且 C(i) > C(i+1)
v(n) = C(i), 当 C(i) < C(i−1) 且 C(i) < C(i+1)

当前音频帧的频谱高频带峰度 ph 也被缓存在一个 ph 历史 buffer 中, 本实施例中 ph 历史 buffer 的长度为 60。频谱相关度 cor_map_sum 表示信号谐波结构在相邻帧间的稳定度, 其通过如下步骤计算获得:
首先获得输入音频帧 C(i) 的去底频谱 C'(i):

C'(i) = C(i) − floor(i)

其中, floor(i), i=0,1,...,127, 表示输入音频帧频谱的谱底, 由相邻两个频谱谷值之间的线性插值得到:

floor(i) = vl(i) + (vr(i) − vl(i)) · (i − idx[vl(i)]) / (idx[vr(i)] − idx[vl(i)])

其中, idx[x] 表示 x 在频谱上的位置, idx[x]=0,1,...,127。

然后在每两个相邻的频谱谷值之间, 求输入音频帧与之前一帧的去底频谱的归一化互相关 cor(n):

cor(n) = ( Σ_{i=lb(n)}^{hb(n)} C'(i)·C'_-1(i) )² / ( Σ_{i=lb(n)}^{hb(n)} C'(i)² · Σ_{i=lb(n)}^{hb(n)} C'_-1(i)² )

其中, lb(n)、hb(n) 分别表示第 n 个频谱谷值区间 (即位于两个相邻谷值之间的区域) 的端点位置, 即限定该谷值区间的两个频谱谷值的位置。

最后, 通过下列公式计算输入音频帧的频谱相关度 cor_map_sum:

cor_map_sum = Σ_{i=0}^{127} cor(n(i))

其中, n(i) 表示满足 lb(n) ≤ i 且 hb(n) > i 的谷值区间序号, 即每个频点 i 取其所在谷值区间的 cor(n) 并求和。
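前文的频点峰度 p2v_map 与频谱高频带峰度 ph 的计算可以用如下 Python 代码示意 (假设 C 为 128 点频谱, 且 p2v_map 取 2·peak(i)−vl(i)−vr(i) 的形式, 这是对上文公式的一种示意性理解, 并非规范实现):

```python
def p2v_map(C):
    """对每个频点计算峰度: 局部峰值处取 2*C(i) - vl(i) - vr(i), 否则为 0。"""
    n = len(C)
    # 所有局部谷值的位置
    valleys = [i for i in range(1, n - 1) if C[i] < C[i - 1] and C[i] < C[i + 1]]
    out = [0.0] * n
    for i in range(1, n - 1):
        if C[i] > C[i - 1] and C[i] > C[i + 1]:    # 第 i 频点是局部峰值
            left = [v for v in valleys if v < i]   # 低频侧最临近谷值
            right = [v for v in valleys if v > i]  # 高频侧最临近谷值
            if left and right:
                out[i] = 2 * C[i] - C[left[-1]] - C[right[0]]
    return out


def spectral_ph(C):
    """频谱高频带峰度: 对高频带 (此处示意取第 64~126 频点) 上的峰度求和。"""
    p = p2v_map(C)
    return sum(p[64:127])
```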
线性预测残差能量倾斜度 epsP_tilt 表示输入音频信号的线性预测残差能量随线性预测阶数的升高而变化的程度。可以通过下列公式计算获得:

epsP_tilt = Σ_{i=1}^{n} epsP(i)·epsP(i+1) / Σ_{i=1}^{n} epsP(i)·epsP(i)

其中, epsP(i) 表示第 i 阶线性预测的预测残差能量; n 为正整数, 表示线性预测的阶数, 其小于等于线性预测的最大阶数。例如一个实施例中, n=15。则步骤 S103 可以被以下步骤替代:
S105: 分别获得存储的频谱波动、频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度中有效数据的统计量, 根据所述有效数据的统计量将所述音频帧分类为语音帧或者音乐帧; 所述有效数据的统计量指对存储器中存储的有效数据运算操作后获得的数据值, 运算操作可以包括求均值、求方差等操作。
一个实施例中, 该步骤包括:
分别获得存储的频谱波动有效数据的均值, 频谱高频带峰度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有效数据的方差; 当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方差小于第四阈值。
一般的, 音乐帧的频谱波动值较小, 而语音帧的频谱波动值较大; 音乐帧的频谱高频带峰度值较大, 语音帧的频谱高频带峰度较小; 音乐帧的频谱相关度的值较大, 语音帧的频谱相关度值较小; 音乐帧的线性预测残差能量倾斜度的变化较小, 而语音帧的线性预测残差能量倾斜度的变化较大。因此可以根据上述参数的统计量对当前音频帧进行分类。当然还可以采用其他分类方法对该当前音频帧进行信号分类。例如, 统计频谱波动存储器中存储的频谱波动的有效数据的数量; 根据该有效数据的数量, 将存储器由近端到远端划分出至少两个不同长度的区间, 获得每个区间对应的频谱波动有效数据的均值、频谱高频带峰度有效数据的均值、频谱相关度有效数据的均值和线性预测残差能量倾斜度有效数据的方差; 其中, 所述区间的起点为当前帧频谱波动的存储位置, 近端为存储有当前帧频谱波动的一端, 远端为存储有历史帧频谱波动的一端; 根据较短区间内的上述参数的有效数据的统计量对所述音频帧进行分类, 若此区间内的参数统计量足够区分出所述音频帧的类型则分类过程结束, 否则在其余较长区间中最短的区间内继续分类过程, 并以此类推。在每个区间的分类过程中, 根据每一个区间对应的分类阈值, 对所述当前音频帧进行分类, 当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方差小于第四阈值。
在信号分类后, 可以对不同的信号采用不同的编码模式进行编码。 例如, 语音信号采用基于语音产生模型的编码器(如 CELP )进行编码, 对音乐信号采 用基于变换的编码器 (如基于 MDCT的编码器 )进行编码。
上述实施例中, 根据频谱波动、 频谱高频带峰度、 频谱相关度和线性预测 残差能量倾斜度的长时统计量对音频信号进行分类, 参数较少, 识别率较高且 复杂度较低; 同时考虑声音活动性和敲击音乐的因素对频谱波动进行调整, 根 据当前音频帧所处信号环境, 对频谱波动进行修正, 提高分类识别率, 适合混 合音频信号分类。
参考图 5 , 音频信号分类方法的另一个实施例包括:
S501 : 将输入音频信号进行分帧处理; 音频信号分类一般按帧进行, 对每个音频信号帧提取参数进行分类, 以确 定该音频信号帧属于语音帧还是音乐帧, 以采用对应的编码模式进行编码。
S502: 获得当前音频帧的线性预测残差能量倾斜度; 线性预测残差能量倾斜度表示音频信号的线性预测残差能量随线性预测阶数的升高而变化的程度; 一个实施例中, 线性预测残差能量倾斜度 epsP_tilt 可以通过下列公式计算获得:

epsP_tilt = Σ_{i=1}^{n} epsP(i)·epsP(i+1) / Σ_{i=1}^{n} epsP(i)·epsP(i)

其中, epsP(i) 表示第 i 阶线性预测的预测残差能量; n 为正整数, 表示线性预测的阶数, 其小于等于线性预测的最大阶数。例如一个实施例中, n=15。
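上式可以用如下 Python 函数示意 (传入各阶残差能量列表 eps, eps[0] 对应一阶; 数据组织方式为示意性假设):

```python
def epsP_tilt(eps, n=None):
    """eps[i-1] 为第 i 阶线性预测的预测残差能量; n 缺省取 len(eps) - 1。"""
    if n is None:
        n = len(eps) - 1
    num = sum(eps[i] * eps[i + 1] for i in range(n))  # 相邻阶残差能量的互相关
    den = sum(eps[i] * eps[i] for i in range(n))      # 残差能量的自相关
    return num / den
```

残差能量随阶数升高衰减越快 (语音更常见), epsP_tilt 越偏离 1; 衰减平缓 (音乐更常见) 时其值接近 1。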
S503: 将线性预测残差能量倾斜度存储到存储器中;
可以将线性预测残差能量倾斜度存储到存储器中。 一个实施例中, 该存储 器可以为一个 FIFO的 buffer, 该 buffer的长度为 60个存储单位(即可存储 60 个线性预测残差能量倾斜度)。 可选的, 在存储线性预测残差能量倾斜度之前, 还包括: 根据所述当前音 频帧的声音活动性, 确定是否将线性预测残差能量倾斜度存储于存储器中; 如 果当前音频帧为活动帧, 则存储线性预测残差能量倾斜度; 否则不存储。
S504: 根据存储器中预测残差能量倾斜度部分数据的统计量, 对所述音频 帧进行分类。
一个实施例中, 预测残差能量倾斜度部分数据的统计量为预测残差能量倾 斜度部分数据的方差; 则步骤 S504包括:
将预测残差能量倾斜度部分数据的方差与音乐分类阈值相比较, 当所述预 测残差能量倾斜度部分数据的方差小于音乐分类阈值时, 将所述当前音频帧分 类为音乐帧; 否则将所述当前音频帧分类为语音帧。
一般的, 音乐帧的线性预测残差能量倾斜度值变化较小, 而语音帧的线性 预测残差能量倾斜度值变化较大。 而因此可以根据线性预测残差能量倾斜度的 统计量对当前音频帧进行分类。 当然还可以结合其他参数采用其他分类方法对 该当前音频帧进行信号分类。
另一个实施例中, 步骤 S504 之前还包括: 获得当前音频帧的频谱波动、频谱高频带峰度和频谱相关度, 并存储于对应的存储器中。则步骤 S504 具体为: 分别获得存储的频谱波动、频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度中有效数据的统计量, 根据所述有效数据的统计量将所述音频帧分类为语音帧或者音乐帧; 所述有效数据的统计量指对存储器中存储的有效数据运算操作后获得的数据值。进一步的, 分别获得存储的频谱波动、频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度中有效数据的统计量, 根据所述有效数据的统计量将所述音频帧分类为语音帧或者音乐帧包括:
分别获得存储的频谱波动有效数据的均值, 频谱高频带峰度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有效数据的方差; 当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方差小于第四阈值。
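该四参数"或"判决可以用如下 Python 代码示意 (四个阈值取值仅为示意性假设, 并非本发明给出的具体数值):

```python
def classify_frame(flux_mean, ph_mean, cor_mean, eps_tilt_var,
                   thr=(10.0, 800.0, 75.0, 0.001)):
    """thr 依次对应第一~第四阈值。满足任一音乐条件返回 1 (音乐帧), 否则 0 (语音帧)。"""
    t1, t2, t3, t4 = thr
    if flux_mean < t1 or ph_mean > t2 or cor_mean > t3 or eps_tilt_var < t4:
        return 1
    return 0
```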
一般的, 音乐帧的频谱波动值较小, 而语音帧的频谱波动值较大; 音乐帧的频谱高频带峰度值较大, 语音帧的频谱高频带峰度较小; 音乐帧的频谱相关度的值较大, 语音帧的频谱相关度值较小; 音乐帧的线性预测残差能量倾斜度值变化较小, 而语音帧的线性预测残差能量倾斜度值变化较大。因此可以根据上述参数的统计量对当前音频帧进行分类。
另一个实施例中, 步骤 S504之前还包括: 获得当前音频帧的频谱音调个数 和频谱音调个数在低频带上的比率, 并存储于对应的存储器。 则步骤 S504具体 为:
分别获得存储的线性预测残差能量倾斜度的统计量、 频谱音调个数的统计 量;
根据所述线性预测残差能量倾斜度的统计量、 频谱音调个数的统计量和频 谱音调个数在低频带上的比率, 将所述音频帧分类为语音帧或者音乐帧; 所述 统计量指对存储器中存储的数据运算操作后获得的数据值。
进一步的, 分别获得存储的线性预测残差能量倾斜度的统计量、 频谱音调 个数的统计量包括: 获得存储的线性预测残差能量倾斜度的方差; 获得存储的 频谱音调个数的均值。 根据所述线性预测残差能量倾斜度的统计量、 频谱音调 个数的统计量和频谱音调个数在低频带上的比率, 将所述音频帧分类为语音帧 或者音乐帧包括:
当当前音频帧为活动帧, 且满足下列条件之一, 则将所述当前音频帧分类 为音乐帧, 否则将所述当前音频帧分类为语音帧:
线性预测残差能量倾斜度的方差小于第五阈值; 或 频谱音调个数的均值大于第六阈值; 或
频谱音调个数在低频带上的比率小于第七阈值。 其中, 获得当前音频帧的频谱音调个数和频谱音调个数在低频带上的比率 包括:
统计当前音频帧在 0~8kHz频带上频点峰值大于预定值的频点数量作为频 谱音调个数;
计算当前音频帧在 0~4kHz频带上频点峰值大于预定值的频点数量与 0~ 8kHz频带上频点峰值大于预定值的频点数量的比值, 作为频谱音调个数在低频 带上的比率。 一个实施例中, 预定值为 50。
频谱音调个数 Ntonal 表示当前音频帧中的 0~8kHz 频带上频点峰值大于预定值的频点个数。一个实施例中, 可以通过如下方式获得: 对当前音频帧, 统计其在 0~8kHz 频带上频点峰度 p2v_map(i) 大于 50 的个数, 即为 Ntonal, 其中, p2v_map(i) 表示频谱第 i 个频点的峰度, 其计算方式可以参考上述实施例的描述。
频谱音调个数在低频带上的比率 ratio_Ntonal_lf 表示低频带音调个数与频谱音调个数的比值。一个实施例中, 可以通过如下方式获得: 对当前音频帧, 统计其在 0~4kHz 频带上 p2v_map(i) 大于 50 的个数, 记为 Ntonal_lf。ratio_Ntonal_lf 为 Ntonal_lf 与 Ntonal 的比值, 即 Ntonal_lf/Ntonal。其中, p2v_map(i) 表示频谱第 i 个频点的峰度, 其计算方式可以参考上述实施例的描述。另一个实施例中, 分别获得存储的多个 Ntonal 的均值和存储的多个 Ntonal_lf 的均值, 计算 Ntonal_lf 的均值与 Ntonal 的均值的比值, 作为频谱音调个数在低频带上的比率。本实施例中, 根据线性预测残差能量倾斜度的长时统计量对音频信号进行分类, 同时兼顾了分类的鲁棒性和分类的识别速度, 分类参数较少但结果较为准确, 复杂度低、内存开销低。
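Ntonal、Ntonal_lf 与 ratio_Ntonal_lf 的统计可以用如下 Python 代码示意 (假设 p2v 为 128 点峰度序列对应 0~8kHz, 前 64 点对应 0~4kHz, 阈值 50 沿用上文取值):

```python
def tone_counts(p2v, thr=50.0, n_bins_4k=64):
    """返回 (Ntonal, Ntonal_lf, ratio_Ntonal_lf)。"""
    ntonal = sum(1 for v in p2v if v > thr)             # 0~8kHz 音调频点数
    ntonal_lf = sum(1 for v in p2v[:n_bins_4k] if v > thr)  # 0~4kHz 音调频点数
    ratio = ntonal_lf / ntonal if ntonal else 0.0
    return ntonal, ntonal_lf, ratio
```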
参考图 6, 音频信号分类方法的另一个实施例包括:
S601: 将输入音频信号进行分帧处理;
S602: 获得当前音频帧的频谱波动、 频谱高频带峰度、 频谱相关度和线性 预测残差能量倾斜度;
频谱波动 flux 表示信号频谱的短时或长时能量波动, 为当前音频帧与历史帧在中低频带频谱上对应频率的对数能量差的绝对值的均值; 其中历史帧指当前音频帧之前的任意一帧。频谱高频带峰度 ph 表示当前音频帧频谱在高频带上的峰度或能量锐度。频谱相关度 cor_map_sum 表示信号谐波结构在相邻帧间的稳定度。线性预测残差能量倾斜度 epsP_tilt 表示输入音频信号的线性预测残差能量随线性预测阶数的升高而变化的程度。这几个参数的具体计算方法参照前文实施例。
进一步的, 可以获得浊音度参数; 浊音度参数 voicing表示当前音频帧与 一个基音周期之前的信号的时域相关度。 浊音度参数 voicing是通过线性预测 分析得到的, 代表当前音频帧与一个基音周期之前的信号的时域相关度, 取值 在 0~1之间。 由于属现有技术, 本发明不做详述。 本实施例中当前音频帧的两 个子帧各计算一个 voicing, 求平均得到当前音频帧的 voicing参数。 当前音频 帧的 voicing参数也被緩存在一个 voicing历史 buffer中,本实施例中 voicing 历史 buffer的长度为 10。
S603: 分别将所述频谱波动、频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度存储于对应的存储器;
可选的, 在存储这几个参数之前, 还包括:
一个实施例, 根据所述当前音频帧的声音活动性, 确定是否将所述频谱波 动存储频谱波动存储器中。 若当前音频帧为活动帧, 则将当前音频帧的频谱波 动存储于频谱波动存储器中。 另一个实施例, 根据音频帧的声音活动性和音频帧是否为能量冲击, 确定 是否将所述频谱波动存储于存储器中。 若当前音频帧为活动帧, 且当前音频帧 不属于能量冲击, 则将当前音频帧的频谱波动存储于频谱波动存储器中; 另一 个实施例中, 若当前音频帧为活动帧, 且包含当前音频帧与其历史帧在内的多 个连续帧都不属于能量冲击, 则将音频帧的频谱波动存储于频谱波动存储器中; 否则不存储。 例如, 当前音频帧为活动帧, 且当前音频帧其前一帧以及历史第 二帧都不属于能量冲击, 则将音频帧的频谱波动存储于频谱波动存储器中; 否 则不存储。
声音活动性标识 vad_flag 和声音冲击标识 attack_flag 的定义和获得方式参照前述实施例的描述。
可选的, 在存储这些参数之前, 还包括:
根据所述当前音频帧的声音活动性, 确定是否将频谱高频带峰度、 频谱相 关度和线性预测残差能量倾斜度存储于存储器中; 如果当前音频帧为活动帧, 则存储上述参数; 否则不存储。
S604 : 分别获得存储的频谱波动、 频谱高频带峰度、 频谱相关度和线性预 测残差能量倾斜度中有效数据的统计量, 根据所述有效数据的统计量将所述音 频帧分类为语音帧或者音乐帧; 所述有效数据的统计量指对存储器中存储的有 效数据运算操作后获得的数据值, 运算操作可以包括求均值, 求方差等操作。
可选的, 在步骤 S604之前, 还可以包括:
根据所述当前音频帧是否为敲击音乐, 更新频谱波动存储器中存储的频谱 波动; 一个实施例中, 若当前音频帧为敲击音乐, 将频谱波动存储器中有效的 频谱波动值修改为小于等于音乐阈值的一个值, 其中当音频帧的频谱波动小于 该音乐阈值时该音频被分类为音乐帧。 一个实施例中, 若当前音频帧为敲击音 乐, 则将频谱波动存储器中有效的频谱波动值重置为 5。
可选的, 在步骤 S604之前, 还可以包括:
根据当前音频帧的历史帧的活动性, 更新存储器中的频谱波动。 一个实施 例中, 如果确定当前音频帧的频谱波动存储于频谱波动存储器中, 且前一帧音 频帧为非活动帧, 则将频谱波动存储器中已存储的除当前音频帧的频谱波动之 外的其他频谱波动的数据修改为无效数据。 另一个实施例中, 如果确定当前音 频帧的频谱波动存储于频谱波动存储器中, 且当前音频帧之前连续三帧不全都 为活动帧, 则将当前音频帧的频谱波动修正为第一值。 第一值可以为语音阈值, 其中当音频帧的频谱波动大于该语音阈值时该音频被分类为语音帧。 另一个实 施例中, 如果确定当前音频帧的频谱波动存储于频谱波动存储器中, 且历史帧 的分类结果为音乐帧且当前音频帧的频谱波动大于第二值, 则将当前音频帧的 频语波动修正为第二值, 其中, 第二值大于第一值。
例如, 如果当前音频帧前一帧为非活动帧 (vad_flag=0), 则除被新缓存入 flux 历史 buffer 的当前音频帧 flux 以外, 其余 flux 历史 buffer 中的数据全部重置为 -1 (等价于将这些数据无效化); 如果当前音频帧之前连续三帧不全都为活动帧 (vad_flag=1), 则将刚缓存入 flux 历史 buffer 的当前音频帧 flux 修正为 16; 如果当前音频帧之前连续三帧都为活动帧 (vad_flag=1), 且历史的信号分类结果长时平滑结果为音乐信号且当前音频帧 flux 大于 20, 则将缓存的当前音频帧的频谱波动修改为 20。其中, 活动帧以及历史的信号分类结果长时平滑结果的计算可以参考前述实施例。
一个实施例中, 步骤 S604包括: 分别获得存储的频谱波动有效数据的均值, 频谱高频带峰度有效数据的均 值, 频语相关度有效数据的均值和线性预测残差能量倾斜度有效数据的方差; 当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当 前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一阈值; 或者频 谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关度有效数据的 均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方差小于第四阈 值。
一般的, 音乐帧的频谱波动值较小, 而语音帧的频谱波动值较大; 音乐帧的频谱高频带峰度值较大, 语音帧的频谱高频带峰度较小; 音乐帧的频谱相关度的值较大, 语音帧的频谱相关度值较小; 音乐帧的线性预测残差能量倾斜度值较小, 而语音帧的线性预测残差能量倾斜度值较大。因此可以根据上述参数的统计量对当前音频帧进行分类。当然还可以采用其他分类方法对该当前音频帧进行信号分类。例如, 统计频谱波动存储器中存储的频谱波动的有效数据的数量; 根据该有效数据的数量, 将存储器由近端到远端划分出至少两个不同长度的区间, 获得每个区间对应的频谱波动的有效数据的均值、频谱高频带峰度有效数据的均值、频谱相关度有效数据的均值和线性预测残差能量倾斜度有效数据的方差; 其中, 所述区间的起点为当前帧频谱波动的存储位置, 近端为存储有当前帧频谱波动的一端, 远端为存储有历史帧频谱波动的一端; 根据较短区间内的上述参数的有效数据的统计量对所述音频帧进行分类, 若此区间内的参数统计量足够区分出所述音频帧的类型则分类过程结束, 否则在其余较长区间中最短的区间内继续分类过程, 并以此类推。在每个区间的分类过程中, 根据每一个区间对应的分类阈值, 对所述当前音频帧进行分类, 当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方差小于第四阈值。
在信号分类后, 可以对不同的信号采用不同的编码模式进行编码。 例如, 语音信号采用基于语音产生模型的编码器(如 CELP )进行编码, 对音乐信号采 用基于变换的编码器(如基于 MDCT的编码器)进行编码。
本实施例中, 根据频谱波动、 频谱高频带峰度、 频谱相关度和线性预测残 差能量倾斜度的长时统计量进行分类, 同时兼顾了分类的鲁棒性和分类的识别 速度, 分类参数较少但结果较为准确, 识别率较高且复杂度较低。
一个实施例中, 在将上述频谱波动 flux、频谱高频带峰度 ph、频谱相关度 cor_map_sum 和线性预测残差能量倾斜度 epsP_tilt 存储于对应的存储器之后, 可以根据存储的频谱波动的有效数据的数量, 采用不同判断流程进行分类。如果声音活动性标识置为 1, 即当前音频帧为活动声音帧, 则检查存储的频谱波动的有效数据的个数 N。
存储器中存储的频谱波动中有效数据的个数 N的值不同, 判断流程也不同:
(1) 参考图 7, 若 N=60, 则分别获得 flux 历史 buffer 中全部数据的均值, 记为 flux60, 近端 30 个数据的均值, 记为 flux30, 近端 10 个数据的均值, 记为 flux10。分别获得 ph 历史 buffer 中全部数据的均值, 记为 ph60, 近端 30 个数据的均值, 记为 ph30, 近端 10 个数据的均值, 记为 ph10。分别获得 cor_map_sum 历史 buffer 中全部数据的均值, 记为 cor_map_sum60, 近端 30 个数据的均值, 记为 cor_map_sum30, 近端 10 个数据的均值, 记为 cor_map_sum10。并分别获得 epsP_tilt 历史 buffer 中全部数据的方差, 记为 epsP_tilt60, 近端 30 个数据的方差, 记为 epsP_tilt30, 近端 10 个数据的方差, 记为 epsP_tilt10。获得 voicing 历史 buffer 中数值大于 0.9 的数据的个数 voicing_cnt。其中, 近端为存储有当前音频帧对应的上述参数的一端。
首先检查 flux10, ph10, epsP_tilt10, cor_map_sum10, voicing_cnt 是否满足条件: flux10<10 或 epsP_tilt10<0.0001 或 ph10>1050 或 cor_map_sum10>95, 并且 voicing_cnt<6, 若满足, 则将当前音频帧分类为音乐类型 (即 Mode=1)。否则, 检查 flux10 是否大于 15 且 voicing_cnt 是否大于 2, 或者 flux10 是否大于 16, 若满足, 则将当前音频帧分类为语音类型 (即 Mode=0)。否则, 检查 flux30, flux10, ph30, epsP_tilt30, cor_map_sum30, voicing_cnt 是否满足条件: flux30<13 且 flux10<15, 或 epsP_tilt30<0.001 或 ph30>800 或 cor_map_sum30>75, 若满足, 则将当前音频帧分类为音乐类型。否则, 检查 flux60, flux30, ph60, epsP_tilt60, cor_map_sum60 是否满足条件: flux60<14.5 或 cor_map_sum30>75 或 ph60>770 或 epsP_tilt10<0.002, 并且 flux30<14。若满足, 则将当前音频帧分类为音乐类型, 否则分类为语音类型。
(2) 参考图 8, 如果 N<60 且 N>=30, 则分别获得 flux 历史 buffer、ph 历史 buffer 和 cor_map_sum 历史 buffer 中近端 N 个数据的均值, 记为 fluxN, phN, cor_map_sumN, 并同时得到 epsP_tilt 历史 buffer 中近端 N 个数据的方差, 记为 epsP_tiltN。检查 fluxN, phN, epsP_tiltN, cor_map_sumN 是否满足条件: fluxN<13+(N-30)/20 或 cor_map_sumN>75+(N-30)/6 或 phN>800 或 epsP_tiltN<0.001。若满足, 则将当前音频帧分类为音乐类型, 否则为语音类型。
(3) 参考图 9, 如果 N<30 且 N>=10, 则分别获得 flux 历史 buffer、ph 历史 buffer 和 cor_map_sum 历史 buffer 中近端 N 个数据的均值, 记为 fluxN, phN 和 cor_map_sumN, 并同时得到 epsP_tilt 历史 buffer 中近端 N 个数据的方差, 记为 epsP_tiltN。首先检查历史分类结果的长时滑动平均 mode_mov 是否大于 0.8。若是, 则检查 fluxN, phN, epsP_tiltN, cor_map_sumN 是否满足条件: fluxN<16+(N-10)/20 或 phN>1000-12.5×(N-10) 或 epsP_tiltN<0.0005+0.000045×(N-10) 或 cor_map_sumN>90-(N-10)。否则, 获得 voicing 历史 buffer 中数值大于 0.9 的数据的个数 voicing_cnt, 并检查是否满足条件: fluxN<12+(N-10)/20 或 phN>1050-12.5×(N-10) 或 epsP_tiltN<0.0001+0.000045×(N-10) 或 cor_map_sumN>95-(N-10), 并且 voicing_cnt<6。如果满足上面两组条件中的任一组, 则将当前音频帧分类为音乐类型, 否则为语音类型。
(4) 参考图 10, 如果 N<10 且 N>5, 则分别获得 ph 历史 buffer、cor_map_sum 历史 buffer 中近端 N 个数据的均值, 记为 phN 和 cor_map_sumN, 以及 epsP_tilt 历史 buffer 中近端 N 个数据的方差, 记为 epsP_tiltN。同时获得 voicing 历史 buffer 中近端 6 个数据中数值大于 0.9 的数据的个数 voicing_cnt6。

检查是否满足条件: epsP_tiltN<0.00008 或 phN>1100 或 cor_map_sumN>100, 并且 voicing_cnt6<4。若满足, 则将当前音频帧分类为音乐类型, 否则为语音类型。
(5) 如果 N<=5, 则以前一音频帧的分类结果作为当前音频帧的分类类型。上述实施例为根据频谱波动、频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度的长时统计量进行分类的一种具体分类流程, 本领域技术人员可以理解的是, 可以使用别的流程进行分类。本实施例中的分类流程可以应用在前述实施例中的对应步骤, 例如作为图 2 的步骤 S103、图 4 的步骤 S105 或者图 6 中的步骤 S604 的具体分类方法。
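按有效数据个数 N 选择判决分支的总体结构可以用如下 Python 代码示意 (将各分支的具体判决作为函数参数传入仅为结构示意, 并非规范实现):

```python
def classify_with_history(N, prev_mode, classify_full, classify_partial):
    """按 flux 有效数据个数 N 选择判决分支; N<=5 时沿用前一帧分类结果。
    classify_full 对应 N=60 的分支, classify_partial 对应 5<N<60 的各分支。"""
    if N >= 60:
        return classify_full()
    if N > 5:
        return classify_partial(N)
    return prev_mode
```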
参考图 11, 一种音频信号分类方法的另一个实施例包括:
S1101: 将输入音频信号进行分帧处理;
S1102: 获得当前音频帧的线性预测残差能量倾斜度、 频谱音调个数和频谱 音调个数在低频带上的比率;
线性预测残差能量倾斜度 epsP_tilt 表示输入音频信号的线性预测残差能量随线性预测阶数的升高而变化的程度; 频谱音调个数 Ntonal 表示当前音频帧中的 0~8kHz 频带上频点峰值大于预定值的频点个数; 频谱音调个数在低频带上的比率 ratio_Ntonal_lf 表示低频带音调个数与频谱音调个数的比值。具体计算参照前述实施例的描述。
S1103: 分别将线性预测残差能量倾斜度 epsP_tilt、频谱音调个数和频谱音调个数在低频带上的比率存储到对应的存储器中; 当前音频帧的线性预测残差能量倾斜度 epsP_tilt、频谱音调个数各自被缓存入各自的历史 buffer 中, 本实施例中这两个 buffer 的长度也均为 60。
可选的, 在存储这些参数之前, 还包括: 根据所述当前音频帧的声音活动性, 确定是否将所述线性预测残差能量倾斜度、频谱音调个数和频谱音调个数在低频带上的比率存储于存储器中; 并在确定需要存储时将上述参数存储于存储器中。如果当前音频帧为活动帧, 则存储上述参数; 否则不存储。
S1104 : 分别获得存储的线性预测残差能量倾斜度的统计量、 频谱音调个数 的统计量; 所述统计量指对存储器中存储的数据运算操作后获得的数据值, 运 算操作可以包括求均值, 求方差等操作。
一个实施例中, 分别获得存储的线性预测残差能量倾斜度的统计量、 频谱 音调个数的统计量包括: 获得存储的线性预测残差能量倾斜度的方差; 获得存 储的频谱音调个数的均值。
S1105 : 根据所述线性预测残差能量倾斜度的统计量、 频谱音调个数的统计 量和频谱音调个数在低频带上的比率, 将所述音频帧分类为语音帧或者音乐帧; 一个实施例中, 该步骤包括: 当当前音频帧为活动帧, 且满足下列条件之一, 则将所述当前音频帧分类 为音乐帧, 否则将所述当前音频帧分类为语音帧: 线性预测残差能量倾斜度的方差小于第五阈值; 或 频谱音调个数的均值大于第六阈值; 或 频谱音调个数在低频带上的比率小于第七阈值。
一般的, 音乐帧的线性预测残差能量倾斜度值较小, 而语音帧的线性预测残差能量倾斜度值较大; 音乐帧的频谱音调个数较多, 而语音帧的频谱音调个数较少; 音乐帧的频谱音调个数在低频带上的比率较低, 而语音帧的频谱音调个数在低频带上的比率较高 (语音帧的能量主要集中在低频带上)。因此可以根据上述参数的统计量对当前音频帧进行分类。当然还可以采用其他分类方法对该当前音频帧进行信号分类。
在信号分类后, 可以对不同的信号采用不同的编码模式进行编码。 例如, 语音信号采用基于语音产生模型的编码器(如 CELP)进行编码, 对音乐信号采 用基于变换的编码器(如基于 MDCT的编码器)进行编码。
上述实施例中, 根据线性预测残差能量倾斜度、 频谱音调个数的长时统计 量和频谱音调个数在低频带上的比率对音频信号进行分类, 参数较少, 识别率 较高且复杂度较低。
一个实施例中, 分别将线性预测残差能量倾斜度 epsP_tilt、频谱音调个数 Ntonal 和频谱音调个数在低频带上的比率 ratio_Ntonal_lf 存储到对应的 buffer 后, 获得 epsP_tilt 历史 buffer 中所有数据的方差, 记为 epsP_tilt60。获得 Ntonal 历史 buffer 中所有数据的均值, 记为 Ntonal60。获得 Ntonal_lf 历史 buffer 中所有数据的均值, 并计算该均值与 Ntonal60 的比, 记为 ratio_Ntonal_lf60。参考图 12, 根据如下法则进行当前音频帧的分类:

如果声音活动性标识为 1 (即 vad_flag=1), 即当前音频帧为活动声音帧, 则检查是否满足条件: epsP_tilt60<0.002 或 Ntonal60>18 或 ratio_Ntonal_lf60<0.42, 若满足, 则将当前音频帧分类为音乐类型 (即 Mode=1), 否则为语音类型 (即 Mode=0)。
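图 12 所示的判决法则可以用如下 Python 代码示意 (函数名为示意性假设, 阈值沿用上文给出的数值):

```python
def classify_active_frame(epsP_tilt60, Ntonal60, ratio_Ntonal_lf60):
    """vad_flag=1 时的判决: 任一条件满足判为音乐 (Mode=1), 否则语音 (Mode=0)。"""
    if epsP_tilt60 < 0.002 or Ntonal60 > 18 or ratio_Ntonal_lf60 < 0.42:
        return 1
    return 0
```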
上述实施例为根据线性预测残差能量倾斜度的统计量、频谱音调个数的统计量和频谱音调个数在低频带上的比率进行分类的一种具体分类流程, 本领域技术人员可以理解的是, 可以使用别的流程进行分类。本实施例中的分类流程可以应用在前述实施例中的对应步骤, 例如作为图 5 的步骤 S504 或图 11 的步骤 S1105 的具体分类方法。
本发明是一种低复杂度低内存开销的音频编码模式选择方法。 同时兼顾了 分类的鲁棒性和分类的识别速度。 与上述方法实施例相关联, 本发明还提供一种音频信号分类装置, 该装置 可以位于终端设备, 或网络设备中。 该音频信号分类装置可以执行上述方法实 施例的步骤。
参考图 1 3 , 本发明一种音频信号的分类装置的一个实施例, 用于对输入的 音频信号进行分类, 其包括:
存储确认单元 1 301 , 用于根据所述当前音频帧的声音活动性, 确定是否获 得并存储当前音频帧的频谱波动, 其中, 所述频谱波动表示音频信号的频谱的 能量波动;
存储器 1 302 , 用于在存储确认单元输出需要存储的结果时存储所述频谱波 动;
更新单元 1 303 , 用于根据语音帧是否为敲击音乐或历史音频帧的活动性, 更新存储器中存储的频谱波动;
分类单元 1 304 , 用于根据存储器中存储的频谱波动的部分或全部有效数据 的统计量, 将所述当前音频帧分类为语音帧或者音乐帧。 当频谱波动的有效数 据的统计量满足语音分类条件时, 将所述当前音频帧分类为语音帧; 当频谱波 动的有效数据的统计量满足音乐分类条件时, 将所述当前音频帧分类为音乐帧。
一个实施例中, 存储确认单元具体用于: 确认当前音频帧为活动帧时, 输 出需要存储当前音频帧的频谱波动的结果。
另一个实施例中, 存储确认单元具体用于: 确认当前音频帧为活动帧, 且 当前音频帧不属于能量冲击时, 输出需要存储当前音频帧的频谱波动的结果。
另一个实施例中, 存储确认单元具体用于: 确认当前音频帧为活动帧, 且 包含当前音频帧与其历史帧在内的多个连续帧都不属于能量冲击时, 输出需要 存储当前音频帧的频谱波动的结果。
一个实施例中, 更新单元具体用于若当前音频帧属于敲击音乐, 则修改频 谱波动存储器中已存储的频谱波动的值。
另一个实施例中, 更新单元具体用于: 如果当前音频帧为活动帧, 且前一帧音频帧为非活动帧时, 则将存储器中已存储的除当前音频帧的频谱波动之外的其他频谱波动的数据修改为无效数据; 或, 如果当前音频帧为活动帧, 且当前音频帧之前连续三帧不全都为活动帧时, 则将当前音频帧的频谱波动修正为第一值; 或, 如果当前音频帧为活动帧, 且历史分类结果为音乐信号且当前音频帧的频谱波动大于第二值, 则将当前音频帧的频谱波动修正为第二值, 其中, 第二值大于第一值。
参考图 14, 一个实施例中, 分类单元 1303包括: 计算单元 1401 , 用于获得存储器中存储的频谱波动的部分或全部有效数据 的均值;
判断单元 1402, 用于将所述频谱波动的有效数据的均值与音乐分类条件做 比较, 当所述频谱波动的有效数据的均值满足音乐分类条件时, 将所述当前音 频帧分类为音乐帧; 否则将所述当前音频帧分类为语音帧。
例如, 当所获得的频谱波动的有效数据的均值小于音乐分类阈值时, 将所 述当前音频帧分类为音乐帧; 否则将所述当前音频帧分类为语音帧。
上述实施例, 由于根据频谱波动的长时统计量对音频信号进行分类, 参数 较少, 识别率较高且复杂度较低; 同时考虑声音活动性和敲击音乐的因素对频 谱波动进行调整, 对音乐信号识别率更高, 适合混合音频信号分类。
另一个实施例中, 音频信号分类装置还包括:
参数获得单元, 用于获得当前音频帧的频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度; 其中, 频谱高频带峰度表示当前音频帧的频谱在高频带上的峰度或能量锐度; 频谱相关度表示当前音频帧的信号谐波结构在相邻帧间的稳定度; 线性预测残差能量倾斜度表示音频信号的线性预测残差能量随线性预测阶数的升高而变化的程度; 该存储确认单元还用于, 根据所述当前音频帧的声音活动性, 确定是否存储所述频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度;

该存储单元还用于, 当存储确认单元输出需要存储的结果时存储所述频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度;

该分类单元具体用于, 分别获得存储的频谱波动、频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度中有效数据的统计量, 根据所述有效数据的统计量将所述音频帧分类为语音帧或者音乐帧。当频谱波动的有效数据的统计量满足语音分类条件时, 将所述当前音频帧分类为语音帧; 当频谱波动的有效数据的统计量满足音乐分类条件时, 将所述当前音频帧分类为音乐帧。
一个实施例中, 该分类单元具体包括:
计算单元, 用于分别获得存储的频谱波动有效数据的均值, 频谱高频带峰 度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有 效数据的方差;
判断单元, 用于当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一 阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关 度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方 差小于第四阈值。
上述实施例中, 根据频谱波动、 频谱高频带峰度、 频谱相关度和线性预测 残差能量倾斜度的长时统计量对音频信号进行分类, 参数较少, 识别率较高且 复杂度较低; 同时考虑声音活动性和敲击音乐的因素对频谱波动进行调整, 根 据当前音频帧所处信号环境, 对频谱波动进行修正, 提高分类识别率, 适合混 合音频信号分类。
参考图 15 , 本发明一种音频信号的分类装置的另一个实施例, 用于对输入 的音频信号进行分类, 其包括:
分帧单元 1501 , 用于对输入音频信号进行分帧处理;
参数获得单元 1502, 用于获得当前音频帧的线性预测残差能量倾斜度; 其 中, 线性预测残差能量倾斜度表示音频信号的线性预测残差能量随线性预测阶 数的升高而变化的程度;
存储单元 1503 , 用于存储线性预测残差能量倾斜度;
分类单元 1504,用于根据存储器中预测残差能量倾斜度部分数据的统计量, 对所述音频帧进行分类。 参考图 16, 音频信号的分类装置还包括:
存储确认单元 1505 , 用于根据所述当前音频帧的声音活动性, 确定是否将 所述线性预测残差能量倾斜度存储于存储器中;
则该存储单元 1503 具体用于, 当存储确认单元确认需要存储时, 将所述线性预测残差能量倾斜度存储于存储器中。
一个实施例中, 预测残差能量倾斜度部分数据的统计量为预测残差能量倾 斜度部分数据的方差;
所述分类单元具体用于将预测残差能量倾斜度部分数据的方差与音乐分类 阈值相比较, 当所述预测残差能量倾斜度部分数据的方差小于音乐分类阈值时, 将所述当前音频帧分类为音乐帧; 否则将所述当前音频帧分类为语音帧。
另一个实施例中, 参数获得单元还用于: 获得当前音频帧的频谱波动、 频 谱高频带峰度和频谱相关度, 并存储于对应的存储器中;
则该分类单元具体用于: 分别获得存储的频谱波动、 频谱高频带峰度、 频 谱相关度和线性预测残差能量倾斜度中有效数据的统计量, 根据所述有效数据 的统计量将所述音频帧分类为语音帧或者音乐帧; 所述有效数据的统计量指对 存储器中存储的有效数据运算操作后获得的数据值。
参考图 17, 具体的, 一个实施例中, 分类单元 1504包括:
计算单元 1701 , 用于分别获得存储的频谱波动有效数据的均值, 频谱高频 带峰度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜 度有效数据的方差;
判断单元 1702, 用于当下列条件之一满足时, 将所述当前音频帧分类为音 乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小 于第一阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频 谱相关度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数 据的方差小于第四阈值。
另一个实施例中, 参数获得单元还用于: 获得当前音频帧的频谱音调个数 和频谱音调个数在低频带上的比率, 并存储于存储器;
则该分类单元具体用于: 分别获得存储的线性预测残差能量倾斜度的统计 量、 频谱音调个数的统计量; 根据所述线性预测残差能量倾斜度的统计量、 频 谱音调个数的统计量和频谱音调个数在低频带上的比率, 将所述音频帧分类为 语音帧或者音乐帧; 所述有效数据的统计量指对存储器中存储的数据运算操作 后获得的数据值。
具体的该分类单元包括:
计算单元, 用于获得线性预测残差能量倾斜度有效数据的方差和存储的频 谱音调个数的均值;
判断单元, 用于当当前音频帧为活动帧, 且满足下列条件之一, 则将所述 当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 线性预测残 差能量倾斜度的方差小于第五阈值; 或频谱音调个数的均值大于第六阈值; 或 频谱音调个数在低频带上的比率小于第七阈值。
具体的, 参数获得单元根据下列公式计算当前音频帧的线性预测残差能量 倾斜度:
epsP_tilt = Σ_{i=1}^{n} epsP(i)·epsP(i+1) / Σ_{i=1}^{n} epsP(i)·epsP(i)

其中, epsP(i) 表示当前音频帧第 i 阶线性预测的预测残差能量; n 为正整数, 表示线性预测的阶数, 其小于等于线性预测的最大阶数。
具体的, 该参数获得单元用于统计当前音频帧在 0 ~ 8kHz频带上频点峰值 大于预定值的频点数量作为频谱音调个数; 所述参数获得单元用于计算当前音 频帧在 0 ~ 4kHz频带上频点峰值大于预定值的频点数量与 0 ~ 8kHz频带上频点 峰值大于预定值的频点数量的比值, 作为频谱音调个数在低频带上的比率。
本实施例中, 根据线性预测残差能量倾斜度的长时统计量对音频信号进行 分类, 同时兼顾了分类的鲁棒性和分类的识别速度, 分类参数较少但结果较为 准确, 复杂度低、 内存开销低。
本发明一种音频信号的分类装置的另一个实施例, 用于对输入的音频信号 进行分类, 其包括:
分帧单元, 用于将输入音频信号进行分帧处理;
参数获得单元, 用于获得当前音频帧的频谱波动、 频谱高频带峰度、 频谱 相关度和线性预测残差能量倾斜度; 其中, 频谱波动表示音频信号的频谱的能 量波动, 频谱高频带峰度表示当前音频帧的频谱在高频带上的峰度或能量锐度; 频谱相关度表示当前音频帧的信号谐波结构在相邻帧间的稳定度; 线性预测残 差能量倾斜度表示音频信号的线性预测残差能量随线性预测阶数的升高而变化 的程度;
存储单元, 用于存储频谱波动、 频谱高频带峰度、 频谱相关度和线性预测 残差能量倾斜度;
分类单元, 用于分别获得存储的频谱波动、 频谱高频带峰度、 频谱相关度 和线性预测残差能量倾斜度中有效数据的统计量, 根据有效数据的统计量将所 述音频帧分类为语音帧或者音乐帧; 其中, 所述有效数据的统计量指对存储器 中存储的有效数据运算操作后获得的数据值, 运算操作可以包括求均值, 求方 差等操作。
一个实施例中, 音频信号的分类装置还可以包括:
存储确认单元, 用于根据所述当前音频帧的声音活动性, 确定是否存储当 前音频帧的频谱波动、 频谱高频带峰度、 频谱相关度和线性预测残差能量倾斜 度;
存储单元, 具体用于当存储确认单元输出需要存储的结果时, 存储频谱波动、频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度。
具体的, 一个实施例中, 存储确认单元根据所述当前音频帧的声音活动性, 确定是否将所述频谱波动存储频谱波动存储器中。 如果当前音频帧为活动帧, 则存储确认单元输出存储上述参数的结果; 否则输出不需要存储的结果。 另一 个实施例中, 存储确认单元根据音频帧的声音活动性和音频帧是否为能量冲击, 确定是否将所述频谱波动存储于存储器中。 若当前音频帧为活动帧, 且当前音 频帧不属于能量冲击, 则将当前音频帧的频谱波动存储于频谱波动存储器中; 另一个实施例中, 若当前音频帧为活动帧, 且包含当前音频帧与其历史帧在内 的多个连续帧都不属于能量冲击, 则将音频帧的频谱波动存储于频谱波动存储 器中; 否则不存储。 例如, 当前音频帧为活动帧, 且当前音频帧其前一帧以及 历史第二帧都不属于能量冲击, 则将音频帧的频谱波动存储于频谱波动存储器 中; 否则不存储。
一个实施例中, 分类单元包括:
计算单元, 用于分别获得存储的频谱波动有效数据的均值, 频谱高频带峰 度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有 效数据的方差;
判断单元, 用于当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一 阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关 度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方 差小于第四阈值。
当前音频帧的频谱波动、频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度的具体计算方式, 可以参照上述方法实施例。
进一步的, 该音频信号的分类装置还可以包括:
更新单元, 用于根据语音帧是否为敲击音乐或历史音频帧的活动性, 更新 存储器中存储的频谱波动。 一个实施例中, 更新单元具体用于若当前音频帧属 于敲击音乐, 则修改频谱波动存储器中已存储的频谱波动的值。 另一个实施例 中, 更新单元具体用于: 如果当前音频帧为活动帧, 且前一帧音频帧为非活动 帧时, 则将存储器中已存储的除当前音频帧的频谱波动之外的其他频谱波动的 数据修改为无效数据; 或, 如果当前音频帧为活动帧, 且当前音频帧之前连续 三帧不全都为活动帧时, 则将当前音频帧的频谱波动修正为第一值; 或, 如果 当前音频帧为活动帧, 且历史分类结果为音乐信号且当前音频帧的频谱波动大 于第二值, 则将当前音频帧的频谱波动修正为第二值, 其中, 第二值大于第一 值。
本实施例中, 根据频谱波动、 频谱高频带峰度、 频谱相关度和线性预测残 差能量倾斜度的长时统计量进行分类, 同时兼顾了分类的鲁棒性和分类的识别 速度, 分类参数较少但结果较为准确, 识别率较高且复杂度较低。
本发明一种音频信号的分类装置的另一个实施例, 用于对输入的音频信号 进行分类, 其包括:
分帧单元, 用于对输入音频信号进行分帧处理;
参数获得单元, 用于获得当前音频帧的线性预测残差能量倾斜度、频谱音调个数和频谱音调个数在低频带上的比率; 其中, 线性预测残差能量倾斜度 epsP_tilt 表示输入音频信号的线性预测残差能量随线性预测阶数的升高而变化的程度; 频谱音调个数 Ntonal 表示当前音频帧中的 0~8kHz 频带上频点峰值大于预定值的频点个数; 频谱音调个数在低频带上的比率 ratio_Ntonal_lf 表示低频带音调个数与频谱音调个数的比值。具体计算参照前述实施例的描述。
存储单元, 用于存储线性预测残差能量倾斜度、 频谱音调个数和频谱音调 个数在低频带上的比率; 分类单元, 用于分别获得存储的线性预测残差能量倾斜度的统计量、 频谱 音调个数的统计量; 根据所述线性预测残差能量倾斜度的统计量、 频谱音调个 数的统计量和频谱音调个数在低频带上的比率, 将所述音频帧分类为语音帧或 者音乐帧; 所述有效数据的统计量指对存储器中存储的数据运算操作后获得的 数据值。
具体的, 该分类单元包括:
计算单元, 用于获得线性预测残差能量倾斜度有效数据的方差和存储的频 谱音调个数的均值;
判断单元, 用于当当前音频帧为活动帧, 且满足下列条件之一, 则将所述 当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 线性预测残 差能量倾斜度的方差小于第五阈值; 或频谱音调个数的均值大于第六阈值; 或 频谱音调个数在低频带上的比率小于第七阈值。
具体的, 参数获得单元根据下列公式计算当前音频帧的线性预测残差能量倾斜度:

epsP_tilt = Σ_{i=1}^{n} epsP(i)·epsP(i+1) / Σ_{i=1}^{n} epsP(i)·epsP(i)

其中, epsP(i) 表示当前音频帧第 i 阶线性预测的预测残差能量; n 为正整数, 表示线性预测的阶数, 其小于等于线性预测的最大阶数。
具体的, 该参数获得单元用于统计当前音频帧在 0 ~ 8kHz频带上频点峰值 大于预定值的频点数量作为频谱音调个数; 所述参数获得单元用于计算当前音 频帧在 0 ~ 4kHz频带上频点峰值大于预定值的频点数量与 0 ~ 8kHz频带上频点 峰值大于预定值的频点数量的比值, 作为频谱音调个数在低频带上的比率。
上述实施例中, 根据线性预测残差能量倾斜度、 频谱音调个数的长时统计 量和频谱音调个数在低频带上的比率对音频信号进行分类, 参数较少, 识别率 较高且复杂度较低。
上述音频信号的分类装置可以与不同的编码器相连接, 对不同的信号采用 不同的编码器进行编码。 例如, 音频信号的分类装置分别与两个编码器连接, 对语音信号采用基于语音产生模型的编码器(如 CELP )进行编码, 对音乐信号 采用基于变换的编码器(如基于 MDCT的编码器)进行编码。 上述装置实施例 中的各个具体参数的定义和获得方法可以参照方法实施例的相关描述。
与上述方法实施例相关联, 本发明还提供一种音频信号分类装置, 该装置可以位于终端设备, 或网络设备中。该音频信号分类装置可以由硬件电路来实现, 或者由软件配合硬件来实现。例如, 参考图 18, 由一个处理器调用音频信号分类装置来实现对音频信号的分类。该音频信号分类装置可以执行上述方法实施例中的各种方法和流程。该音频信号分类装置的具体模块和功能可以参照上述装置实施例的相关描述。图 19 的设备 1900 的一个例子是编码器。设备 1900 包括处理器 1910 和存储器 1920。

存储器 1920 可以包括随机存储器、闪存、只读存储器、可编程只读存储器、非易失性存储器或寄存器等。处理器 1910 可以是中央处理器 (Central Processing Unit, CPU)。

存储器 1920 用于存储可执行指令。处理器 1910 可以执行存储器 1920 中存储的可执行指令, 用于:
设备 1900的其它功能和操作可参照上面图 3至图 12的方法实施例的过程, 为了避免重复, 此处不再贅述。 本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程, 是可以通过计算机程序来指令相关的硬件来完成, 所述的程序可存储于一计算 机可读取存储介质中, 该程序在执行时, 可包括如上述各方法的实施例的流程。 其中, 所述的存储介质可为磁碟、 光盘、 只读存储记忆体(Read-Only Memory, ROM )或随机存储记忆体 ( Random Acces s Memory, RAM )等。 在本申请所提供的几个实施例中, 应该理解到, 所揭露的系统、 装置和方 法, 可以通过其它的方式实现。 例如, 以上所描述的装置实施例仅仅是示意性 的, 例如, 所述单元的划分, 仅仅为一种逻辑功能划分, 实际实现时可以有另 外的划分方式, 例如多个单元或组件可以结合或者可以集成到另一个系统, 或 一些特征可以忽略, 或不执行。 另一点, 所显示或讨论的相互之间的耦合或直 接耦合或通信连接可以是通过一些接口, 装置或单元的间接耦合或通信连接, 可以是电性, 机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的, 作为 单元显示的部件可以是或者也可以不是物理单元, 即可以位于一个地方, 或者 也可以分布到多个网络单元上。 可以根据实际的需要选择其中的部分或者全部 单元来实现本实施例方案的目的。
另外, 在本发明各个实施例中的各功能单元可以集成在一个处理单元中, 也可以是各个单元单独物理存在, 也可以两个或两个以上单元集成在一个单元 中。
以上所述仅为本发明的几个实施例, 本领域的技术人员依据申请文件公开 的可以对本发明进行各种改动或变型而不脱离本发明的精神和范围。

Claims

权利要求
1、 一种音频信号分类方法, 其特征在于, 包括:
根据当前音频帧的声音活动性, 确定是否获得当前音频帧的频谱波动并存 储于频谱波动存储器中, 其中, 所述频谱波动表示音频信号的频谱的能量波动; 根据音频帧是否为敲击音乐或历史音频帧的活动性, 更新频谱波动存储器 中存储的频谱波动;
根据频谱波动存储器中存储的频谱波动的部分或全部有效数据的统计量, 将所述当前音频帧分类为语音帧或者音乐帧。
2、 根据权利要求 1所述的方法, 其特征在于, 根据当前音频帧的声音活动 性, 确定是否获得当前音频帧的频谱波动并存储于频谱波动存储器中包括: 若当前音频帧为活动帧, 则将当前音频帧的频谱波动存储于频谱波动存储 器中。
3、 根据权利要求 1所述的方法, 其特征在于, 根据当前音频帧的声音活动 性, 确定是否获得当前音频帧的频谱波动并存储于频谱波动存储器中包括: 若当前音频帧为活动帧, 且当前音频帧不属于能量冲击, 则将当前音频帧 的频谱波动存储于频谱波动存储器中。
4、 根据权利要求 1所述的方法, 其特征在于, 根据当前音频帧的声音活动 性, 确定是否获得当前音频帧的频谱波动并存储于频谱波动存储器中包括: 若当前音频帧为活动帧, 且包含当前音频帧与其历史帧在内的多个连续帧 都不属于能量冲击, 则将音频帧的频谱波动存储于频谱波动存储器中。
5、 根据权利要求 1至 4所述的任一方法, 其特征在于, 根据所述当前音频 帧是否为敲击音乐, 更新频谱波动存储器中存储的频谱波动包括:
若当前音频帧属于敲击音乐, 则修改频谱波动存储器中已存储的频谱波动 的值。
6、 根据权利要求 1至 4所述的任一方法, 其特征在于, 根据所述历史音频 帧的活动性, 更新频谱波动存储器中存储的频谱波动包括:
如果确定当前音频帧的频谱波动存储于频谱波动存储器中, 且前一帧音频 帧为非活动帧, 则将频谱波动存储器中已存储的除当前音频帧的频谱波动之外 的其他频谱波动的数据修改为无效数据;
如果确定当前音频帧的频谱波动存储于频谱波动存储器中, 且当前音频帧 之前连续三帧历史帧不全都为活动帧, 则将当前音频帧的频谱波动修正为第一 值;
如果确定当前音频帧的频谱波动存储于频谱波动存储器中, 且历史分类结 果为音乐信号且当前音频帧的频谱波动大于第二值, 则将当前音频帧的频谱波 动修正为第二值, 其中, 第二值大于第一值。
7、 根据权利要求 1-6所述的任一方法, 其特征在于, 根据频谱波动存储器 中存储的频谱波动的部分或全部有效数据的统计量, 将所述当前音频帧分类为 语音帧或者音乐帧包括:
获得频谱波动存储器中存储的频谱波动的部分或全部有效数据的均值; 当所获得的频谱波动的有效数据的均值满足音乐分类条件时, 将所述当前 音频帧分类为音乐帧; 否则将所述当前音频帧分类为语音帧。
8、 根据权利要求 1-6所述的方法, 其特征在于, 还包括:
获得当前音频帧的频谱高频带峰度、 频谱相关度和线性预测残差能量倾斜 度; 其中, 频谱高频带峰度表示当前音频帧的频谱在高频带上的峰度或能量锐 度; 频谱相关度表示当前音频帧的信号谐波结构在相邻帧间的稳定度; 线性预 测残差能量倾斜度表示音频信号的线性预测残差能量随线性预测阶数的升高而 变化的程度;
根据所述当前音频帧的声音活动性, 确定是否将所述频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度存储于存储器中;
其中, 所述根据频谱波动存储器中存储的频谱波动的部分或全部数据的统 计量, 对所述音频帧进行分类包括:
分别获得存储的频谱波动有效数据的均值, 频谱高频带峰度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有效数据的方差; 当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方差小于第四阈值。
9、 一种音频信号的分类装置, 用于对输入的音频信号进行分类, 其特征在 于, 包括:
存储确认单元, 用于根据所述当前音频帧的声音活动性, 确定是否获得并 存储当前音频帧的频谱波动, 其中, 所述频谱波动表示音频信号的频谱的能量 波动;
存储器, 用于在存储确认单元输出需要存储的结果时存储所述频谱波动; 更新单元, 用于根据语音帧是否为敲击音乐或历史音频帧的活动性, 更新 存储器中存储的频谱波动;
分类单元, 用于根据存储器中存储的频谱波动的部分或全部有效数据的统 计量, 将所述当前音频帧分类为语音帧或者音乐帧。
10、 根据权利要求 9所述的装置, 其特征在于, 所述存储确认单元具体用 于: 确认当前音频帧为活动帧时, 输出需要存储当前音频帧的频谱波动的结果。
11、 根据权利要求 9 所述的装置, 其特征在于, 所述存储确认单元具体用 于: 确认当前音频帧为活动帧, 且当前音频帧不属于能量冲击时, 输出需要存 储当前音频帧的频谱波动的结果。
12、 根据权利要求 9所述的装置, 其特征在于, 所述存储确认单元具体用 于: 确认当前音频帧为活动帧, 且包含当前音频帧与其历史帧在内的多个连续 帧都不属于能量冲击时, 输出需要存储当前音频帧的频谱波动的结果。
13、 根据权利要求 9-12所述的任一装置, 其特征在于, 所述更新单元具体 用于若当前音频帧属于敲击音乐, 则修改频谱波动存储器中已存储的频谱波动 的值。
14、 根据权利要求 9-12所述的任一装置, 其特征在于, 所述更新单元具体 用于: 如果当前音频帧为活动帧, 且前一帧音频帧为非活动帧时, 则将存储器 中已存储的除当前音频帧的频谱波动之外的其他频谱波动的数据修改为无效数 据; 或 如果当前音频帧为活动帧, 且当前音频帧之前连续三帧不全都为活动帧时, 则将当前音频帧的频谱波动修正为第一值; 或
如果当前音频帧为活动帧, 且历史分类结果为音乐信号且当前音频帧的频 谱波动大于第二值, 则将当前音频帧的频谱波动修正为第二值, 其中, 第二值 大于第一值。
15、根据权利要求 9-14所述的任一装置, 其特征在于, 所述分类单元包括: 计算单元, 用于获得存储器中存储的频谱波动的部分或全部有效数据的均 值;
判断单元, 用于将所述频谱波动的有效数据的均值与音乐分类条件做比较, 当所述频谱波动的有效数据的均值满足音乐分类条件时, 将所述当前音频帧分 类为音乐帧; 否则将所述当前音频帧分类为语音帧。
16、 根据权利要求 9-14所述的任一装置, 其特征在于, 还包括:
参数获得单元, 用于获得当前音频帧的频谱高频带峰度、 频谱相关度、 浊 音度参数和线性预测残差能量倾斜度; 其中, 频谱高频带峰度表示当前音频帧 的频谱在高频带上的峰度或能量锐度; 频谱相关度表示当前音频帧的信号谐波 结构在相邻帧间的稳定度; 浊音度参数表示当前音频帧与一个基音周期之前的 信号的时域相关度; 线性预测残差能量倾斜度表示音频信号的线性预测残差能 量随线性预测阶数的升高而变化的程度;
所述存储确认单元还用于, 根据所述当前音频帧的声音活动性, 确定是否 将所述频谱高频带峰度、 频谱相关度和线性预测残差能量倾斜度存储于存储器 中;
所述存储单元还用于, 当存储确认单元输出需要存储的结果时存储所述频 谱高频带峰度、 频谱相关度和线性预测残差能量倾斜度;
所述分类单元具体用于, 分别获得存储的频谱波动、 频谱高频带峰度、 频 谱相关度和线性预测残差能量倾斜度中有效数据的统计量, 根据所述有效数据 的统计量将所述音频帧分类为语音帧或者音乐帧。
17、 根据权利要求 16所述的任一装置, 其特征在于, 所述分类单元包括: 计算单元, 用于分别获得存储的频谱波动有效数据的均值, 频谱高频带峰 度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有 效数据的方差;
判断单元, 用于当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一 阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关 度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方 差小于第四阈值。
18、 一种音频信号分类方法, 其特征在于, 包括:
将输入音频信号进行分帧处理; 获得当前音频帧的线性预测残差能量倾斜度; 所述线性预测残差能量倾斜 度表示音频信号的线性预测残差能量随线性预测阶数的升高而变化的程度; 将线性预测残差能量倾斜度存储到存储器中;
根据存储器中预测残差能量倾斜度部分数据的统计量, 对所述音频帧进行 分类。
19、 根据权利要求 18所述的方法, 其特征在于, 将线性预测残差能量倾斜 度存储到存储器中之前还包括:
根据所述当前音频帧的声音活动性, 确定是否将所述线性预测残差能量倾斜度存储于存储器中; 并在确定需要存储时将所述线性预测残差能量倾斜度存储于存储器中。
20、 根据权利要求 18或 19所述的方法, 其特征在于, 预测残差能量倾斜 度部分数据的统计量为预测残差能量倾斜度部分数据的方差; 所述根据存储器 中预测残差能量倾斜度部分数据的统计量, 对所述音频帧进行分类包括:
将预测残差能量倾斜度部分数据的方差与音乐分类阈值相比较, 当所述预 测残差能量倾斜度部分数据的方差小于音乐分类阈值时, 将所述当前音频帧分 类为音乐帧; 否则将所述当前音频帧分类为语音帧。
21、 根据权利要求 18或 19所述的方法, 其特征在于, 还包括:
获得当前音频帧的频谱波动、 频谱高频带峰度和频谱相关度, 并存储于对 应的存储器中;
其中, 所述根据存储器中预测残差能量倾斜度部分数据的统计量, 对所述 音频帧进行分类包括:
分别获得存储的频谱波动、 频谱高频带峰度、 频谱相关度和线性预测残差 能量倾斜度中有效数据的统计量, 根据所述有效数据的统计量将所述音频帧分 类为语音帧或者音乐帧; 所述有效数据的统计量指对存储器中存储的有效数据 运算操作后获得的数据值。
22、根据权利要求 21 所述的方法, 其特征在于, 分别获得存储的频谱波动、频谱高频带峰度、频谱相关度和线性预测残差能量倾斜度中有效数据的统计量, 根据所述有效数据的统计量将所述音频帧分类为语音帧或者音乐帧包括:

分别获得存储的频谱波动有效数据的均值, 频谱高频带峰度有效数据的均值, 频谱相关度有效数据的均值和线性预测残差能量倾斜度有效数据的方差; 当下列条件之一满足时, 将所述当前音频帧分类为音乐帧, 否则将所述当前音频帧分类为语音帧: 所述频谱波动有效数据的均值小于第一阈值; 或者频谱高频带峰度有效数据的均值大于第二阈值; 或者所述频谱相关度有效数据的均值大于第三阈值; 或者线性预测残差能量倾斜度有效数据的方差小于第四阈值。
23、 根据权利要求 18或 19所述的方法, 其特征在于, 还包括:
获得当前音频帧的频谱音调个数和频谱音调个数在低频带上的比率, 并存 储于对应的存储器;
其中, 所述根据存储器中预测残差能量倾斜度部分数据的统计量, 对所述 音频帧进行分类包括:
分别获得存储的线性预测残差能量倾斜度的统计量、 频谱音调个数的统计 量;
根据所述线性预测残差能量倾斜度的统计量、 频谱音调个数的统计量和频 谱音调个数在低频带上的比率, 将所述音频帧分类为语音帧或者音乐帧; 所述 统计量指对存储器中存储的数据运算操作后获得的数据值。
24、 根据权利要求 23所述的方法, 其特征在于, 分别获得存储的线性预测 残差能量倾斜度的统计量、 频谱音调个数的统计量包括:
获得存储的线性预测残差能量倾斜度的方差;
获得存储的频谱音调个数的均值;
根据所述线性预测残差能量倾斜度的统计量、 频谱音调个数的统计量和频 谱音调个数在低频带上的比率, 将所述音频帧分类为语音帧或者音乐帧包括: 当当前音频帧为活动帧, 且满足下列条件之一, 则将所述当前音频帧分类 为音乐帧, 否则将所述当前音频帧分类为语音帧:
线性预测残差能量倾斜度的方差小于第五阈值; 或
频谱音调个数的均值大于第六阈值; 或
频谱音调个数在低频带上的比率小于第七阈值。
25、根据权利要求 18-24所述的任一方法, 其特征在于, 获得当前音频帧的 线性预测残差能量倾斜度包括:
根据下列公式计算当前音频帧的线性预测残差能量倾斜度:
epsP_tilt = Σ_{i=1}^{n} [ epsP(i) · epsP(i+1) ] / Σ_{i=1}^{n} [ epsP(i) · epsP(i) ]
wherein epsP(i) represents the prediction residual energy of the i-th order linear prediction of the current audio frame; and n is a positive integer representing the linear prediction order, which is less than or equal to the maximum linear prediction order.
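For illustration, the per-order residual energies epsP(i) fall out of the Levinson-Durbin recursion as a by-product, so the tilt can be computed as below. This is a sketch, not the claimed implementation: the autocorrelation-based recursion and the helper names are assumptions, and the sum runs over the adjacent orders available in the energy array.

```python
import numpy as np

def residual_energies(frame, max_order):
    """Per-order linear prediction residual energies epsP(1..max_order),
    obtained as a by-product of the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(max_order + 1)
    a[0] = 1.0
    e = r[0]                       # 0th-order (signal) energy
    eps = np.empty(max_order)
    for i in range(1, max_order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / e               # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        e *= 1.0 - k * k           # epsP(i): residual energy at order i
        eps[i - 1] = e
    return eps

def eps_p_tilt(eps):
    """epsP_tilt = sum_i epsP(i)*epsP(i+1) / sum_i epsP(i)*epsP(i),
    summed over the adjacent orders present in eps."""
    return float(eps[:-1] @ eps[1:]) / float(eps[:-1] @ eps[:-1])
```

Note that the classification schemes above use the variance of this tilt across buffered frames, not the per-frame value itself.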
26. The method according to claim 23 or 24, wherein the obtaining a spectrum tone quantity of the current audio frame and a ratio of the spectrum tone quantity on a low band comprises:
counting, as the spectrum tone quantity, the quantity of frequency bins of the current audio frame that are on a 0 to 8 kHz band and have bin peak values greater than a predetermined value; and
calculating, as the ratio of the spectrum tone quantity on the low band, a ratio of the quantity of frequency bins of the current audio frame that are on a 0 to 4 kHz band and have bin peak values greater than the predetermined value to the quantity of frequency bins that are on the 0 to 8 kHz band and have bin peak values greater than the predetermined value.
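A minimal sketch of these two counts, under assumptions the claim does not fix: a 16 kHz sampling rate, a Hann-windowed FFT, and a "peak" defined as a local spectral maximum within 20 dB of the frame's strongest bin. Only the two bands and the existence of a predetermined threshold come from the text above.

```python
import numpy as np

def spectral_tone_stats(frame, fs=16000, rel_db=-20.0):
    """Count spectral peaks above a relative threshold on the 0-8 kHz band
    (spectrum tone quantity) and the share of those peaks below 4 kHz."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    mag_db = 20.0 * np.log10(spec + 1e-12)
    mag_db -= mag_db.max()                       # relative to strongest bin
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    peak = np.zeros(len(spec), dtype=bool)
    peak[1:-1] = ((mag_db[1:-1] > mag_db[:-2]) &
                  (mag_db[1:-1] > mag_db[2:]) &
                  (mag_db[1:-1] > rel_db))       # local maxima over threshold
    n_tonal = int(np.count_nonzero(peak & (freqs < 8000.0)))
    n_low = int(np.count_nonzero(peak & (freqs < 4000.0)))
    ratio = n_low / n_tonal if n_tonal else 0.0
    return n_tonal, ratio
```

With a single 1 kHz tone every detected peak lies in the low band, so the ratio is 1; adding a 6 kHz tone halves it.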
27. A signal classification apparatus, configured to classify an input audio signal, comprising:
a framing unit, configured to divide the input audio signal into frames;
a parameter obtaining unit, configured to obtain a linear prediction residual energy tilt of a current audio frame, wherein the linear prediction residual energy tilt represents the extent to which the linear prediction residual energy of the audio signal changes as the linear prediction order increases;
a storage unit, configured to store the linear prediction residual energy tilt; and
a classifying unit, configured to classify the audio frame according to statistics of a part of the data of the prediction residual energy tilts in a memory.
28. The apparatus according to claim 27, further comprising:
a storage confirming unit, configured to determine, according to voice activity of the current audio frame, whether to store the linear prediction residual energy tilt in the memory;
wherein the storage unit is specifically configured to store the linear prediction residual energy tilt in the memory when the storage confirming unit determines that the storing is needed.
29. The apparatus according to claim 27 or 28, wherein
the statistics of the part of the data of the prediction residual energy tilts are a variance of the part of the data of the prediction residual energy tilts; and
the classifying unit is specifically configured to compare the variance of the part of the data of the prediction residual energy tilts with a music classification threshold, and when the variance of the part of the data of the prediction residual energy tilts is less than the music classification threshold, classify the current audio frame as a music frame; otherwise, classify the current audio frame as a speech frame.
30. The apparatus according to claim 27 or 28, wherein the parameter obtaining unit is further configured to obtain a spectrum fluctuation, a spectrum high-band kurtosis, and a spectrum correlation of the current audio frame, and store them in corresponding memories; and
the classifying unit is specifically configured to obtain statistics of valid data of the stored spectrum fluctuations, spectrum high-band kurtoses, spectrum correlations, and linear prediction residual energy tilts, respectively, and classify the audio frame as a speech frame or a music frame according to the statistics of the valid data, wherein the statistics of the valid data are data values obtained by performing a computing operation on the valid data stored in the memories.
31. The apparatus according to claim 30, wherein the classifying unit comprises: a calculating unit, configured to obtain a mean of valid data of the stored spectrum fluctuations, a mean of valid data of the stored spectrum high-band kurtoses, a mean of valid data of the stored spectrum correlations, and a variance of valid data of the stored linear prediction residual energy tilts, respectively; and
a determining unit, configured to: when one of the following conditions is satisfied, classify the current audio frame as a music frame, and otherwise classify the current audio frame as a speech frame: the mean of the valid data of the spectrum fluctuations is less than a first threshold; or the mean of the valid data of the spectrum high-band kurtoses is greater than a second threshold; or the mean of the valid data of the spectrum correlations is greater than a third threshold; or the variance of the valid data of the linear prediction residual energy tilts is less than a fourth threshold.
32. The apparatus according to claim 27 or 28, wherein the parameter obtaining unit is further configured to obtain a spectrum tone quantity of the current audio frame and a ratio of the spectrum tone quantity on a low band, and store them in a memory; and
the classifying unit is specifically configured to: obtain statistics of the stored linear prediction residual energy tilts and statistics of the stored spectrum tone quantities, respectively; and classify the audio frame as a speech frame or a music frame according to the statistics of the linear prediction residual energy tilts, the statistics of the spectrum tone quantities, and the ratio of the spectrum tone quantity on the low band, wherein the statistics are data values obtained by performing a computing operation on the data stored in the memory.
33. The apparatus according to claim 32, wherein the classifying unit comprises: a calculating unit, configured to obtain a variance of the valid data of the linear prediction residual energy tilts and a mean of the stored spectrum tone quantities; and
a determining unit, configured to: when the current audio frame is an active frame and one of the following conditions is satisfied, classify the current audio frame as a music frame, and otherwise classify the current audio frame as a speech frame: the variance of the linear prediction residual energy tilts is less than a fifth threshold; or the mean of the spectrum tone quantities is greater than a sixth threshold; or the ratio of the spectrum tone quantity on the low band is less than a seventh threshold.
34. The apparatus according to any one of claims 27 to 33, wherein the parameter obtaining unit calculates the linear prediction residual energy tilt of the current audio frame according to the following formula:
epsP_tilt = Σ_{i=1}^{n} [ epsP(i) · epsP(i+1) ] / Σ_{i=1}^{n} [ epsP(i) · epsP(i) ]
wherein epsP(i) represents the prediction residual energy of the i-th order linear prediction of the current audio frame; and n is a positive integer representing the linear prediction order, which is less than or equal to the maximum linear prediction order.
35. The apparatus according to claim 32 or 33, wherein the parameter obtaining unit is configured to count, as the spectrum tone quantity, the quantity of frequency bins of the current audio frame that are on a 0 to 8 kHz band and have bin peak values greater than a predetermined value; and the parameter obtaining unit is configured to calculate, as the ratio of the spectrum tone quantity on the low band, a ratio of the quantity of frequency bins of the current audio frame that are on a 0 to 4 kHz band and have bin peak values greater than the predetermined value to the quantity of frequency bins that are on the 0 to 8 kHz band and have bin peak values greater than the predetermined value.
PCT/CN2013/084252 2013-08-06 2013-09-26 一种音频信号分类方法和装置 WO2015018121A1 (zh)

Priority Applications (22)

Application Number Priority Date Filing Date Title
JP2016532192A JP6162900B2 (ja) 2013-08-06 2013-09-26 オーディオ信号分類方法及び装置
EP21213287.2A EP4057284A3 (en) 2013-08-06 2013-09-26 Audio signal classification method and apparatus
KR1020167006075A KR101805577B1 (ko) 2013-08-06 2013-09-26 오디오 신호 분류 방법 및 장치
EP17160982.9A EP3324409B1 (en) 2013-08-06 2013-09-26 Audio signal classification method and apparatus
KR1020207002653A KR102296680B1 (ko) 2013-08-06 2013-09-26 오디오 신호 분류 방법 및 장치
SG11201600880SA SG11201600880SA (en) 2013-08-06 2013-09-26 Audio signal classification method and apparatus
BR112016002409-5A BR112016002409B1 (pt) 2013-08-06 2013-09-26 Método e dispositivo de classificação de sinal de áudio
EP13891232.4A EP3029673B1 (en) 2013-08-06 2013-09-26 Audio signal classification method and device
KR1020177034564A KR101946513B1 (ko) 2013-08-06 2013-09-26 오디오 신호 분류 방법 및 장치
KR1020197003316A KR102072780B1 (ko) 2013-08-06 2013-09-26 오디오 신호 분류 방법 및 장치
MX2016001656A MX353300B (es) 2013-08-06 2013-09-26 Método y aparato de clasificación de señal de audio.
EP19189062.3A EP3667665B1 (en) 2013-08-06 2013-09-26 Audio signal classification methods and apparatuses
ES13891232.4T ES2629172T3 (es) 2013-08-06 2013-09-26 Procedimiento y dispositivo de clasificación de señales de audio
AU2013397685A AU2013397685B2 (en) 2013-08-06 2013-09-26 Audio signal classification method and apparatus
US15/017,075 US10090003B2 (en) 2013-08-06 2016-02-05 Method and apparatus for classifying an audio signal based on frequency spectrum fluctuation
HK16107115.7A HK1219169A1 (zh) 2013-08-06 2016-06-21 種音頻信號分類方法和裝置
AU2017228659A AU2017228659B2 (en) 2013-08-06 2017-09-14 Audio signal classification method and apparatus
AU2018214113A AU2018214113B2 (en) 2013-08-06 2018-08-09 Audio signal classification method and apparatus
US16/108,668 US10529361B2 (en) 2013-08-06 2018-08-22 Audio signal classification method and apparatus
US16/723,584 US11289113B2 (en) 2013-08-06 2019-12-20 Linear prediction residual energy tilt-based audio signal classification method and apparatus
US17/692,640 US11756576B2 (en) 2013-08-06 2022-03-11 Classification of audio signal as speech or music based on energy fluctuation of frequency spectrum
US18/360,675 US20240029757A1 (en) 2013-08-06 2023-07-27 Linear Prediction Residual Energy Tilt-Based Audio Signal Classification Method and Apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310339218.5A CN104347067B (zh) 2013-08-06 2013-08-06 一种音频信号分类方法和装置
CN201310339218.5 2013-08-06

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/017,075 Continuation US10090003B2 (en) 2013-08-06 2016-02-05 Method and apparatus for classifying an audio signal based on frequency spectrum fluctuation

Publications (1)

Publication Number Publication Date
WO2015018121A1 true WO2015018121A1 (zh) 2015-02-12

Family

ID=52460591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/084252 WO2015018121A1 (zh) 2013-08-06 2013-09-26 一种音频信号分类方法和装置

Country Status (15)

Country Link
US (5) US10090003B2 (zh)
EP (4) EP3324409B1 (zh)
JP (3) JP6162900B2 (zh)
KR (4) KR102296680B1 (zh)
CN (3) CN106409313B (zh)
AU (3) AU2013397685B2 (zh)
BR (1) BR112016002409B1 (zh)
ES (3) ES2629172T3 (zh)
HK (1) HK1219169A1 (zh)
HU (1) HUE035388T2 (zh)
MX (1) MX353300B (zh)
MY (1) MY173561A (zh)
PT (3) PT3029673T (zh)
SG (2) SG10201700588UA (zh)
WO (1) WO2015018121A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112509601A (zh) * 2020-11-18 2021-03-16 中电海康集团有限公司 一种音符起始点检测方法及系统
CN113192488A (zh) * 2021-04-06 2021-07-30 青岛信芯微电子科技股份有限公司 一种语音处理方法及装置

Families Citing this family (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106409313B (zh) 2013-08-06 2021-04-20 华为技术有限公司 一种音频信号分类方法和装置
US9934793B2 (en) * 2014-01-24 2018-04-03 Foundation Of Soongsil University-Industry Cooperation Method for determining alcohol consumption, and recording medium and terminal for carrying out same
WO2015111772A1 (ko) * 2014-01-24 2015-07-30 숭실대학교산학협력단 음주 판별 방법, 이를 수행하기 위한 기록매체 및 단말기
KR101621766B1 (ko) 2014-01-28 2016-06-01 숭실대학교산학협력단 음주 판별 방법, 이를 수행하기 위한 기록매체 및 단말기
KR101569343B1 (ko) 2014-03-28 2015-11-30 숭실대학교산학협력단 차신호 고주파 신호의 비교법에 의한 음주 판별 방법, 이를 수행하기 위한 기록 매체 및 장치
KR101621780B1 (ko) 2014-03-28 2016-05-17 숭실대학교산학협력단 차신호 주파수 프레임 비교법에 의한 음주 판별 방법, 이를 수행하기 위한 기록 매체 및 장치
KR101621797B1 (ko) 2014-03-28 2016-05-17 숭실대학교산학협력단 시간 영역에서의 차신호 에너지법에 의한 음주 판별 방법, 이를 수행하기 위한 기록 매체 및 장치
ES2758517T3 (es) * 2014-07-29 2020-05-05 Ericsson Telefon Ab L M Estimación del ruido de fondo en las señales de audio
TWI576834B (zh) * 2015-03-02 2017-04-01 聯詠科技股份有限公司 聲頻訊號的雜訊偵測方法與裝置
US10049684B2 (en) * 2015-04-05 2018-08-14 Qualcomm Incorporated Audio bandwidth selection
TWI569263B (zh) * 2015-04-30 2017-02-01 智原科技股份有限公司 聲頻訊號的訊號擷取方法與裝置
WO2016188329A1 (zh) * 2015-05-25 2016-12-01 广州酷狗计算机科技有限公司 一种音频处理方法、装置及终端
US9965685B2 (en) * 2015-06-12 2018-05-08 Google Llc Method and system for detecting an audio event for smart home devices
JP6501259B2 (ja) * 2015-08-04 2019-04-17 本田技研工業株式会社 音声処理装置及び音声処理方法
CN106571150B (zh) * 2015-10-12 2021-04-16 阿里巴巴集团控股有限公司 一种识别音乐中的人声的方法和系统
US10902043B2 (en) 2016-01-03 2021-01-26 Gracenote, Inc. Responding to remote media classification queries using classifier models and context parameters
US9852745B1 (en) 2016-06-24 2017-12-26 Microsoft Technology Licensing, Llc Analyzing changes in vocal power within music content using frequency spectrums
GB201617408D0 (en) 2016-10-13 2016-11-30 Asio Ltd A method and system for acoustic communication of data
GB201617409D0 (en) 2016-10-13 2016-11-30 Asio Ltd A method and system for acoustic communication of data
EP3309777A1 (en) * 2016-10-13 2018-04-18 Thomson Licensing Device and method for audio frame processing
CN107221334B (zh) * 2016-11-01 2020-12-29 武汉大学深圳研究院 一种音频带宽扩展的方法及扩展装置
GB201704636D0 (en) 2017-03-23 2017-05-10 Asio Ltd A method and system for authenticating a device
GB2565751B (en) 2017-06-15 2022-05-04 Sonos Experience Ltd A method and system for triggering events
CN109389987B (zh) 2017-08-10 2022-05-10 华为技术有限公司 音频编解码模式确定方法和相关产品
US10586529B2 (en) * 2017-09-14 2020-03-10 International Business Machines Corporation Processing of speech signal
CN111279414B (zh) 2017-11-02 2022-12-06 华为技术有限公司 用于声音场景分类的基于分段的特征提取
CN107886956B (zh) * 2017-11-13 2020-12-11 广州酷狗计算机科技有限公司 音频识别方法、装置及计算机存储介质
GB2570634A (en) 2017-12-20 2019-08-07 Asio Ltd A method and system for improved acoustic transmission of data
CN108501003A (zh) * 2018-05-08 2018-09-07 国网安徽省电力有限公司芜湖供电公司 一种应用于变电站智能巡检机器人的声音识别系统和方法
CN108830162B (zh) * 2018-05-21 2022-02-08 西华大学 无线电频谱监测数据中的时序模式序列提取方法及存储方法
US11240609B2 (en) * 2018-06-22 2022-02-01 Semiconductor Components Industries, Llc Music classifier and related methods
US10692490B2 (en) * 2018-07-31 2020-06-23 Cirrus Logic, Inc. Detection of replay attack
CN108986843B (zh) * 2018-08-10 2020-12-11 杭州网易云音乐科技有限公司 音频数据处理方法及装置、介质和计算设备
JP7115556B2 (ja) 2018-10-19 2022-08-09 日本電信電話株式会社 認証認可システム及び認証認可方法
US11342002B1 (en) * 2018-12-05 2022-05-24 Amazon Technologies, Inc. Caption timestamp predictor
CN109360585A (zh) * 2018-12-19 2019-02-19 晶晨半导体(上海)股份有限公司 一种语音激活检测方法
CN110097895B (zh) * 2019-05-14 2021-03-16 腾讯音乐娱乐科技(深圳)有限公司 一种纯音乐检测方法、装置及存储介质
BR112022000806A2 (pt) * 2019-08-01 2022-03-08 Dolby Laboratories Licensing Corp Sistemas e métodos para atenuação de covariância
CN110600060B (zh) * 2019-09-27 2021-10-22 云知声智能科技股份有限公司 一种硬件音频主动探测hvad系统
KR102155743B1 (ko) * 2019-10-07 2020-09-14 견두헌 대표음량을 적용한 컨텐츠 음량 조절 시스템 및 그 방법
CN113162837B (zh) * 2020-01-07 2023-09-26 腾讯科技(深圳)有限公司 语音消息的处理方法、装置、设备及存储介质
CA3170065A1 (en) * 2020-04-16 2021-10-21 Vladimir Malenovsky Method and device for speech/music classification and core encoder selection in a sound codec
US11988784B2 (en) 2020-08-31 2024-05-21 Sonos, Inc. Detecting an audio signal with a microphone to determine presence of a playback device
CN112331233A (zh) * 2020-10-27 2021-02-05 郑州捷安高科股份有限公司 听觉信号识别方法、装置、设备及存储介质
US20220157334A1 (en) * 2020-11-19 2022-05-19 Cirrus Logic International Semiconductor Ltd. Detection of live speech
CN112201271B (zh) * 2020-11-30 2021-02-26 全时云商务服务股份有限公司 一种基于vad的语音状态统计方法、系统和可读存储介质
CN113593602B (zh) * 2021-07-19 2023-12-05 深圳市雷鸟网络传媒有限公司 一种音频处理方法、装置、电子设备和存储介质
CN113689861B (zh) * 2021-08-10 2024-02-27 上海淇玥信息技术有限公司 一种单声道通话录音的智能分轨方法、装置和系统
KR102481362B1 (ko) * 2021-11-22 2022-12-27 주식회사 코클 음향 데이터의 인식 정확도를 향상시키기 위한 방법, 장치 및 프로그램
CN114283841B (zh) * 2021-12-20 2023-06-06 天翼爱音乐文化科技有限公司 一种音频分类方法、系统、装置及存储介质
CN117147966B (zh) * 2023-08-30 2024-05-07 中国人民解放军军事科学院系统工程研究院 一种电磁频谱信号能量异常检测方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197135A (zh) * 2006-12-05 2008-06-11 华为技术有限公司 声音信号分类方法和装置
CN101221766A (zh) * 2008-01-23 2008-07-16 清华大学 音频编码器切换的方法
CN101546557A (zh) * 2008-03-28 2009-09-30 展讯通信(上海)有限公司 用于音频内容识别的分类器参数更新方法
CN101546556A (zh) * 2008-03-28 2009-09-30 展讯通信(上海)有限公司 用于音频内容识别的分类系统
CN102044246A (zh) * 2009-10-15 2011-05-04 华为技术有限公司 一种音频信号检测方法和装置
CN102543079A (zh) * 2011-12-21 2012-07-04 南京大学 一种实时的音频信号分类方法及设备

Family Cites Families (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6570991B1 (en) * 1996-12-18 2003-05-27 Interval Research Corporation Multi-feature speech/music discrimination system
JP3700890B2 (ja) * 1997-07-09 2005-09-28 ソニー株式会社 信号識別装置及び信号識別方法
DE69926821T2 (de) * 1998-01-22 2007-12-06 Deutsche Telekom Ag Verfahren zur signalgesteuerten Schaltung zwischen verschiedenen Audiokodierungssystemen
US6901362B1 (en) 2000-04-19 2005-05-31 Microsoft Corporation Audio segmentation and classification
JP4201471B2 (ja) 2000-09-12 2008-12-24 パイオニア株式会社 音声認識システム
US6658383B2 (en) * 2001-06-26 2003-12-02 Microsoft Corporation Method for coding speech and music signals
JP4696418B2 (ja) 2001-07-25 2011-06-08 ソニー株式会社 情報検出装置及び方法
US6785645B2 (en) * 2001-11-29 2004-08-31 Microsoft Corporation Real-time speech and music classifier
WO2004034379A2 (en) 2002-10-11 2004-04-22 Nokia Corporation Methods and devices for source controlled variable bit-rate wideband speech coding
KR100841096B1 (ko) * 2002-10-14 2008-06-25 리얼네트웍스아시아퍼시픽 주식회사 음성 코덱에 대한 디지털 오디오 신호의 전처리 방법
US7232948B2 (en) * 2003-07-24 2007-06-19 Hewlett-Packard Development Company, L.P. System and method for automatic classification of music
US20050159942A1 (en) * 2004-01-15 2005-07-21 Manoj Singhal Classification of speech and music using linear predictive coding coefficients
CN1815550A (zh) 2005-02-01 2006-08-09 松下电器产业株式会社 可识别环境中的语音与非语音的方法及系统
US20070083365A1 (en) 2005-10-06 2007-04-12 Dts, Inc. Neural network classifier for separating audio sources from a monophonic audio signal
JP4738213B2 (ja) * 2006-03-09 2011-08-03 富士通株式会社 利得調整方法及び利得調整装置
TWI312982B (en) * 2006-05-22 2009-08-01 Nat Cheng Kung Universit Audio signal segmentation algorithm
US20080033583A1 (en) * 2006-08-03 2008-02-07 Broadcom Corporation Robust Speech/Music Classification for Audio Signals
KR100883656B1 (ko) 2006-12-28 2009-02-18 삼성전자주식회사 오디오 신호의 분류 방법 및 장치와 이를 이용한 오디오신호의 부호화/복호화 방법 및 장치
US8849432B2 (en) 2007-05-31 2014-09-30 Adobe Systems Incorporated Acoustic pattern identification using spectral characteristics to synchronize audio and/or video
CN101320559B (zh) * 2007-06-07 2011-05-18 华为技术有限公司 一种声音激活检测装置及方法
CA2690433C (en) * 2007-06-22 2016-01-19 Voiceage Corporation Method and device for sound activity detection and sound signal classification
CN101393741A (zh) * 2007-09-19 2009-03-25 中兴通讯股份有限公司 一种宽带音频编解码器中的音频信号分类装置及分类方法
CA2715432C (en) * 2008-03-05 2016-08-16 Voiceage Corporation System and method for enhancing a decoded tonal sound signal
US8428949B2 (en) * 2008-06-30 2013-04-23 Waves Audio Ltd. Apparatus and method for classification and segmentation of audio content, based on the audio signal
PL2301011T3 (pl) * 2008-07-11 2019-03-29 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Sposób i dyskryminator do klasyfikacji różnych segmentów sygnału audio zawierającego segmenty mowy i muzyki
US9037474B2 (en) 2008-09-06 2015-05-19 Huawei Technologies Co., Ltd. Method for classifying audio signal into fast signal or slow signal
US8380498B2 (en) 2008-09-06 2013-02-19 GH Innovation, Inc. Temporal envelope coding of energy attack signal by using attack point location
CN101615395B (zh) * 2008-12-31 2011-01-12 华为技术有限公司 信号编码、解码方法及装置、系统
CN101847412B (zh) 2009-03-27 2012-02-15 华为技术有限公司 音频信号的分类方法及装置
FR2944640A1 (fr) * 2009-04-17 2010-10-22 France Telecom Procede et dispositif d'evaluation objective de la qualite vocale d'un signal de parole prenant en compte la classification du bruit de fond contenu dans le signal.
WO2011033597A1 (ja) 2009-09-19 2011-03-24 株式会社 東芝 信号分類装置
EP2490214A4 (en) * 2009-10-15 2012-10-24 Huawei Tech Co Ltd METHOD, DEVICE AND SYSTEM FOR SIGNAL PROCESSING
CN102044244B (zh) * 2009-10-15 2011-11-16 华为技术有限公司 信号分类方法和装置
CN102044243B (zh) * 2009-10-15 2012-08-29 华为技术有限公司 语音激活检测方法与装置、编码器
JP5651945B2 (ja) * 2009-12-04 2015-01-14 ヤマハ株式会社 音響処理装置
CN102098057B (zh) * 2009-12-11 2015-03-18 华为技术有限公司 一种量化编解码方法和装置
US8473287B2 (en) * 2010-04-19 2013-06-25 Audience, Inc. Method for jointly optimizing noise reduction and voice quality in a mono or multi-microphone system
CN101944362B (zh) * 2010-09-14 2012-05-30 北京大学 一种基于整形小波变换的音频无损压缩编码、解码方法
CN102413324A (zh) * 2010-09-20 2012-04-11 联合信源数字音视频技术(北京)有限公司 预编码码表优化方法与预编码方法
CN102446504B (zh) * 2010-10-08 2013-10-09 华为技术有限公司 语音/音乐识别方法及装置
RU2010152225A (ru) * 2010-12-20 2012-06-27 ЭлЭсАй Корпорейшн (US) Обнаружение музыки с использованием анализа спектральных пиков
CN102741918B (zh) * 2010-12-24 2014-11-19 华为技术有限公司 用于话音活动检测的方法和设备
WO2012083555A1 (en) * 2010-12-24 2012-06-28 Huawei Technologies Co., Ltd. Method and apparatus for adaptively detecting voice activity in input audio signal
EP3252771B1 (en) * 2010-12-24 2019-05-01 Huawei Technologies Co., Ltd. A method and an apparatus for performing a voice activity detection
US8990074B2 (en) * 2011-05-24 2015-03-24 Qualcomm Incorporated Noise-robust speech coding mode classification
CN102982804B (zh) * 2011-09-02 2017-05-03 杜比实验室特许公司 音频分类方法和系统
US9111531B2 (en) * 2012-01-13 2015-08-18 Qualcomm Incorporated Multiple coding mode signal classification
CN103021405A (zh) * 2012-12-05 2013-04-03 渤海大学 基于music和调制谱滤波的语音信号动态特征提取方法
JP5277355B1 (ja) * 2013-02-08 2013-08-28 リオン株式会社 信号処理装置及び補聴器並びに信号処理方法
US9984706B2 (en) * 2013-08-01 2018-05-29 Verint Systems Ltd. Voice activity detection using a soft decision mechanism
CN106409313B (zh) * 2013-08-06 2021-04-20 华为技术有限公司 一种音频信号分类方法和装置
US9620105B2 (en) * 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
JP6521855B2 (ja) 2015-12-25 2019-05-29 富士フイルム株式会社 磁気テープおよび磁気テープ装置



Also Published As

Publication number Publication date
KR20160040706A (ko) 2016-04-14
SG11201600880SA (en) 2016-03-30
KR20200013094A (ko) 2020-02-05
AU2018214113B2 (en) 2019-11-14
CN104347067B (zh) 2017-04-12
AU2013397685A1 (en) 2016-03-24
CN106409310A (zh) 2017-02-15
US20160155456A1 (en) 2016-06-02
US11756576B2 (en) 2023-09-12
ES2769267T3 (es) 2020-06-25
EP3029673A4 (en) 2016-06-08
ES2909183T3 (es) 2022-05-05
PT3029673T (pt) 2017-06-29
US11289113B2 (en) 2022-03-29
MX353300B (es) 2018-01-08
ES2629172T3 (es) 2017-08-07
KR20170137217A (ko) 2017-12-12
EP3667665B1 (en) 2021-12-29
HK1219169A1 (zh) 2017-03-24
JP2016527564A (ja) 2016-09-08
US20180366145A1 (en) 2018-12-20
HUE035388T2 (en) 2018-05-02
KR101805577B1 (ko) 2017-12-07
JP2018197875A (ja) 2018-12-13
JP6162900B2 (ja) 2017-07-12
US20200126585A1 (en) 2020-04-23
EP3029673B1 (en) 2017-05-10
CN104347067A (zh) 2015-02-11
CN106409310B (zh) 2019-11-19
CN106409313B (zh) 2021-04-20
KR20190015617A (ko) 2019-02-13
JP6392414B2 (ja) 2018-09-19
KR101946513B1 (ko) 2019-02-12
AU2018214113A1 (en) 2018-08-30
PT3324409T (pt) 2020-01-30
AU2017228659A1 (en) 2017-10-05
US10529361B2 (en) 2020-01-07
MY173561A (en) 2020-02-04
PT3667665T (pt) 2022-02-14
AU2013397685B2 (en) 2017-06-15
EP3667665A1 (en) 2020-06-17
CN106409313A (zh) 2017-02-15
MX2016001656A (es) 2016-10-05
EP4057284A3 (en) 2022-10-12
KR102296680B1 (ko) 2021-09-02
US20240029757A1 (en) 2024-01-25
EP3324409B1 (en) 2019-11-06
US10090003B2 (en) 2018-10-02
EP4057284A2 (en) 2022-09-14
US20220199111A1 (en) 2022-06-23
BR112016002409A2 (pt) 2017-08-01
SG10201700588UA (en) 2017-02-27
EP3029673A1 (en) 2016-06-08
BR112016002409B1 (pt) 2021-11-16
EP3324409A1 (en) 2018-05-23
KR102072780B1 (ko) 2020-02-03
AU2017228659B2 (en) 2018-05-10
JP6752255B2 (ja) 2020-09-09
JP2017187793A (ja) 2017-10-12

Similar Documents

Publication Publication Date Title
WO2015018121A1 (zh) 一种音频信号分类方法和装置
BR112014017708B1 (pt) Método e aparelho para detectar atividade de voz na presença de ruído de fundo, e, memória legível por computador
JP2015507222A (ja) 複数コーディングモード信号分類
CN115346549A (zh) 一种基于深度学习的音频带宽扩展方法、系统及编码方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13891232

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016532192

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: MX/A/2016/001656

Country of ref document: MX

NENP Non-entry into the national phase

Ref country code: DE

REG Reference to national code

Ref country code: BR

Ref legal event code: B01A

Ref document number: 112016002409

Country of ref document: BR

REEP Request for entry into the european phase

Ref document number: 2013891232

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: IDP00201601486

Country of ref document: ID

Ref document number: 2013891232

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 20167006075

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2013397685

Country of ref document: AU

Date of ref document: 20130926

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 112016002409

Country of ref document: BR

Kind code of ref document: A2

Effective date: 20160203