CN103959375A - Enhanced chroma extraction from an audio codec - Google Patents


Info

Publication number
CN103959375A
CN103959375A (Application CN201280058961.7A)
Authority
CN
China
Prior art keywords
frequency
coefficient
block
sound signal
sampling
Prior art date
Legal status
Granted
Application number
CN201280058961.7A
Other languages
Chinese (zh)
Other versions
CN103959375B (en)
Inventor
A. Biswas
M. Fink
M. Schug
Current Assignee
Dolby International AB
Original Assignee
Dolby International AB
Priority date
Filing date
Publication date
Application filed by Dolby International AB
Publication of CN103959375A
Application granted
Publication of CN103959375B
Expired - Fee Related
Anticipated expiration


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0008Associated control or indicating means
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/36Accompaniment arrangements
    • G10H1/38Chord
    • G10H1/383Chord detection and/or recognition, e.g. for correction, or automatic bass generation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032Quantisation or dequantisation of spectral components
    • G10L19/038Vector quantisation, e.g. TwinVQ audio
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/066Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/131Mathematical functions for musical analysis, processing, synthesis or composition
    • G10H2250/215Transforms, i.e. mathematical transforms into domains appropriate for musical signal processing, coding or compression
    • G10H2250/221Cosine transform; DCT [discrete cosine transform], e.g. for use in lossy audio compression such as MP3
    • G10H2250/225MDCT [Modified discrete cosine transform], i.e. based on a DCT of overlapping data
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/022Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/038Speech enhancement, e.g. noise reduction or echo cancellation using band spreading techniques
    • G10L21/0388Details of processing therefor

Abstract

The present document relates to methods and systems for music information retrieval (MIR). In particular, the present document relates to methods and systems for extracting a chroma vector from an audio signal. A method (900) for determining a chroma vector (100) for a block of samples of an audio signal (301) is described. The method (900) comprises receiving (901) a corresponding block of frequency coefficients derived from the block of samples of the audio signal (301) from a core encoder (412) of a spectral band replication based audio encoder (410) adapted to generate an encoded bitstream (305) of the audio signal (301) from the block of frequency coefficients; and determining (904) the chroma vector (100) for the block of samples of the audio signal (301) based on the received block of frequency coefficients.

Description

Enhanced chroma extraction from an audio codec
Cross-reference to related applications
This application claims priority to U.S. Provisional Patent Application No. 61/565,037, filed on November 30, 2011, the entire contents of which are incorporated herein by reference.
Technical field
This document relates to methods and systems for music information retrieval (MIR). In particular, this document relates to methods and systems for extracting a chroma vector from an audio signal in conjunction with the coding of the audio signal (for example, during the encoding of the audio signal).
Background
Due to the fact that the amount of easily accessible data has increased significantly over the past few years, navigating the available music libraries has become more and more difficult. The interdisciplinary research field known as music information retrieval (MIR) has investigated solutions for structuring and classifying music data in order to help users explore their media. For example, it would be desirable for MIR-based methods to classify music so that music of a similar type can be proposed. MIR techniques may be based on a mid-level time-frequency representation which specifies the distribution of semitone energy over time, referred to as a chromagram. The chromagram of an audio signal may be used to identify the audio signal and to extract musical information (e.g., information about the melody and/or information about the chords). However, the determination of a chromagram is typically associated with significant computational complexity.
This document addresses the complexity problem of chromagram computation methods and describes methods and systems for calculating a chromagram at reduced computational complexity. In particular, methods and systems for efficiently calculating a perceptually motivated chromagram are described.
Summary of the invention
According to an aspect, a method for determining a chroma vector for a block of samples of an audio signal is described. The block of samples may be a so-called long block of samples, which is also referred to as a frame of samples. The audio signal may be, for example, a music track. The method comprises the step of receiving, from an audio encoder (e.g., an AAC (Advanced Audio Coding) or mp3 encoder), a corresponding block of frequency coefficients derived from the block of samples of the audio signal. The audio encoder may be the core encoder of a spectral band replication (SBR) based audio encoder. By way of example, the core encoder of the SBR-based audio encoder may be an AAC or mp3 encoder; more particularly, the SBR-based audio encoder may be an HE (High Efficiency) AAC encoder or mp3PRO. Another example of an SBR-based audio encoder to which the methods described in this document may be applied is an MPEG-D USAC (Unified Speech and Audio Coding) encoder.
The (SBR-based) audio encoder is typically adapted to generate an encoded bitstream of the audio signal from the block of frequency coefficients. For this purpose, the audio encoder may quantize the block of frequency coefficients and may entropy encode the quantized block of frequency coefficients.
The method further comprises determining the chroma vector for the block of samples of the audio signal based on the received block of frequency coefficients. In particular, the chroma vector may be determined from a second block of frequency coefficients which is derived from the received block of frequency coefficients. In an embodiment, the second block of frequency coefficients is the received block of frequency coefficients. This may be the case when the received block of frequency coefficients is a long block of frequency coefficients. In another embodiment, the second block of frequency coefficients corresponds to an estimated long block of frequency coefficients. This estimated long block of frequency coefficients may be determined from a plurality of short blocks comprised within the received block of frequency coefficients.
The block of frequency coefficients may be a block of Modified Discrete Cosine Transform (MDCT) coefficients. Other examples of time-domain to frequency-domain transforms (and resulting blocks of frequency coefficients) are the MDST (Modified Discrete Sine Transform), the DFT (Discrete Fourier Transform) and the MCLT (Modified Complex Lapped Transform). In general, the block of frequency coefficients may be determined from the corresponding block of samples using a time-domain to frequency-domain transform. Conversely, the block of samples may be determined from the block of frequency coefficients using the corresponding inverse transform.
The MDCT is a lapped transform, meaning that, in this case, the block of frequency coefficients is determined from the block of samples and from further samples of the audio signal in the direct vicinity of this block of samples. In particular, the block of frequency coefficients may be determined from the block of samples and from the immediately preceding block of samples.
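As a sketch of this lapped structure, the following Python snippet computes one block of M MDCT coefficients from 2M samples (the current block plus the immediately preceding one). The sine window and the direct matrix formulation are illustrative assumptions, not the codec's actual implementation.

```python
import numpy as np

def mdct(samples, M):
    """Compute M MDCT coefficients from a window of 2*M samples.

    Illustrates the lapped nature of the MDCT: each block of M output
    coefficients is derived from 2*M input samples, i.e. the current
    block of samples plus the immediately preceding block.
    """
    n = np.arange(2 * M)
    k = np.arange(M)
    # Sine window (an illustrative choice of analysis window)
    window = np.sin(np.pi / (2 * M) * (n + 0.5))
    # Standard MDCT basis functions
    basis = np.cos(np.pi / M * (n[None, :] + 0.5 + M / 2) * (k[:, None] + 0.5))
    return basis @ (window * samples)

# Two adjacent blocks of M samples feed one block of M MDCT coefficients
M = 128
rng = np.random.default_rng(0)
signal = rng.standard_normal(2 * M)  # current block + preceding block
coeffs = mdct(signal, M)
```

Successive blocks would advance by M samples, so each input sample contributes to two consecutive coefficient blocks.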
The block of samples may comprise a succession of N short blocks, each comprising M samples. In other words, the block of samples may be (or may comprise) a sequence of N short blocks. In a similar manner, the block of frequency coefficients may comprise N corresponding short blocks, each comprising M frequency coefficients. In an embodiment, M=128 and N=8, which means that the block of samples comprises M×N=1024 samples. The audio encoder may use short blocks to encode transient audio signals, thereby increasing the time resolution while reducing the frequency resolution.
When a sequence of short blocks is received from the audio encoder, the method may comprise additional steps to increase the frequency resolution of the received sequence of short blocks of frequency coefficients, thereby enabling the determination of a chroma vector for the complete block of samples (which comprises the sequence of short blocks of samples). In particular, the method may comprise estimating, from the N short blocks of M frequency coefficients, a long block of frequency coefficients corresponding to the block of samples. The estimation is performed such that the frequency resolution of the estimated long block of frequency coefficients is increased compared to the N short blocks of frequency coefficients. In this case, the chroma vector for the block of samples of the audio signal may be determined based on the estimated long block of frequency coefficients.
It should be noted that the step of estimating a long block of frequency coefficients may be performed in a hierarchical manner for different levels of aggregation. This means that a plurality of short blocks may be aggregated into a long block, a plurality of long blocks may be aggregated into an extra-long block, and so forth. As a result, different levels of frequency resolution (and correspondingly, time resolution) may be provided. By way of example, a long block of frequency coefficients may be determined from a sequence of N short blocks (as outlined above). At the next hierarchy level, a sequence of N2 long blocks of frequency coefficients (some or all of which may have been estimated from corresponding sequences of N short blocks) may be converted into an extra-long block of N2 times as many frequency coefficients (with correspondingly higher frequency resolution). As such, the methods for estimating a long block of frequency coefficients from a sequence of short blocks of frequency coefficients may be used to hierarchically increase the frequency resolution of the chroma vector (while, at the same time, hierarchically reducing the time resolution of the chroma vector).
The step of estimating the long block of frequency coefficients may comprise interleaving the corresponding frequency coefficients of the N short blocks of frequency coefficients, thereby yielding an interleaved long block of frequency coefficients. It should be noted that such interleaving may already be performed by the audio encoder (e.g., by the core encoder) in the context of quantizing and entropy encoding the blocks of frequency coefficients. As such, the method may alternatively comprise the step of receiving the interleaved long block of frequency coefficients from the audio encoder; the interleaving step then does not consume additional computational resources. The chroma vector may be determined from the interleaved long block of frequency coefficients. Furthermore, the step of estimating the long block of frequency coefficients may comprise decorrelating the N corresponding frequency coefficients of the N short blocks of frequency coefficients by applying a transform with an energy compaction property (i.e., a transform which concentrates the energy in the low-frequency bins compared to the high-frequency bins), e.g., a DCT-II transform, to the interleaved long block of frequency coefficients. This decorrelation scheme using an energy compaction transform (e.g., a DCT-II transform) may be referred to as an adaptive hybrid transform (AHT) scheme. The chroma vector may be determined from the decorrelated, interleaved long block of frequency coefficients.
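The interleaving and DCT-II decorrelation steps can be sketched as follows. The hand-rolled orthonormal DCT-II matrix and the random test data are assumptions for illustration; an encoder would apply the transform to its actual short-block MDCT coefficients.

```python
import numpy as np

def dct2_matrix(N):
    """Orthonormal DCT-II matrix of size N x N."""
    j = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi / N * (n + 0.5) * j)
    C[0, :] /= np.sqrt(2.0)  # scale first row for orthonormality
    return C

N, M = 8, 128                       # 8 short blocks of 128 coefficients each
rng = np.random.default_rng(1)
shorts = rng.standard_normal((N, M))

# Interleave: place the N corresponding coefficients of every frequency
# bin next to each other, yielding one long block of N*M values
interleaved = shorts.T.reshape(-1)

# AHT: DCT-II across the N corresponding coefficients of every bin; its
# energy-compaction property pushes energy into the low-order outputs
aht = dct2_matrix(N) @ shorts
long_block = aht.T.reshape(-1)      # decorrelated, interleaved long block
```

Because the orthonormal DCT-II preserves energy, the decorrelated long block carries the same total energy as the short blocks, redistributed towards the low-order bins.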
Alternatively, the step of estimating the long block of frequency coefficients may comprise applying a polyphase conversion (PPC) to the N short blocks of M frequency coefficients. The polyphase conversion may be based on a conversion matrix which mathematically exactly converts the N short blocks of M frequency coefficients into a long block of N×M frequency coefficients. As such, the conversion matrix may be determined mathematically from the time-domain to frequency-domain transform (e.g., the MDCT) performed by the audio encoder. The conversion matrix may represent the combination of an inverse transform of the N short blocks of frequency coefficients into the time domain, followed by a transform of the time-domain samples into the frequency domain, thereby yielding the exact long block of N×M frequency coefficients. The polyphase conversion may use an approximation of the conversion matrix wherein only a small fraction of the conversion matrix coefficients is retained and the remainder is set to zero. By way of example, 90% or more of the conversion matrix coefficients may be set to zero. As a result, the polyphase conversion can provide an estimated long block of frequency coefficients at low computational complexity. Furthermore, this fraction may be varied as a parameter which trades off complexity against conversion quality. In other words, the fraction may be used to provide a complexity-scalable conversion.
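A minimal sketch of the complexity-scalable sparsification idea follows. A random placeholder matrix stands in for the exact MDCT-derived conversion matrix (which depends on the codec's window functions and is not reproduced here); only the zeroing of small-magnitude entries is illustrated.

```python
import numpy as np

def sparsify(T, keep_fraction):
    """Zero all but the largest-magnitude entries of a conversion matrix.

    `keep_fraction` is the complexity/quality trade-off parameter: a
    smaller fraction yields a cheaper but coarser approximation.
    """
    thresh = np.quantile(np.abs(T), 1.0 - keep_fraction)
    return np.where(np.abs(T) >= thresh, T, 0.0)

# Placeholder for the exact short-to-long conversion matrix (assumption)
rng = np.random.default_rng(2)
T = rng.standard_normal((64, 64))
T_sparse = sparsify(T, keep_fraction=0.1)   # ~90% of entries set to zero

short_coeffs = rng.standard_normal(64)      # stacked short-block coefficients
approx_long = T_sparse @ short_coeffs       # low-complexity estimate
exact_long = T @ short_coeffs               # exact (full-matrix) conversion
```

With a sparse matrix representation, the multiply cost scales with the number of retained entries, which is how the fraction parameter becomes a complexity knob.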
It should be noted that the AHT (and the PPC) may be applied to one or more subsets of the sequence of short blocks. As such, estimating the long block of frequency coefficients may comprise forming a plurality of subsets of the N short blocks of frequency coefficients. These subsets may have a length of L short blocks, thereby yielding N/L subsets. The number L of short blocks per subset may be selected based on the audio signal, thereby adapting the AHT/PPC to the particular characteristics of the audio signal (i.e., to the particular frame of the audio signal).
In the case of the AHT, the corresponding frequency coefficients of the short blocks of frequency coefficients may be interleaved for each subset, thereby yielding an interleaved intermediate block of frequency coefficients (comprising L×M coefficients) for the subset. Furthermore, for each subset, an energy compaction transform (e.g., a DCT-II transform) may be applied to the interleaved intermediate block of frequency coefficients of the subset, thereby increasing the frequency resolution of the interleaved intermediate block of frequency coefficients. In the case of the PPC, an intermediate conversion matrix may be determined which mathematically exactly converts L short blocks of M frequency coefficients into an intermediate block of L×M frequency coefficients. For each subset, the polyphase conversion (which may be referred to as an intermediate polyphase conversion) may use an approximation of the intermediate conversion matrix wherein only a small fraction of the intermediate conversion matrix coefficients is retained and the remainder is set to zero.
More generally, it may be stated that the estimation of the long block of frequency coefficients may comprise estimating a plurality of intermediate blocks of frequency coefficients (for the plurality of subsets) from the sequence of short blocks. A plurality of chroma vectors may be determined from the plurality of intermediate blocks of frequency coefficients (using the methods described in this document). As such, the frequency resolution (and the time resolution) used for determining the chroma vectors may be adapted to the characteristics of the audio signal.
The step of determining the chroma vector may comprise applying frequency-dependent psychoacoustic processing to the second block of frequency coefficients derived from the received block of frequency coefficients. The frequency-dependent psychoacoustic processing may make use of a psychoacoustic model provided by the audio encoder.
In an embodiment, applying the frequency-dependent psychoacoustic processing comprises comparing a value derived from at least one frequency coefficient of the second block of frequency coefficients with a frequency-dependent energy threshold (e.g., a frequency-dependent psychoacoustic masking threshold). The value derived from the at least one frequency coefficient may correspond to an average energy value (e.g., a scale factor band energy) derived from a plurality of frequency coefficients of a corresponding plurality of frequencies (e.g., a scale factor band). In particular, the average energy value may be a mean value of the plurality of frequency coefficients. As a result of the comparison, a frequency coefficient may be set to zero if it lies below the energy threshold. The energy threshold may be derived from the psychoacoustic model applied by the audio encoder (e.g., by the core encoder of the SBR-based audio encoder). In particular, the energy threshold may be derived from the frequency-dependent masking threshold used by the audio encoder for quantizing the blocks of frequency coefficients.
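A sketch of this thresholding step: per scale factor band, the average energy is compared against a band-specific threshold, and perceptually irrelevant bands are zeroed. The band layout and threshold values below are invented for illustration; in the described method they would come from the encoder's psychoacoustic model.

```python
import numpy as np

def apply_masking(coeffs, band_edges, masking_threshold):
    """Zero frequency coefficients in bands whose average energy lies
    below a frequency-dependent (psychoacoustic) energy threshold.

    `band_edges` delimits the scale factor bands and `masking_threshold`
    holds one energy threshold per band (both assumed to be supplied by
    the core encoder's psychoacoustic model).
    """
    out = coeffs.copy()
    for b in range(len(band_edges) - 1):
        lo, hi = band_edges[b], band_edges[b + 1]
        band_energy = np.mean(out[lo:hi] ** 2)  # average band energy
        if band_energy < masking_threshold[b]:
            out[lo:hi] = 0.0                    # perceptually irrelevant
    return out

coeffs = np.array([0.01, 0.02, 1.0, 1.2, 0.005, 0.01])
band_edges = [0, 2, 4, 6]               # three bands of two bins each
threshold = np.array([0.1, 0.1, 0.1])   # illustrative thresholds
masked = apply_masking(coeffs, band_edges, threshold)
```

Only the middle band survives in this toy example; the chroma vector is then computed from the surviving, perceptually relevant coefficients.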
The step of determining the chroma vector may comprise classifying some or all of the frequency coefficients of the second block into the tone classes of the chroma vector. Subsequently, an accumulated energy may be determined for each tone class of the chroma vector based on the classified frequency coefficients. By way of example, the frequency coefficients may be classified using band-pass filters associated with the tone classes of the chroma vector.
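The classification into tone classes might be sketched as below. A nearest-semitone mapping of bin centre frequencies is used here in place of the band-pass filters of the described method; the sampling rate and block length are illustrative assumptions.

```python
import numpy as np

def chroma_from_coeffs(coeffs, sample_rate):
    """Accumulate the energy of frequency coefficients into the twelve
    semitone (tone) classes of a chroma vector.

    Each coefficient's bin centre frequency is mapped to the nearest
    semitone; the energies of all bins sharing a pitch class are summed.
    """
    n = len(coeffs)
    chroma = np.zeros(12)
    freqs = np.arange(n) * sample_rate / (2.0 * n)  # bin centre frequencies
    for f, c in zip(freqs, coeffs):
        if f < 20.0:                  # skip sub-audible bins (and 0 Hz)
            continue
        midi = 69 + 12 * np.log2(f / 440.0)          # MIDI note number
        chroma[int(np.round(midi)) % 12] += c ** 2   # accumulate energy
    return chroma

# A single coefficient near 440 Hz (A4) should land in tone class A (= 9)
sr, n = 48000, 1024
coeffs = np.zeros(n)
bin_440 = int(round(440.0 * 2 * n / sr))  # bin nearest to 440 Hz
coeffs[bin_440] = 1.0
chroma = chroma_from_coeffs(coeffs, sr)
```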
A chromagram of the audio signal (comprising a sequence of blocks of samples) may be determined by determining a sequence of chroma vectors from the sequence of blocks of samples of the audio signal and by plotting this sequence of chroma vectors against a timeline associated with the sequence of blocks of samples. In other words, by iterating the methods outlined in this document for a sequence of blocks of samples (e.g., for a sequence of frames), reliable chroma vectors can be determined frame by frame, without ignoring any frames (e.g., without ignoring frames of transient audio signals which comprise sequences of short blocks). Consequently, a continuous chromagram can be determined (with (at least) one chroma vector per frame).
According to a further aspect, an audio encoder adapted to encode an audio signal is described. The audio encoder may comprise a core encoder adapted to encode a (possibly downsampled) low-frequency component of the audio signal. The core encoder is typically adapted to encode a block of samples of the low-frequency component by transforming the block of samples into the frequency domain, thereby yielding a corresponding block of frequency coefficients. Furthermore, the audio encoder may comprise a chroma determination unit adapted to determine a chroma vector for the block of samples of the low-frequency component of the audio signal based on the block of frequency coefficients. For this purpose, the chroma determination unit may be adapted to perform any of the method steps outlined in this document. The encoder may further comprise a spectral band replication encoder adapted to encode a corresponding high-frequency component of the audio signal. In addition, the encoder may comprise a multiplexer adapted to generate an encoded bitstream from the data provided by the core encoder and the spectral band replication encoder. Furthermore, the multiplexer may be adapted to add information derived from the chroma vector (e.g., high-level information derived from the chroma vector, such as chords and/or key) to the encoded bitstream as metadata. By way of example, the encoded bitstream may be encoded in any of the following formats: MP4 format, 3GP format, 3G2 format, LATM format.
It should be noted that the methods described in this document may also be applied at an audio decoder (e.g., an SBR-based audio decoder). Such an audio decoder typically comprises a demultiplexing and decoding unit adapted to receive an encoded bitstream and to extract (quantized) blocks of frequency coefficients from the encoded bitstream. These blocks of frequency coefficients may be used to determine chroma vectors as outlined in this document.
Accordingly, an audio decoder adapted to decode an audio signal is described. The audio decoder comprises a demultiplexing and decoding unit adapted to receive a bitstream and to extract a block of frequency coefficients from the received bitstream. The block of frequency coefficients is associated with a corresponding block of samples of a (downsampled) low-frequency component of the audio signal. In particular, the block of frequency coefficients may correspond to a quantized version of the corresponding block of frequency coefficients derived at the corresponding audio encoder. The block of frequency coefficients at the decoder may be transformed into the time domain (using an inverse transform), in order to yield a reconstructed block of samples of the (downsampled) low-frequency component of the audio signal.
Furthermore, the audio decoder comprises a chroma determination unit adapted to determine a chroma vector for the block of samples of the (low-frequency component of the) audio signal based on the block of frequency coefficients extracted from the bitstream. The chroma determination unit may be adapted to perform any of the method steps outlined in this document.
In addition, it should be noted that some audio decoders may comprise a psychoacoustic model. Examples of such audio decoders are Dolby Digital and Dolby Digital Plus. Such a psychoacoustic model may be used for determining the chroma vector (as outlined in this document).
According to a further aspect, a software program is described. The software program may be adapted for execution on a processor and for performing the method steps outlined in this document when carried out on a computing device.
According to a further aspect, a storage medium is described. The storage medium may comprise a software program adapted for execution on a processor and for performing the method steps outlined in this document when carried out on a computing device.
According to a further aspect, a computer program product is described. The computer program product may comprise executable instructions for performing the method steps outlined in this document when executed on a computing device.
It should be noted that the methods and systems, including their preferred embodiments as outlined in this document, may be used stand-alone or in combination with the other methods and systems disclosed in this document. Furthermore, all aspects of the methods and systems outlined in this document may be arbitrarily combined. In particular, the features of the claims may be combined with one another in an arbitrary manner.
Brief description of the drawings
The invention is explained below in an exemplary manner with reference to the accompanying drawings, wherein:
Fig. 1 illustrates an example chroma vector determination scheme;
Fig. 2 shows example band-pass filters for classifying the coefficients of a spectrogram into the example tone classes of a chroma vector;
Fig. 3 illustrates a block diagram of an example audio encoder comprising a chroma determination unit;
Fig. 4 shows block diagrams of an example High Efficiency Advanced Audio Coding (HE-AAC) encoder and decoder;
Fig. 5 illustrates a Modified Discrete Cosine Transform determination scheme;
Figs. 6a and 6b show example psychoacoustic frequency curves;
Figs. 7a to 7e show example sequences of (estimated) long blocks of frequency coefficients;
Fig. 8 shows example experimental results for the similarity of chroma vectors derived from various long block estimation schemes; and
Fig. 9 shows an example flow chart of a method for determining a sequence of chroma vectors for an audio signal.
Detailed description
Today's storage solutions have the capability to provide users with huge databases of music content. Online streaming services such as Simfy offer more than 13 million songs (audio files or audio signals), and these streaming services face the challenge of navigating large databases, selecting appropriate music tracks and streaming these tracks to their customers. Similarly, users who have large personal music collections stored in databases face the same problem of selecting appropriate music. In order to handle such large amounts of data, new ways of discovering music are desirable. In particular, it may be useful for a music retrieval system which knows a user's preferences regarding sample music to suggest music of a similar type to the user.
In order to identify musical similarity, a number of high-level semantic features, such as tempo, rhythm, beat, harmony, melody, genre and mood, may be required, and these high-level semantic features may need to be extracted from the music content. Music information retrieval (MIR) provides methods for calculating many of these musical features. Most MIR strategies rely on mid-level descriptors from which the necessary high-level musical features can be derived. An example of such a mid-level descriptor is the so-called chroma vector 100 illustrated in Fig. 1. A chroma vector 100 is usually a K-dimensional vector, wherein each dimension of the vector corresponds to the spectral energy of a semitone class. In the case of Western music, K=12 is typical; for other types of music, K may have different values. A chroma vector 100 may be obtained by mapping and folding the spectrum 101 of an audio signal at a particular moment in time (determined, for example, using the magnitude spectrum of a short-time Fourier transform, STFT) into a single octave. As such, the chroma vector captures the melodic and harmonic content of the audio signal at a particular moment in time, while being less sensitive to changes in timbre than the spectrogram 101.
As shown in Fig. 1, the chroma of an audio signal can be visualized by projecting the spectrum 101 onto Shepard's helical representation 102 of the perception of musical pitch. In the representation 102, chroma refers to the position on the circumference of the helix 102 as seen from directly above. Height, on the other hand, refers to the vertical position of the helix as seen from the side. The height corresponds to the position of the octave, i.e., the height indicates the octave. A chroma vector can be extracted by winding the magnitude spectrum 101 around the helix 102 and projecting the spectral energies located at the same position on the circumference of the helix 102, but in different octaves (at different heights), onto the corresponding chroma (or tone class), thereby summing up the spectral energies of each semitone class.
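By way of illustration, the folding of a magnitude spectrum into a chroma vector described above can be sketched as follows (a minimal sketch in Python; the function name, the use of the A4 reference pitch and the rounding to the nearest tone class are assumptions of this illustration, not part of the described scheme):

```python
import numpy as np

def chroma_from_spectrum(mag, fs, n_fft, f_ref=440.0, n_chroma=12):
    """Fold a magnitude spectrum into an n_chroma-bin chroma vector by
    summing the spectral energy of all bins sharing a tone class."""
    chroma = np.zeros(n_chroma)
    for k in range(1, len(mag)):            # skip the DC bin (no pitch)
        f = k * fs / n_fft                  # bin center frequency in Hz
        c = np.log2(f / f_ref) % 1.0        # chroma position in [0, 1)
        pc = int(round(c * n_chroma)) % n_chroma
        chroma[pc] += mag[k] ** 2           # accumulate spectral energy
    return chroma
```

Each FFT bin contributes its energy to the tone class obtained by wrapping its log-frequency into a single octave, which corresponds to the projection along the helix 102.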
This distribution over the semitone classes captures the harmonic content of the audio signal. The progression of chroma vectors over time is referred to as a chromagram. Chroma vector and chromagram representations can be used for identifying chord names (e.g., a C major chord comprising large chroma vector values for C, E and G), for estimating the overall key of an audio signal (the key identifies the tonic triad, the chord that represents the final point of rest of a piece of music or of a section of a piece of music), for estimating the mode of an audio signal (the mode being the type of scale, e.g., whether a piece of music is in a major or a minor key), for detecting similarities within or between songs (harmonic/melodic similarity within a song or between songs, or across a music collection, e.g., to create playlists of similar songs), for identifying songs, and/or for extracting the chorus of a song.
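As an illustration of the chord-identification use case mentioned above, a major chord can be detected by correlating a chroma vector with twelve triad templates (a hypothetical sketch; the binary templates and the dot-product scoring are assumptions of this illustration, not a method disclosed here):

```python
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def major_chord_score(chroma, root):
    """Correlate a 12-bin chroma vector with a major-triad template."""
    template = np.zeros(12)
    for interval in (0, 4, 7):        # root, major third, perfect fifth
        template[(root + interval) % 12] = 1.0
    return float(np.dot(chroma, template))

def best_major_chord(chroma):
    """Return the root name of the best-matching major triad."""
    scores = [major_chord_score(chroma, r) for r in range(12)]
    return NOTES[int(np.argmax(scores))]
```

A chroma vector with large values at C, E and G thus yields the chord name "C major".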
In this context, a chroma vector can be obtained by folding the short-time spectrum of the audio signal into a single octave and subsequently dividing the folded spectrum into a twelve-dimensional vector. This operation relies on a reasonable time-frequency representation of the audio signal (preferably one with high resolution in the frequency domain). The computation of such a time-frequency transform of the audio signal is computationally intensive and consumes most of the computing power in known chromagram calculation schemes.
In the following, a basic scheme for determining chroma vectors is described. From Table 1 (the frequencies, in Hz, of the semitones of the fourth octave of Western music), it can be seen that a direct mapping of tones to frequencies is possible once the reference pitch is known (usually 440 Hz for the tone A4).
Table 1
The factor between the frequencies of two semitones is 2^(1/12); consequently, the factor between two octaves is 2. Because doubling the frequency is equivalent to raising a tone by one octave, this system can be regarded as periodic and can be shown in a cylindrical coordinate system 102, in which the position on the circumference represents one of the 12 tones, or chroma values (referred to as c), and the longitudinal position represents the pitch height (referred to as h). The perceived pitch or frequency f can therefore be written as f = 2^(c+h), with c ∈ [0, 1), h ∈ Z.
When an audio signal (e.g., a piece of music) is analyzed with regard to its melody and harmony, a visual display of the acoustic information over time is desirable. One way is the so-called chromagram, in which the spectral content of each frame is mapped to a twelve-dimensional vector of semitones, referred to as the chroma vector, and plotted against time. The chroma value c can be obtained from a given frequency f by transforming the above equation into c = log2(f) − ⌊log2(f)⌋, where ⌊·⌋ is the flooring operation, which corresponds to folding multiple octaves of the spectrum into a single octave (as depicted by the helical representation 102). Alternatively, chroma vectors can be determined using a set of 12 band-pass filters per octave, where each band-pass is adapted to extract the spectral energy of a specific chroma from the magnitude spectrum of the audio signal at a particular time instant. In this manner, the spectral energy corresponding to each chroma (or tone class) can be isolated from the magnitude spectrum and subsequently summed up to obtain the chroma value c of the specific chroma. Fig. 2 illustrates an example band-pass filter 200 for the tone class A. Such filter-based methods for determining chroma vectors and chromagrams are described in M. Goto, "A Chorus Section Detection Method for Musical Audio Signals and its Application to a Music Listening Station," IEEE Trans. Audio, Speech, and Language Processing 14, no. 5 (September 2006): 1783-1794. Further chroma extraction methods are described in Stein, M., et al., "Evaluation and Comparison of Audio Chroma Feature Extraction Methods," 126th AES Convention, Munich, Germany, 2009. Both documents are incorporated herein by reference.
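The folding formula c = log2(f) − ⌊log2(f)⌋ can be demonstrated directly (a minimal sketch; normalizing f by the A4 reference of 440 Hz is an assumption of this illustration):

```python
import math

def chroma_value(f_hz, f_ref=440.0):
    """Chroma c in [0, 1): the flooring operation folds every octave
    of the frequency axis onto a single octave."""
    x = math.log2(f_hz / f_ref)
    return x - math.floor(x)
```

Frequencies one or more octaves apart (e.g., 330 Hz and 660 Hz) map to the same chroma value.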
As outlined above, the determination of chroma vectors and chromagrams requires a reasonable time-frequency representation of the audio signal to be determined, which is usually associated with high computational complexity. In this document, it is proposed to reduce the computational effort by incorporating the MIR processing into existing audio processing schemes that already use a similar time-frequency transform. Desirable qualities of such an existing audio processing scheme are a time-frequency representation with high frequency resolution, a highly efficient implementation of the time-frequency transform, and the availability of additional modules that can potentially be used to improve the reliability and the quality of the resulting chromagram.
Audio signals (in particular, music signals) are typically stored and/or transmitted in encoded (i.e., compressed) form. This means that the MIR processing should work in conjunction with the encoded audio signal. It is therefore proposed to determine the chroma vectors and/or the chromagram of an audio signal in conjunction with an audio encoder that uses a time-frequency transform. In particular, it is proposed to use a High-Efficiency (HE) encoder/decoder, i.e., an encoder/decoder that uses spectral band replication (SBR). An example of such an SBR-based encoder/decoder is the HE-AAC (High-Efficiency Advanced Audio Coding) encoder/decoder. The HE-AAC codec is designed to deliver a rich listening experience at very low bit rates and is therefore widely used in broadcast, mobile streaming and download services. An alternative SBR-based codec is, for example, the mp3PRO codec, which uses an mp3 core encoder instead of an AAC core encoder. In the following, reference is made to the HE-AAC codec. It should be pointed out, however, that the proposed methods and systems are also applicable to other audio codecs, in particular to other SBR-based codecs.
In this context, it is proposed in this document to use the time-frequency transform available in HE-AAC in order to determine the chroma vectors/chromagram of an audio signal. In this way, the computational complexity of the chroma vector determination is significantly reduced. Apart from the saved computational cost, a further advantage of obtaining the chromagram from an audio encoder is the fact that typical audio codecs focus on human perception. This means that a typical audio codec (such as the HE-AAC codec) provides good psychoacoustic tools that may be suitable for further chromagram enhancement. In other words, it is proposed to use the psychoacoustic tools available in the audio encoder to enhance the reliability of the chromagram.
Furthermore, it should be pointed out that the audio encoder itself also benefits from the presence of an additional chromagram computation module, because this chromagram computation module makes it possible to compute useful metadata (e.g., chord information) that can be included in the metadata of the bitstream generated by the audio encoder. This additional metadata can be used to provide an enhanced consumer experience at the decoder side. In particular, this additional metadata can be used for other MIR applications.
Fig. 3 illustrates an example block diagram of an audio encoder 300 (e.g., an HE-AAC encoder) and of a chromagram determination module 310. The audio encoder 300 encodes the audio signal 301 by transforming the audio signal 301 into the time-frequency domain using a time-frequency transform 302. An example of such a time-frequency transform 302 is the Modified Discrete Cosine Transform (MDCT) used, e.g., in the context of an AAC encoder. Typically, a frame of samples x[k] of the audio signal 301 is transformed into the frequency domain using a frequency transform (e.g., the MDCT), thereby providing a set of frequency coefficients X[k]. The set of frequency coefficients X[k] is quantized and encoded in a quantization and coding unit 303, whereby the quantization and coding typically takes a perception module 306 into account. Subsequently, the encoded audio signal is packed into a specific bitstream format (e.g., the MP4, 3GP, 3G2 or LATM format) in a coding or multiplexing unit 304. Encoding into a specific bitstream format typically comprises adding metadata to the encoded audio signal. As a result, a bitstream 305 of the specific format (e.g., an HE-AAC bitstream in MP4 format) is obtained. This bitstream 305 typically comprises the encoded data from the audio core encoder, SBR encoder data, and additional metadata.
The chromagram determination module 310 uses a time-frequency transform 311 to determine the short-time magnitude spectra 101 of the audio signal 301. Subsequently, a sequence of chroma vectors (i.e., a chromagram 313) is determined from the sequence of short-time magnitude spectra 101 in unit 312.
Fig. 3 further illustrates an encoder 350 comprising an integrated chromagram determination module. Some processing units of the combined encoder 350 correspond to the units of the stand-alone encoder 300. As indicated above, however, additional metadata derived from the chromagram 353 can be used in the bitstream coding unit 354 to enhance the encoded bitstream 355. Conversely, the chromagram determination module can make use of the time-frequency transform 302 of the encoder 350 and/or of the perception module 306 of the encoder 350. In other words, the chromagram calculation 352 (possibly applying psychoacoustic processing 356) can use the set of frequency coefficients X[k] provided by the transform 302 to determine the magnitude spectrum 101 from which the chroma vector 100 is determined. In addition, the perception module 306 can be taken into account in order to determine perceptually significant chroma vectors 100.
Fig. 4 illustrates an example SBR-based audio codec 400 as used in HE-AAC version 1 and HE-AAC version 2 (i.e., HE-AAC comprising Parametric Stereo (PS) encoding/decoding of stereo signals). In particular, Fig. 4 shows a block diagram of the HE-AAC codec 400 operating in the so-called dual-rate mode (i.e., a mode in which the core encoder 412 within the encoder 410 works at half the sampling rate of the SBR encoder 414). The audio signal 301 is provided at the input of the encoder 410 at the input sampling rate fs = fs_in. In a downsampling unit 411, the audio signal 301 is downsampled by a factor of 2 in order to provide the low-frequency component of the audio signal 301. Typically, the downsampling unit 411 comprises a low-pass filter that removes the high-frequency component prior to downsampling (thereby avoiding aliasing). The downsampling unit 411 provides the low-frequency component at the reduced sampling rate fs/2 = fs_in/2. A core encoder 412 (e.g., an AAC encoder) encodes the low-frequency component to provide an encoded bitstream of the low-frequency component.
The high-frequency component of the audio signal is encoded using SBR parameters. For this purpose, the audio signal 301 is analyzed with an analysis filterbank 413 (e.g., a quadrature mirror filterbank (QMF) with, for example, 64 frequency bands). As a result, a plurality of subband signals of the audio signal is obtained, where at each time instant t (or at each sample k) the plurality of subband signals provides an indication of the spectrum of the audio signal 301 at that time instant t. The plurality of subband signals is provided to the SBR encoder 414. The SBR encoder 414 determines a plurality of SBR parameters, where the plurality of SBR parameters enables the reconstruction of the high-frequency component of the audio signal from the (reconstructed) low-frequency component at a corresponding decoder 430. The SBR encoder 414 typically determines the plurality of SBR parameters such that the reconstructed high-frequency component, which is determined on the basis of the plurality of SBR parameters and the (reconstructed) low-frequency component, approximates the original high-frequency component. For this purpose, the SBR encoder 414 may use an error minimization criterion (e.g., a mean squared error criterion) based on the original high-frequency component and the reconstructed high-frequency component.
The plurality of SBR parameters and the encoded bitstream of the low-frequency component are combined in a multiplexer 415 (e.g., the coding unit 304) to provide an overall bitstream (e.g., an HE-AAC bitstream 305) that can be stored or transmitted. The overall bitstream 305 also comprises information on the SBR encoder settings used by the SBR encoder 414 to determine the plurality of SBR parameters. Furthermore, it is proposed in this document to add metadata derived from the chromagram 313, 353 of the audio signal 301 to the overall bitstream 305.
A corresponding decoder 430 can generate an uncompressed audio signal at the sampling rate fs_out = fs_in from the overall bitstream 305. A core decoder 431 separates the SBR parameters from the encoded bitstream of the low-frequency component. Furthermore, the core decoder 431 (e.g., an AAC decoder) decodes the encoded bitstream of the low-frequency component in order to provide a time-domain signal of the reconstructed low-frequency component at the internal sampling rate fs of the decoder 430. The reconstructed low-frequency component is analyzed using an analysis filterbank 432. It should be pointed out that in dual-rate mode the internal sampling rate fs at the decoder 430 differs from the input sampling rate fs_in and from the output sampling rate fs_out, owing to the fact that the AAC decoder 431 works in the downsampled domain, i.e., at the content sampling rate fs (which is half the input sampling rate fs_in of the audio signal 301 and half the output sampling rate fs_out).
Compared with the analysis filterbank 413 used at the encoder 410, the analysis filterbank 432 (e.g., a quadrature mirror filterbank with, for example, 32 frequency bands) typically has only half the number of frequency bands. This is due to the fact that only the reconstructed low-frequency component, rather than the entire audio signal, has to be analyzed. The resulting plurality of subband signals of the reconstructed low-frequency component is used in the SBR decoder 433, in conjunction with the received SBR parameters, to generate a plurality of subband signals of the reconstructed high-frequency component. Subsequently, a synthesis filterbank 434 (e.g., a quadrature mirror filterbank with, for example, 64 frequency bands) is used to provide the reconstructed audio signal in the time domain. Typically, the number of frequency bands of the synthesis filterbank 434 is twice the number of frequency bands of the analysis filterbank 432. The plurality of subband signals of the reconstructed low-frequency component can be fed to the lower half of the frequency bands of the synthesis filterbank 434, and the plurality of subband signals of the reconstructed high-frequency component can be fed to the upper half of the frequency bands of the synthesis filterbank 434. The reconstructed audio signal at the output of the synthesis filterbank 434 has an internal sampling rate of 2fs, corresponding to the signal sampling rate fs_out = fs_in.
In this context, the HE-AAC codec 400 provides a time-frequency transform 413 for determining the SBR parameters. However, this time-frequency transform 413 typically has a very low frequency resolution and is therefore not suitable for chromagram determination. On the other hand, the core encoder 412 (in particular an AAC core encoder) also uses a time-frequency transform (usually an MDCT) with a higher frequency resolution.
The AAC core encoder decomposes the audio signal into a sequence of segments, referred to as blocks or frames. A time-domain filter, referred to as a window, provides smooth transitions between blocks by modifying the data within these blocks. The AAC core encoder is adapted to switch dynamically between two block lengths of M = 1024 samples and M = 128 samples, referred to as long blocks and short blocks, respectively. In this way, the AAC core encoder is adapted to encode audio signals that alternate between tonal passages (stationary, harmonically rich complex spectra), using long blocks, and attacks (transient signals), using sequences of eight short blocks.
Each block of samples is transformed into the frequency domain using the Modified Discrete Cosine Transform (MDCT). In order to avoid the spectral leakage problems that typically occur in the context of block-based (also referred to as frame-based) time-frequency transforms, the MDCT uses overlapping windows, i.e., the MDCT is an example of a so-called lapped transform. This is illustrated in Fig. 5, which shows an audio signal 301 comprising a sequence of frames or blocks 501. In the illustrated example, each block 501 comprises M samples of the audio signal 301 (M = 1024 for long blocks, M = 128 for short blocks). Rather than applying the transform to a single block only, the lapped MDCT transforms two adjacent blocks in an overlapping manner, as shown in the sequence 502. In order to further smooth the transition between successive blocks, a window function w[k] of length 2M is additionally applied. Because this window is applied twice (in the transform at the encoder and in the inverse transform at the decoder), the window function w[k] should satisfy the Princen-Bradley condition. The resulting MDCT transform can be written as:
X[k] = √(2/M) · Σ_{l=0}^{2M−1} x[l] · w[l] · cos[ (π/(4M)) · (2l+1+M) · (2k+1) ],  k ∈ [0, …, M−1]
This means that M frequency coefficients X[k] are determined from 2M signal samples x[l].
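Under the stated definition, the MDCT of one windowed block can be sketched together with its inverse as follows (a minimal sketch; the sine window is one common choice satisfying the Princen-Bradley condition, and the normalization of the inverse transform is an assumption chosen such that the windowed 50% overlap-add reconstructs the signal):

```python
import numpy as np

def sine_window(M):
    """Length-2M sine window; satisfies the Princen-Bradley condition
    w[k]^2 + w[k+M]^2 = 1 required for the twice-applied window."""
    k = np.arange(2 * M)
    return np.sin(np.pi / (2 * M) * (k + 0.5))

def _mdct_basis(M):
    n = np.arange(2 * M)                      # time index l = 0..2M-1
    k = np.arange(M)[:, None]                 # frequency index k = 0..M-1
    return np.cos(np.pi / (4 * M) * (2 * n + 1 + M) * (2 * k + 1))

def mdct(x, w):
    """M frequency coefficients X[k] from one windowed 2M-sample block."""
    M = len(x) // 2
    return np.sqrt(2.0 / M) * (_mdct_basis(M) @ (x * w))

def imdct(X, w):
    """Windowed inverse transform; overlap-adding consecutive blocks
    with 50% overlap cancels the time-domain aliasing."""
    M = len(X)
    return np.sqrt(2.0 / M) * (_mdct_basis(M).T @ X) * w
```

Overlap-adding the windowed inverse transforms of consecutive blocks recovers the samples covered by two overlapping windows, which demonstrates the lapped (time-domain alias cancelling) nature of the transform.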
Subsequently, the sequence of blocks of M frequency coefficients X[k] is quantized on the basis of a psychoacoustic model. There are various psychoacoustic models used in audio coding, such as the psychoacoustic models described in the following documents: the standard ISO 13818-7:2005, Coding of Moving Pictures and Audio, 2005; ISO 14496-3:2009, Information technology – Coding of audio-visual objects – Part 3: Audio, 2009; and 3GPP, General Audio Codec audio processing functions; Enhanced aacPlus general audio codec; Encoder Specification AAC part, 2004; these documents are incorporated by reference. A psychoacoustic model typically takes into account the fact that the human ear has different sensitivities at different frequencies. In other words, the sound pressure level (SPL) required to perceive an audio signal of a particular frequency varies as a function of frequency. This is illustrated in Fig. 6a, in which the hearing threshold curve 601 (threshold in quiet) of the human ear is shown as a function of frequency. This means that the frequency coefficients X[k] can be quantized taking into account the hearing threshold curve 601 shown in Fig. 6a.
Furthermore, it should be pointed out that the hearing ability of the human ear is subject to masking. The term masking can be subdivided into spectral masking and temporal masking. Spectral masking indicates that a masking tone of a certain energy level within a certain frequency interval can mask other tones in the direct spectral neighborhood of that frequency interval. This is illustrated in Fig. 6b, in which it can be observed that the hearing threshold 602 increases in the spectral neighborhood of narrow-band noise at a level of 60 dB around the center frequencies 0.25 kHz, 1 kHz and 4 kHz, respectively. The raised hearing threshold 602 is referred to as the masking threshold Thr. This means that the frequency coefficients X[k] can be quantized taking into account the masking threshold 602 shown in Fig. 6b. Temporal masking indicates that a preceding masking signal can mask a subsequent signal (referred to as post-masking or forward masking) and/or that a subsequent masking signal can mask a preceding signal (referred to as pre-masking or backward masking).
By way of example, the psychoacoustic model from the 3GPP standard can be used. This model determines appropriate psychoacoustic masking thresholds by calculating a plurality of spectral energies X_en for a corresponding plurality of frequency bands b. The spectral energies X_en[b] of a subband b (also referred to in this document as frequency band b, and referred to as scale factor band in the context of HE-AAC) can be determined from the MDCT frequency coefficients X[k] by summing the squared MDCT coefficients, i.e., as:
X_en[b] = Σ_{k=k1}^{k2} X²[k]
A constant offset is used to simulate the worst case, i.e., a tonal signal across the entire audio frequency range. In other words, the psychoacoustic model does not distinguish between tonal and non-tonal components; all signal frames are assumed to be tonal, which implies the "worst" case. Because no distinction between tonal and non-tonal components is carried out, the computational efficiency of this psychoacoustic model is high.
The offset used corresponds to an SNR (signal-to-noise ratio) value, and the SNR value should be selected appropriately to guarantee high audio quality. For standard AAC, a logarithmic SNR value of 29 dB has been defined, and the threshold in subband b is defined as:
Thr_sc[b] = X_en[b] / SNR
The 3GPP model simulates the human auditory system by comparing the threshold Thr_sc[b] in subband b with weighted versions of the thresholds Thr_sc[b−1] and Thr_sc[b+1] of the adjacent subbands b−1 and b+1, and selecting the maximum. This comparison is carried out using different frequency-dependent weighting coefficients s_h[b] and s_l[b] for the lower and the upper neighborhood, respectively, in order to simulate the different slopes of the asymmetric masking curve 602. Accordingly, a first filtering operation, which starts from the lowest subband and approximates a slope of 15 dB/Bark, is given by:
Thr′_spr[b] = max(Thr_sc[b], s_h[b] · Thr_sc[b−1])
A second filtering operation, which starts from the highest subband and approximates a slope of 30 dB/Bark, is given by:
Thr_spr[b] = max(Thr′_spr[b], s_l[b] · Thr_spr[b+1])
In order to obtain the overall threshold Thr[b] of subband b from the calculated masking threshold Thr_spr[b], the threshold in quiet 601 (referred to as Thr_quiet[b]) should also be taken into account. This can be done by selecting, for each subband b, the larger of the two masking thresholds, so that the dominant part of the two curves is taken into account. This means that the overall masking threshold can be defined as:
Thr′[b] = max(Thr_spr[b], Thr_quiet[b])
Furthermore, in order to make the overall masking threshold Thr′[b] more robust against pre-echo problems, the following additional modification can be applied. When transient signals occur, sudden energy increases or decreases may occur in some subbands b from one block to the next. Such energy jumps can cause a sudden increase of the masking threshold Thr′[b], which would cause a sudden reduction of the quantization quality. This can lead to audible errors in the form of pre-echo artifacts in the encoded audio signal. In this context, the masking threshold can be smoothed along the time axis by selecting the masking threshold Thr[b] of the current block as a function of the masking threshold Thr_last[b] of the previous block. In particular, the masking threshold Thr[b] of the current block can be defined as:
Thr[b] = max(rpmn · Thr_spr[b], min(Thr′[b], rpelev · Thr_last[b]))
where rpmn and rpelev are suitable smoothing parameters. This reduction of the masking threshold for transient signals results in higher SMR (signal-to-mask ratio) values, leading to better quantization and, ultimately, to fewer audible errors in the form of pre-echo artifacts.
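The threshold chain described above (band energies, SNR offset, two spreading passes, threshold in quiet, temporal smoothing) can be sketched as follows (a minimal sketch; the constant spreading weights and the default parameter values are placeholders of this illustration, not values taken from the 3GPP specification):

```python
import numpy as np

def masking_thresholds(X, bands, snr_db=29.0, s_h=None, s_l=None,
                       thr_quiet=None, thr_last=None,
                       rpmn=0.01, rpelev=2.0):
    """One block of the threshold chain: band energies, SNR offset,
    two spreading passes, threshold in quiet, temporal smoothing."""
    n_bands = len(bands)
    # band energies X_en[b]: sum of squared MDCT coefficients
    x_en = np.array([np.sum(X[k1:k2] ** 2) for (k1, k2) in bands])
    # constant SNR offset (worst case: tonal signal everywhere)
    thr = x_en / (10.0 ** (snr_db / 10.0))           # Thr_sc[b]
    # placeholder spreading weights (stand-ins for the frequency-
    # dependent ~15 dB/Bark and ~30 dB/Bark slopes)
    s_h = np.full(n_bands, 0.3) if s_h is None else s_h
    s_l = np.full(n_bands, 0.1) if s_l is None else s_l
    # first pass: upwards from the lowest subband
    for b in range(1, n_bands):
        thr[b] = max(thr[b], s_h[b] * thr[b - 1])
    # second pass: downwards from the highest subband
    for b in range(n_bands - 2, -1, -1):
        thr[b] = max(thr[b], s_l[b] * thr[b + 1])
    thr_spr = thr.copy()
    # threshold in quiet: keep the dominant curve
    if thr_quiet is not None:
        thr = np.maximum(thr, thr_quiet)
    # temporal smoothing against pre-echo artifacts
    if thr_last is not None:
        thr = np.maximum(rpmn * thr_spr, np.minimum(thr, rpelev * thr_last))
    return thr
```

A single loud band raises the thresholds of its quiet neighbors through the two spreading passes, which mirrors the behavior of the asymmetric masking curve 602.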
In the quantization and coding unit 303, the MDCT coefficients of a block 501 are quantized using the masking thresholds Thr[b]. MDCT coefficients lying below the masking threshold Thr[b] are quantized and encoded with lower precision, i.e., fewer bits are spent on them. As outlined in this document, the masking thresholds Thr[b] can also be used in the context of the perceptual processing 356 prior to the chromagram calculation 352 (or within the context of the chromagram calculation 352).
In summary, it can be stated that the core encoder 412 provides:
a representation of the audio signal 301 in the time-frequency domain in the form of a sequence of MDCT coefficients (for long blocks as well as for short blocks); and
a signal-dependent perceptual model in the form of frequency-dependent (subband) masking thresholds Thr[b] (for long blocks as well as for short blocks).
These data can be used to determine the chromagram 353 of the audio signal 301. For long blocks (M = 1024 samples), the MDCT coefficients of a block typically have a sufficiently high frequency resolution for determining a chroma vector. Because the AAC core codec 412 within the HE-AAC encoder 410 operates at half the sampling frequency, the MDCT transform-domain representation used in HE-AAC has a better frequency resolution for long blocks than in the case of AAC without SBR coding. By way of example, for an audio signal 301 at a sampling rate of 44.1 kHz, the frequency resolution of the MDCT coefficients of a long block is Δf = 10.77 Hz/bin, which is sufficiently high for determining the chroma vectors of most Western popular music. In other words, the frequency resolution of the long blocks of the core encoder of the HE-AAC encoder is high enough to reliably assign spectral energy to the different tone classes of the chroma vector (see Fig. 1 and Table 1).
For short blocks (M = 128), on the other hand, the frequency resolution is Δf = 86.13 Hz/bin. Because the spacings of the fundamental frequencies (F0s) up to the sixth octave are all smaller than 86.13 Hz, the frequency resolution provided by short blocks is typically insufficient for determining chroma vectors. Nevertheless, it may be desirable to also determine chroma vectors for short blocks, because transient audio signals associated with short-block sequences can still comprise tonal information (e.g., from a xylophone or a glockenspiel, or from electronic music). Such tonal information may be important for reliable MIR applications.
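The quoted resolutions follow directly from the dual-rate structure: the core coder runs at fs_in/2, and an MDCT with M coefficients divides the corresponding Nyquist bandwidth into M equally wide bins (a minimal check of the figures above; the helper function is an assumption of this illustration):

```python
def mdct_bin_spacing(fs_core_hz, M):
    """Spacing of the M MDCT bins: the Nyquist bandwidth fs_core/2
    of the core signal divided into M equally wide bins."""
    return (fs_core_hz / 2.0) / M

fs_in = 44100.0          # input sampling rate of the audio signal 301
fs_core = fs_in / 2.0    # AAC core rate in dual-rate HE-AAC
df_long = mdct_bin_spacing(fs_core, 1024)   # ~10.77 Hz per bin
df_short = mdct_bin_spacing(fs_core, 128)   # ~86.13 Hz per bin
```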
In the following, various example schemes for improving the frequency resolution of a sequence of short blocks are described. Compared with a transform of the original time-domain audio signal blocks into the frequency domain, these example schemes have a reduced computational complexity. This means that these example schemes make it possible to determine chroma vectors from a sequence of short blocks at a reduced computational complexity (compared with a direct determination from the time-domain signal).
As outlined above, the AAC encoder typically selects a sequence of eight short blocks, rather than a single long block, in order to encode transient audio signals. In this context, a sequence of N blocks of MDCT coefficients X_l[k], l = 0, …, N−1, is provided; in the case of AAC, N = 8. A first scheme for improving the frequency resolution of the short-block spectra can be to concatenate the N blocks X_0 to X_{N−1} of frequency coefficients of length M_short (= 128) and to interleave the frequency coefficients. This short-block interleaving scheme (SIS) rearranges the frequency coefficients according to their time index into a new block X_SIS of length M_long = N·M_short (= 1024). This can be done according to:
X_SIS[kN + l] = X_l[k],  k ∈ [0, …, M_short−1],  l ∈ [0, …, N−1]
This interleaving of the frequency coefficients increases the number of frequency coefficients and thereby improves the resolution. However, because N low-resolution coefficients at the same frequency but at different time instants are mapped to N high-resolution coefficients at different frequencies but at the same time instant, an error with a variance of ±N/2 bins is introduced. Nevertheless, in the case of HE-AAC or AAC, this method makes it possible to estimate a spectrum with M_long = 1024 coefficients by interleaving the coefficients of N = 8 short blocks of length M_short = 128.
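The SIS rearrangement X_SIS[kN + l] = X_l[k] can be sketched as a simple transpose-and-flatten (an illustration; the array layout is an assumption of this sketch):

```python
import numpy as np

def sis_interleave(blocks):
    """X_SIS[k*N + l] = X_l[k]: coefficient k of short block l is
    placed at position k*N + l of the interleaved long block."""
    X = np.asarray(blocks)            # shape (N, M_short)
    N, M_short = X.shape
    return X.T.reshape(N * M_short)   # frequency-major, block-minor
```

For N = 3 toy blocks of length M_short = 2, the coefficients of all blocks at bin k = 0 come first, followed by the coefficients at bin k = 1.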
Another scheme for improving the frequency resolution of a sequence of N short blocks is based on the adaptive hybrid transform (AHT). The AHT exploits the following fact: if the time signal remains relatively constant, its spectrum will typically not change rapidly. Decorrelating such a spectral signal results in a compact representation in the low-frequency range. A transform that can be used to decorrelate the signal is the DCT-II (discrete cosine transform), which approximates the Karhunen-Loève transform (KLT). The KLT is optimal in the sense of decorrelation; however, the KLT is signal-dependent and therefore cannot be applied without high complexity. The AHT formula below can be viewed as the combination of a DCT-II kernel, which decorrelates the frequency coefficients of corresponding short-block frequency bins, with the above SIS:
X_AHT[kN + l] = √(2/N) · C_l · Σ_{m=0}^{N−1} X_m[k] · cos( (2m+1) · l · π / (2N) )

where C_l denotes the DCT-II normalization factor.
Compared with the SIS, the block of frequency coefficients X_AHT has an improved frequency resolution and a reduced error variance. At the same time, the computational complexity of the AHT scheme is lower than that of a complete MDCT of a long block of audio signal samples.
In this context, the AHT can be applied to the N = 8 short blocks of a frame (which is equivalent to a long block) in order to estimate a high-resolution long-block spectrum. The quality of the resulting chromagram thereby benefits from the approximation of a long-block spectrum, instead of using a sequence of short-block spectra. It should be pointed out that, in general, the AHT scheme can be applied to any number of blocks, because the DCT-II is a non-lapped transform. Consequently, the AHT scheme can also be applied to subsets of the sequence of short blocks. This can be useful for adapting the AHT scheme to the specific conditions of the audio signal. By way of example, a plurality of different stationary entities within the sequence of short blocks can be distinguished by calculating a spectral similarity measure and segmenting the sequence of short blocks into different subsets. These subsets can then be processed with the AHT in order to improve the frequency resolution of these subsets.
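The AHT of a short-block sequence, i.e., a DCT-II across the N blocks for each frequency bin k followed by SIS-style interleaving, can be sketched as follows (a minimal sketch; C_0 = 1/√2 and C_l = 1 otherwise is the standard DCT-II normalization, assumed here):

```python
import numpy as np

def aht(blocks):
    """DCT-II across the N short blocks for every frequency bin k,
    interleaved SIS-style into a length N*M_short pseudo-spectrum."""
    X = np.asarray(blocks, dtype=float)       # shape (N, M_short)
    N, M_short = X.shape
    m = np.arange(N)
    out = np.empty(N * M_short)
    for k in range(M_short):
        for l in range(N):
            c_l = 1.0 / np.sqrt(2.0) if l == 0 else 1.0   # DCT-II norm
            basis = np.cos((2 * m + 1) * l * np.pi / (2 * N))
            out[k * N + l] = np.sqrt(2.0 / N) * c_l * np.dot(X[:, k], basis)
    return out
```

For a stationary signal, whose N short-block spectra are identical, the energy compacts into the l = 0 coefficient of each bin k, which illustrates the decorrelation effect described above.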
Another scheme for improving the frequency resolution of the sequence of MDCT coefficient blocks X_l[k], l=0, ..., N-1 is to use a polyphase description of the underlying MDCT transform of the sequence of short blocks and of the MDCT transform of the long block. By doing so, a transition matrix Y can be determined which performs an exact conversion of the sequence of MDCT coefficient blocks X_l[k], l=0, ..., N-1 (i.e. the sequence of short blocks) into the MDCT coefficient block of the long block, that is:
X_PPC = Y · [X_0, ..., X_{N-1}]
where X_PPC represents a [3, MN] matrix comprising the MDCT coefficients of the long block and the influence of the two previous frames, Y is a [MN, MN, 3] transition matrix (the third dimension of the matrix Y representing the fact that the coefficients of the matrix Y are polynomials of order 3, which means that a matrix element describes an equation a·z^{-2} + b·z^{-1} + c·z^{0}, where z represents a delay of one frame), and [X_0, ..., X_{N-1}] is the [1, MN] vector formed by the MDCT coefficients of the N short blocks. N is the number of short blocks which form a long block of length N×M, and M is the number of samples in a short block.
The transition matrix Y is determined from a synthesis matrix G and an analysis matrix H, i.e. Y = G·H, where the synthesis matrix G transforms the N short blocks into the time domain and the analysis matrix H transforms the time-domain samples of the long block into the frequency domain. The transition matrix Y allows the long-block MDCT coefficients to be reconstructed perfectly from the N sets of short-block MDCT coefficients. It can be shown that the transition matrix Y is sparse, which means that a significant portion of the matrix coefficients of the transition matrix Y can be set to zero without significantly affecting the accuracy of the conversion. This is due to the fact that both matrices G and H comprise weighted DCT-IV transform coefficients; the resulting transition matrix Y = G·H is sparse because the DCT is an orthogonal transform. Consequently, many coefficients of the transition matrix Y can be ignored in the computation, as they are close to zero. Typically, it is sufficient to consider a band of q coefficients around the main diagonal. This approach makes the complexity and the precision of the short-block to long-block conversion scalable, since q can be selected from 1 to M×N. It can be shown that the complexity of the conversion is O(q·MN·3), compared with a complexity of O((MN)^2) for a recursive implementation, or of O(MN·log(MN)), of the long-block MDCT. This means that the conversion using the polyphase transition matrix Y can be implemented with a lower computational complexity than a recomputation of the MDCT of the long block.
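The exploitation of the sparsity of Y can be sketched as follows; for simplicity, a single polynomial tap of Y is treated as a dense [MN, MN] matrix, and only a band of q coefficients around the main diagonal is kept. The function name and the banding convention are assumptions, not the reference implementation:

```python
import numpy as np

def banded_polyphase_convert(Y, short_coeffs, q):
    """Apply a transition matrix Y to the interleaved short-block MDCT
    coefficients [X_0, ..., X_{N-1}], keeping only a band of q coefficients
    around the main diagonal of Y (all other coefficients treated as zero).

    Y: dense (MN, MN) matrix standing in for one polynomial tap of the
       [MN, MN, 3] transition matrix; short_coeffs: length-MN vector."""
    MN = Y.shape[0]
    out = np.zeros(MN)
    for i in range(MN):
        lo, hi = max(0, i - q), min(MN, i + q + 1)  # band around the diagonal
        out[i] = Y[i, lo:hi] @ short_coeffs[lo:hi]  # O(q) per output coefficient
    return out
```

The per-tap cost is O(q·MN), which for the three polynomial taps matches the O(q·MN·3) complexity stated above; with q = MN the conversion becomes exact.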
" Fast audio feature extraction from compressed audio data " at G.Schuller, M.Gruhne and T.Friedrich, Selected Topics in Signal Processing, IEEE Journal of, 5 (6): 1262-1271, in oct.2011, described the details about heterogeneous conversion, the document is incorporated to by reference.
As a result of the polyphase conversion, an estimate of the long-block MDCT coefficients X_PPC is obtained, which provides N times the frequency resolution of the short-block MDCT coefficients [X_0, ..., X_{N-1}]. This means that the estimated long-block MDCT coefficients X_PPC typically have a frequency resolution which is sufficiently high for determining a chroma vector.
Figs. 7a to 7e illustrate example spectrograms of an audio signal comprising distinct frequency components, which can be seen from the spectrogram 700 based on the long-block MDCT. It can be seen from the spectrogram 701 shown in Fig. 7b that the spectrogram 700 is well approximated by the estimated long-block MDCT coefficients X_PPC. In the illustrated example, q=32, i.e. only 3% of the coefficients of the transition matrix Y are considered. This means that the estimate of the long-block MDCT coefficients X_PPC can be determined at a significantly reduced computational complexity.
Fig. 7c illustrates the spectrogram 702 based on the estimated long-block MDCT coefficients X_AHT. It can be observed that the frequency resolution is lower than that of the exact long-block MDCT coefficients shown in the spectrogram 700. At the same time, it can be seen that the estimated long-block MDCT coefficients X_AHT provide a higher frequency resolution than the estimated long-block MDCT coefficients X_SIS shown in the spectrogram 703 of Fig. 7d, which itself provides a higher frequency resolution than the short-block MDCT coefficients [X_0, ..., X_{N-1}] indicated in the spectrogram 704 of Fig. 7e.
The different frequency resolutions provided by the various short-block to long-block conversion schemes outlined above are also reflected in the quality of the chroma vectors determined from the various estimates of the long-block MDCT coefficients. This is shown in Fig. 8, which illustrates the average chroma similarity for several test files. The chroma similarity may, for example, indicate the mean square error between the chroma vectors obtained from the exact long-block MDCT coefficients and the chroma vectors obtained from the estimated long-block MDCT coefficients. The reference numeral 801 indicates the benchmark of the chroma similarity. It can be seen that the estimates determined on the basis of the polyphase conversion have a relatively high degree of similarity 802. The polyphase conversion was performed with q=32, i.e. at 3% of the full conversion complexity. Furthermore, the degree of similarity 803 achieved with the adaptive hybrid transform, the degree of similarity 804 achieved with the short-block interleaving scheme, and the degree of similarity 805 achieved on the basis of the short blocks are illustrated.
In this regard, methods have been described which allow a chromagram to be determined on the basis of the MDCT coefficients provided by the core encoder (e.g. an AAC core encoder) of an SBR-based audio encoder. It has been outlined how the resolution of a sequence of short-block MDCT coefficients may be improved by approximating the corresponding long-block MDCT coefficients. The long-block MDCT coefficients can be determined at a reduced computational complexity, compared with a recomputation of the long-block MDCT coefficients from the time domain. In this regard, the chroma vectors of transient audio signals can also be determined at a reduced computational complexity.
In the following, methods for enhancing the chromagram from a perceptual point of view are described. In particular, methods are described which make use of the perceptual model provided by the audio encoder.
As outlined above, the purpose of the psychoacoustic model in perceptual and lossless audio encoders is typically to determine how coarsely certain parts of the spectrum may be quantized, subject to a given bit rate. In other words, the psychoacoustic model of the encoder provides a measure of the perceptual relevance of each frequency band b. Under the premise that the perceptually relevant parts mainly comprise harmonic content, the application of a masking threshold should improve the quality of the chromagram. This should be particularly beneficial for the chromagram of a polyphonic signal, since the noise parts of the audio signal are ignored or at least attenuated.
It has been outlined how the masking threshold Thr[b] may be determined frame by frame (i.e. block by block) for a frequency band b. The encoder applies this masking threshold by comparing, for each frequency coefficient X[k], the masking threshold Thr[b] with the energy X_en[b] of the audio signal in the frequency band b (which, in the case of HE-AAC, is also referred to as a scale factor band) comprising the frequency index k. Whenever the energy value X_en[b] falls below the masking value, X[k] is ignored. It should be noted that, typically, a coefficient-wise comparison of the frequency coefficients (i.e. energy values) X[k] with the masking threshold Thr[b] of the frequency band b provides, compared with a band-wise comparison, only a minor quality benefit for chromagrams determined in accordance with the methods described in this document for chord recognition applications. On the other hand, a coefficient-wise comparison would increase the computational complexity. In this regard, a block-wise comparison using the average energy value X_en[b] of each frequency band b may be preferable.
Typically, the energy (also referred to as scale factor band energy) of a frequency band b which comprises harmonic contributors should be higher than the perceptual masking threshold Thr[b]. On the other hand, the energy of a frequency band b which mainly comprises noise should be smaller than the masking threshold Thr[b]. In this regard, the encoder provides a perceptually motivated, noise-reduced version of the frequency coefficients X[k], which can be used for determining the chroma vector of a given frame (and the chromagram of a sequence of frames).
Alternatively, a modified masking threshold may be determined from the data available at the audio encoder. Given the scale factor band energy distribution X_en[b] of a particular block (or frame), a modified masking threshold Thr_constSMR may be determined by using a constant SMR (signal-to-mask ratio) for all scale factor bands b, i.e. Thr_constSMR = X_en[b] - SMR. This modified masking threshold can be determined at low computational cost, since it only requires a subtraction. Furthermore, the modified masking threshold strictly follows the energy of the spectrum, so that the amount of disregarded spectral data can easily be adjusted by adjusting the SMR value of the encoder.
It should be noted that the SMR of a tone may depend on the tone amplitude and the tone frequency. In this regard, as an alternative to the constant SMR described above, the SMR may be adjusted/modified based on the scale factor band energy X_en[b] and/or the band index b.
Furthermore, it should be noted that the scale factor band energy distribution X_en[b] of a particular block (or frame) may be received directly from the audio encoder. The audio encoder typically determines this scale factor band energy distribution X_en[b] in the context of the (psychoacoustic) quantization. The method for determining the chroma vector of a frame may receive the already calculated scale factor band energy distribution X_en[b] from the audio encoder (instead of calculating the energy values), in order to determine the above-mentioned masking threshold, thereby reducing the computational complexity of the chroma vector determination.
The modified masking threshold may be applied as follows. If it is assumed that only a single harmonic contributor is present per scale factor band b, the energy X_en[b] in the band b and the coefficient X[k] of the energy spectrum should have similar values. Consequently, reducing X_en[b] by the constant SMR value should yield a modified masking threshold which captures only the harmonic part of the spectrum. The non-harmonic parts of the spectrum should be set to zero. The chroma vector of a frame (and the chromagram of a sequence of frames) may then be determined from the modified (i.e. perceptually processed) frequency coefficients.
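A minimal sketch of the modified masking threshold Thr_constSMR = X_en[b] - SMR and of its application might look as follows; the band layout, the dB formulation and the default SMR value are assumptions:

```python
import numpy as np

def mask_with_constant_smr(coeffs, band_edges, smr_db=20.0):
    """Zero out spectral coefficients below a modified masking threshold.

    Per scale factor band b the threshold is Thr_constSMR[b] = X_en[b] - SMR
    (in dB); coefficients of the band whose energy falls below the threshold
    are set to zero, so that ideally only harmonic peaks survive.
    band_edges: list of (start, stop) coefficient index pairs, one per band."""
    out = coeffs.copy()
    for start, stop in band_edges:
        band = coeffs[start:stop]
        e_db = 10.0 * np.log10(np.sum(band ** 2) + 1e-12)  # band energy X_en[b] in dB
        thr_db = e_db - smr_db                              # modified masking threshold
        coef_db = 10.0 * np.log10(band ** 2 + 1e-12)        # per-coefficient energy in dB
        out[start:stop] = np.where(coef_db < thr_db, 0.0, band)
    return out
```

Because the threshold is tied to the band energy itself, raising or lowering the SMR value directly controls how much of the spectrum is disregarded, as described above.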
Fig. 9 illustrates a flow chart of an exemplary method 900 for determining a sequence of chroma vectors from a sequence of blocks of an audio signal. In step 901, a block of frequency coefficients (e.g. MDCT coefficients) is received. This block of frequency coefficients is received from an audio encoder which derives the block of frequency coefficients from a corresponding block of samples of the audio signal. In particular, the block of frequency coefficients may be derived from the (down-sampled) low-frequency component of the audio signal by the core encoder of an SBR-based audio encoder. If the block of frequency coefficients corresponds to a sequence of short blocks, the method 900 performs one of the short-block to long-block conversion methods outlined in this document (step 902) (e.g. the SIS, AHT or PPC scheme). As a result, an estimate of a long block of frequency coefficients is obtained. Optionally, as outlined above, the method 900 may subject the (estimated) block of frequency coefficients to a psychoacoustic frequency-dependent threshold (step 903). Subsequently, a chroma vector is determined from the resulting long block of frequency coefficients (step 904). If the method is repeated for a sequence of blocks, a chromagram of the audio signal is obtained (step 905).
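As a rough illustration of step 904, the mapping from a long block of frequency coefficients to a 12-bin chroma vector can be sketched as follows; the bin-center frequency formula, the A4 reference of 440 Hz and the normalization are assumptions:

```python
import numpy as np

def chroma_vector(long_block, sample_rate=44100.0, f_ref=440.0):
    """Map each bin of a long block of frequency coefficients to one of the
    12 pitch classes and accumulate the bin energies.

    The bin-center frequency (k + 0.5) * fs / (2 * n) assumes n MDCT
    coefficients derived from 2n time-domain samples; f_ref is the assumed
    A4 tuning reference."""
    n = len(long_block)
    chroma = np.zeros(12)
    for k in range(1, n):  # skip the lowest bin
        f = (k + 0.5) * sample_rate / (2.0 * n)
        pitch_class = int(round(12.0 * np.log2(f / f_ref))) % 12
        chroma[pitch_class] += long_block[k] ** 2
    return chroma / (np.sum(chroma) + 1e-12)  # normalized chroma vector
```

Applying this function to each (estimated) long block of a sequence yields the chromagram of step 905.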
In this document, various methods and systems for determining a chroma vector and/or a chromagram at a reduced computational complexity have been described. In particular, it has been proposed to use the time-frequency representation of the audio signal provided by an audio codec (such as an HE-AAC codec). In order to provide a continuous chromagram (also for the transient parts of the audio signal, where the encoder desirably or undesirably switches to short blocks), methods for improving the frequency resolution of the short-block time-frequency representation have been described. Furthermore, it has been proposed to use the psychoacoustic model provided by the audio codec in order to improve the perceptual salience of the chromagram.
It should be noted that the description and drawings merely illustrate the principles of the proposed methods and systems. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples recited herein are principally intended expressly to be only for pedagogical purposes, to aid the reader in understanding the principles of the proposed methods and systems and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects and embodiments of the invention, as well as specific examples thereof, are intended to encompass equivalents thereof.
The methods and systems described in this document may be implemented as software, firmware and/or hardware. Certain components may, for example, be implemented as software running on a digital signal processor or microprocessor. Other components may, for example, be implemented as hardware and/or as application-specific integrated circuits. The signals encountered in the described methods and systems may be stored on media such as random access memory or optical storage media. They may be transferred via networks, such as radio networks, satellite networks, wireless networks or wired networks, e.g. the Internet. Typical devices making use of the methods and systems described in this document are portable electronic devices or other consumer equipment which is used to store and/or render audio signals.

Claims (35)

1. A method (900) for determining a chroma vector (100) for a block of samples of an audio signal (301), the method (900) comprising:
- receiving (901), from a core encoder (412) of a spectral band replication based audio encoder (410), a block of frequency coefficients derived from the block of samples of the audio signal (301), wherein the core encoder (412) is adapted to generate an encoded bitstream (305) of the audio signal (301) from the block of frequency coefficients; and
- determining (904) the chroma vector (100) for the block of samples of the audio signal (301) based on the received block of frequency coefficients.
2. The method (900) of claim 1, wherein the spectral band replication based audio encoder (410) applies any one of: High Efficiency Advanced Audio Coding, mp3PRO and MPEG-D USAC.
3. The method (900) of any previous claim, wherein the block of frequency coefficients is any one of:
- a block of modified discrete cosine transform coefficients, the modified discrete cosine transform being referred to as MDCT;
- a block of modified discrete sine transform coefficients, the modified discrete sine transform being referred to as MDST;
- a block of discrete Fourier transform coefficients, the discrete Fourier transform being referred to as DFT; and
- a block of modified complex lapped transform coefficients, the modified complex lapped transform being referred to as MCLT.
4. The method (900) of any previous claim, wherein
- the block of samples comprises N succeeding short blocks, each of the N succeeding short blocks having M samples, respectively;
- the block of frequency coefficients comprises N corresponding short blocks, each corresponding short block having M frequency coefficients, respectively.
5. The method (900) of claim 4, wherein the method further comprises:
- estimating (902), from the N short blocks of M frequency coefficients, a long block of frequency coefficients corresponding to the block of samples; wherein the estimated long block of frequency coefficients has an increased frequency resolution compared with the N short blocks of frequency coefficients; and
- determining (904) the chroma vector for the block of samples of the audio signal (301) based on the estimated long block of frequency coefficients.
6. The method (900) of claim 5, wherein estimating (902) the long block of frequency coefficients comprises interleaving the corresponding frequency coefficients of the N short blocks of frequency coefficients, thereby yielding an interleaved long block of frequency coefficients.
7. The method (900) of claim 6, wherein estimating (902) the long block of frequency coefficients comprises decorrelating the N corresponding frequency coefficients of the N short blocks of frequency coefficients by applying a transform with energy compaction properties to the interleaved long block of frequency coefficients, the transform with energy compaction properties being, for example, a DCT-II transform.
8. The method (900) of claim 5, wherein estimating (902) the long block of frequency coefficients comprises:
- forming a plurality of subsets of the N short blocks of frequency coefficients, wherein the number of short blocks in each subset is selected based on the audio signal;
- for each subset, interleaving the corresponding frequency coefficients of the short blocks of frequency coefficients, thereby yielding an interleaved intermediate block of frequency coefficients for the subset; and
- for each subset, applying a transform with energy compaction properties to the interleaved intermediate block of frequency coefficients of the subset, thereby yielding a plurality of estimated intermediate blocks of frequency coefficients for the plurality of subsets, the transform with energy compaction properties being, for example, a DCT-II transform.
9. The method (900) of claim 5, wherein estimating (902) the long block of frequency coefficients comprises: applying a polyphase transform to the N short blocks of M frequency coefficients.
10. The method (900) of claim 9, wherein
- the polyphase transform is based on a transition matrix for mathematically exactly transforming the N short blocks of M frequency coefficients into a long block of N×M frequency coefficients; and
- the polyphase transform uses an approximation of the transition matrix, in which a fraction of the transition matrix coefficients is set to zero.
11. The method (900) of claim 10, wherein a fraction of 90% or more of the transition matrix coefficients is set to zero.
12. The method (900) of claim 5, wherein estimating (902) the long block of frequency coefficients comprises:
- forming a plurality of subsets of the N short blocks of frequency coefficients, wherein the number L of short blocks in each subset, with L<N, is selected based on the audio signal;
- applying an intermediate polyphase transform to the plurality of subsets, thereby yielding a plurality of estimated intermediate blocks of frequency coefficients; wherein the intermediate polyphase transform is based on an intermediate transition matrix for mathematically exactly transforming L short blocks of M frequency coefficients into an intermediate block of L×M frequency coefficients; and
wherein the intermediate polyphase transform uses an approximation of the intermediate transition matrix, in which a fraction of the intermediate transition matrix coefficients is set to zero.
13. The method (900) of any of claims 10 to 12, wherein the fraction is variable, thereby varying the quality of the estimated block of frequency coefficients.
14. The method (900) of any of claims 4 to 13, wherein M=128 and N=8.
15. The method (900) of any of claims 5 to 14, further comprising:
- estimating, from a corresponding plurality of long blocks of frequency coefficients, a very long block of frequency coefficients corresponding to a plurality of blocks of samples; wherein the estimated very long block of frequency coefficients has an increased frequency resolution compared with the plurality of long blocks of frequency coefficients.
16. The method (900) of any previous claim, wherein determining the chroma vector (100) comprises applying (903) a frequency-dependent psychoacoustic processing to a second block of frequency coefficients derived from the received block of frequency coefficients.
17. The method (900) of claim 16 when referring back to any of claims 5 to 7 and 9 to 11, wherein the second block of frequency coefficients is the estimated long block of frequency coefficients.
18. The method (900) of claim 16 when referring back to any of claims 1 to 4, wherein the second block of frequency coefficients is the received block of frequency coefficients.
19. The method (900) of claim 16 when referring back to any of claims 8 and 12, wherein the second block of frequency coefficients is one of the plurality of estimated intermediate blocks of frequency coefficients.
20. The method (900) of claim 16 when referring back to claim 15, wherein the second block of frequency coefficients is the estimated very long block of frequency coefficients.
21. The method (900) of any of claims 16 to 20, wherein applying (903) the frequency-dependent psychoacoustic processing comprises:
- comparing a value derived from at least one frequency coefficient of the second block of frequency coefficients with a frequency-dependent energy threshold; and
- setting the frequency coefficient to zero if the frequency coefficient is below the energy threshold.
22. The method (900) of claim 21, wherein the value derived from the at least one frequency coefficient corresponds to an average energy derived from a plurality of frequency coefficients at a corresponding plurality of frequencies.
23. The method (900) of any of claims 21 to 22, wherein the energy threshold is derived from a psychoacoustic model applied by the core encoder (412).
24. The method (900) of claim 23, wherein the energy threshold is derived from a frequency-dependent masking threshold used by the core encoder for quantizing the block of frequency coefficients.
25. The method (900) of any of claims 16 to 24, wherein determining the chroma vector (100) comprises:
- classifying some or all of the frequency coefficients of the second block into the tone classes of the chroma vector (100); and
- determining an accumulated energy for the tone classes of the chroma vector (100) based on the classified frequency coefficients.
26. The method (900) of claim 25, wherein the frequency coefficients are classified using band-pass filters (200) associated with the tone classes of the chroma vector (100).
27. The method (900) of any previous claim, further comprising:
- determining a sequence of chroma vectors (100) from a sequence of blocks of samples of the audio signal (301), thereby yielding a chromagram of the audio signal (301).
28. An audio encoder (350, 410) adapted to encode an audio signal (301), the audio encoder (350, 410) comprising:
- a core encoder (302, 412) adapted to encode a down-sampled low-frequency component of the audio signal (301), wherein the core encoder (412) is adapted to encode a block of samples of the low-frequency component by transforming the block of samples into the frequency domain, thereby yielding a corresponding block of frequency coefficients; and
- a chroma determination unit (352, 356) adapted to determine a chroma vector (100) for the block of samples of the low-frequency component of the audio signal (301) based on the block of frequency coefficients.
29. The encoder (350, 410) of claim 28, further comprising a spectral band replication encoder (414) adapted to encode a corresponding high-frequency component of the audio signal (301).
30. The encoder (350, 410) of claim 29, further comprising:
- a multiplexer (354, 415) adapted to generate an encoded bitstream (355) from the data provided by the core encoder (302, 412) and the spectral band replication encoder (414), wherein the multiplexer (354, 415) is adapted to add information derived from the chroma vector (100) as metadata to the encoded bitstream (355).
31. The encoder (350, 410) of claim 30, wherein the encoded bitstream (355) is encoded in any one of the following formats: the MP4 format, the 3GP format, the 3G2 format, the LATM format.
32. An audio decoder (430) adapted to decode an audio signal (301), the audio decoder (430) comprising:
- a demultiplexing and decoding unit (431) adapted to receive an encoded bitstream and to extract a block of frequency coefficients from the encoded bitstream; wherein the block of frequency coefficients is associated with a corresponding block of samples of a down-sampled low-frequency component of the audio signal (301); and
- a chroma determination unit (352, 356) adapted to determine a chroma vector (100) for the block of samples of the audio signal (301) based on the block of frequency coefficients.
33. A software program adapted for execution on a processor and for performing the method steps of any of claims 1 to 27 when carried out on the processor.
34. A storage medium comprising a software program adapted for execution on a processor and for performing the method steps of any of claims 1 to 27 when carried out on a computing device.
35. A computer program product comprising executable instructions for performing the method steps of any of claims 1 to 27 when executed on a computer.
CN201280058961.7A 2011-11-30 2012-11-28 Enhanced chroma extraction from an audio codec Expired - Fee Related CN103959375B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201161565037P 2011-11-30 2011-11-30
US61/565,037 2011-11-30
PCT/EP2012/073825 WO2013079524A2 (en) 2011-11-30 2012-11-28 Enhanced chroma extraction from an audio codec

Publications (2)

Publication Number Publication Date
CN103959375A true CN103959375A (en) 2014-07-30
CN103959375B CN103959375B (en) 2016-11-09

Family

ID=47720463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201280058961.7A Expired - Fee Related CN103959375B (en) Enhanced chroma extraction from an audio codec

Country Status (5)

Country Link
US (1) US9697840B2 (en)
EP (1) EP2786377B1 (en)
JP (1) JP6069341B2 (en)
CN (1) CN103959375B (en)
WO (1) WO2013079524A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107731240A (en) * 2016-08-12 2018-02-23 黑莓有限公司 System and method for Compositing Engine sound
CN107864678A (en) * 2015-06-26 2018-03-30 亚马逊技术公司 Detection and interpretation to visual detector
CN109273013A (en) * 2015-03-13 2019-01-25 杜比国际公司 Decode the audio bit stream with the frequency spectrum tape copy metadata of enhancing
CN110442075A (en) * 2018-05-04 2019-11-12 康茂股份公司 Monitor method, corresponding monitoring system and the computer program product for the treatment of stations mode of operation
CN111863030A (en) * 2020-07-30 2020-10-30 广州酷狗计算机科技有限公司 Audio detection method and device

Families Citing this family (16)

Publication number Priority date Publication date Assignee Title
US11271993B2 (en) 2013-03-14 2022-03-08 Aperture Investments, Llc Streaming music categorization using rhythm, texture and pitch
US10225328B2 (en) 2013-03-14 2019-03-05 Aperture Investments, Llc Music selection and organization using audio fingerprints
US10242097B2 (en) * 2013-03-14 2019-03-26 Aperture Investments, Llc Music selection and organization using rhythm, texture and pitch
US10061476B2 (en) 2013-03-14 2018-08-28 Aperture Investments, Llc Systems and methods for identifying, searching, organizing, selecting and distributing content based on mood
US10623480B2 (en) 2013-03-14 2020-04-14 Aperture Investments, Llc Music categorization using rhythm, texture and pitch
EP2830065A1 (en) 2013-07-22 2015-01-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for decoding an encoded audio signal using a cross-over filter around a transition frequency
EP2830058A1 (en) * 2013-07-22 2015-01-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Frequency-domain audio coding supporting transform length switching
JP6220701B2 (en) * 2014-02-27 2017-10-25 日本電信電話株式会社 Sample sequence generation method, encoding method, decoding method, apparatus and program thereof
WO2015136159A1 (en) * 2014-03-14 2015-09-17 Berggram Development Oy Method for offsetting pitch data in an audio file
US20220147562A1 (en) 2014-03-27 2022-05-12 Aperture Investments, Llc Music streaming, playlist creation and streaming architecture
US9935604B2 (en) * 2015-07-06 2018-04-03 Xilinx, Inc. Variable bandwidth filtering
KR20180088184A (en) * 2017-01-26 2018-08-03 삼성전자주식회사 Electronic apparatus and control method thereof
EP3382700A1 (en) * 2017-03-31 2018-10-03 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for post-processing an audio signal using a transient location detection
EP3382701A1 (en) 2017-03-31 2018-10-03 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for post-processing an audio signal using prediction based shaping
JP7230464B2 (en) * 2018-11-29 2023-03-01 ヤマハ株式会社 SOUND ANALYSIS METHOD, SOUND ANALYZER, PROGRAM AND MACHINE LEARNING METHOD
WO2020178322A1 (en) * 2019-03-06 2020-09-10 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for converting a spectral resolution

Citations (1)

Publication number Priority date Publication date Assignee Title
WO2011000780A1 (en) * 2009-06-29 2011-01-06 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Bandwidth extension encoder, bandwidth extension decoder and phase vocoder

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
JP2001154698A (en) * 1999-11-29 2001-06-08 Victor Co Of Japan Ltd Audio encoding device and its method
US6930235B2 (en) * 2001-03-15 2005-08-16 Ms Squared System and method for relating electromagnetic waves to sound waves
JP2006018023A (en) 2004-07-01 2006-01-19 Fujitsu Ltd Audio signal coding device, and coding program
US7627481B1 (en) 2005-04-19 2009-12-01 Apple Inc. Adapting masking thresholds for encoding a low frequency transient signal in audio data
KR100715949B1 (en) 2005-11-11 2007-05-08 Samsung Electronics Co., Ltd. Method and apparatus for classifying mood of music at high speed
WO2007070007A1 (en) 2005-12-14 2007-06-21 Matsushita Electric Industrial Co., Ltd. A method and system for extracting audio features from an encoded bitstream for audio classification
EP2022041A1 (en) 2006-04-14 2009-02-11 Koninklijke Philips Electronics N.V. Selection of tonal components in an audio spectrum for harmonic and key analysis
US8463719B2 (en) * 2009-03-11 2013-06-11 Google Inc. Audio classification for information retrieval using sparse features
TWI484473B (en) 2009-10-30 2015-05-11 Dolby Int Ab Method and system for extracting tempo information of audio signal from an encoded bit-stream, and estimating perceptually salient tempo of audio signal
GEP20146081B (en) 2009-12-07 2014-04-25 Dolby Laboratories Licensing Corp Decoding of multichannel audio encoded bit streams using adaptive hybrid transformation

Non-Patent Citations (2)

Title
G. Schuller et al.: "Fast Audio Feature Extraction from Compressed Audio Data", IEEE Journal of Selected Topics in Signal Processing, 1 October 2011 (2011-10-01), pages 1262-1271, XP011386720, DOI: 10.1109/JSTSP.2011.2158802 *
Ravelli E. et al.: "Audio Signal Representations for Indexing in the Transform Domain", IEEE Transactions on Audio, Speech and Language Processing, 1 March 2010 (2010-03-01), pages 434-446 *

Cited By (9)

Publication number Priority date Publication date Assignee Title
CN109273013A (en) * 2015-03-13 2019-01-25 Dolby International AB Decoding an audio bitstream with enhanced spectral band replication metadata
CN109273013B (en) * 2015-03-13 2023-04-04 Dolby International AB Decoding an audio bitstream with enhanced spectral band replication metadata
US11664038B2 (en) 2015-03-13 2023-05-30 Dolby International AB Decoding audio bitstreams with enhanced spectral band replication metadata in at least one fill element
CN107864678A (en) * 2015-06-26 2018-03-30 Amazon Technologies, Inc. Detection and interpretation of visual indicators
CN107864678B (en) * 2015-06-26 2021-09-28 Amazon Technologies, Inc. Detection and interpretation of visual indicators
CN107731240A (en) * 2016-08-12 2018-02-23 BlackBerry Limited System and method for synthesizing engine sound
CN107731240B (en) * 2016-08-12 2023-06-27 BlackBerry Limited System and method for synthesizing engine sound
CN110442075A (en) * 2018-05-04 2019-11-12 Comau S.p.A. Method for monitoring the operating state of a processing station, corresponding monitoring system and computer program product
CN111863030A (en) * 2020-07-30 2020-10-30 Guangzhou Kugou Computer Technology Co., Ltd. Audio detection method and device

Also Published As

Publication number Publication date
WO2013079524A3 (en) 2013-07-25
CN103959375B (en) 2016-11-09
WO2013079524A2 (en) 2013-06-06
EP2786377B1 (en) 2016-03-02
EP2786377A2 (en) 2014-10-08
US20140310011A1 (en) 2014-10-16
JP2015504539A (en) 2015-02-12
US9697840B2 (en) 2017-07-04
JP6069341B2 (en) 2017-02-01

Similar Documents

Publication Publication Date Title
CN103959375B (en) Enhanced chroma extraction from an audio codec
KR101370515B1 (en) Complexity Scalable Perceptual Tempo Estimation System And Method Thereof
CN101297356B (en) Audio compression
Lu et al. A robust audio classification and segmentation method
JP6262668B2 (en) Bandwidth extension parameter generation device, encoding device, decoding device, bandwidth extension parameter generation method, encoding method, and decoding method
EP1441330B1 (en) Method of encoding and/or decoding digital audio using time-frequency correlation and apparatus performing the method
Nematollahi et al. Digital speech watermarking based on linear predictive analysis and singular value decomposition
Wu et al. Low bitrates audio object coding using convolutional auto-encoder and densenet mixture model
RU2409874C9 (en) Audio signal compression
Sankar et al. Mel scale-based linear prediction approach to reduce the prediction filter order in CELP paradigm
Sen et al. Feature extraction
Zhou et al. A robust audio fingerprinting algorithm in MP3 compressed domain
Bhattacharjee et al. Speech/music classification using phase-based and magnitude-based features
Hollosi et al. Complexity Scalable Perceptual Tempo Estimation from HE-AAC Encoded Music
Lotfaliei A New Audio Compression Scheme that Leverages Repetition in Music
Lin et al. Audio Bandwidth Extension Using Audio Super-Resolution
Li et al. Research on Audio Processing Method Based on 3D Technology
D'Aguanno et al. Tempo induction algorithm in MP3 compressed domain
CN117649846A (en) Speech recognition model generation method, speech recognition method, device and medium
Umapathy et al. Audio Coding and Classification: Principles and Algorithms
Marrakchi-Mezghani et al. Robustness of Audio Fingerprinting Systems for Connected Audio Applications
Fink et al. Enhanced Chroma Feature Extraction from HE-AAC Encoder
Laaksonen Kaistanlaajennus korkealaatuisessa audiokoodauksessa [Bandwidth extension in high-quality audio coding]
Te Li MPEG-4 Scalable Lossless Audio Transparent Bitrate and Its Application
JIA Complexity-scalable bit detection with MP3 audio bitstreams

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161109

Termination date: 20181128
