US20150279383A1 - Processing Audio Signals with Adaptive Time or Frequency Resolution - Google Patents

Processing Audio Signals with Adaptive Time or Frequency Resolution

Info

Publication number
US20150279383A1
Authority
US
United States
Prior art keywords
audio
auditory
frequency
time
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US14/735,635
Other versions
US9165562B1
Inventor
Brett G. Crockett
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/922,394 (US20020116178A1)
Priority claimed from PCT/US2002/004317 (WO2002084645A2)
Priority claimed from PCT/US2002/005999 (WO2002097792A1)
Priority to US14/735,635 (US9165562B1)
Application filed by Dolby Laboratories Licensing Corp
Assigned to DOLBY LABORATORIES LICENSING CORPORATION (assignment of assignors interest; see document for details). Assignors: CROCKETT, BRETT G.
Priority to US14/842,208 (US20150371649A1)
Publication of US20150279383A1
Publication of US9165562B1
Application granted
Priority to US15/264,828 (US20170004838A1)
Anticipated expiration
Current legal status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/0204: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/022: Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • G10L 19/025: Detection of transients or attacks for time/frequency resolution switching
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00: Details of television systems
    • H04N 5/44: Receiver circuitry for the reception of television signals according to analogue transmission standards
    • H04N 5/60: Receiver circuitry for the reception of television signals according to analogue transmission standards for the sound signals
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 29/00: Monitoring arrangements; Testing arrangements
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00: Details of television systems
    • H04N 5/04: Synchronising

Definitions

  • the present invention pertains to the field of psychoacoustic processing of audio signals.
  • the invention relates to aspects of dividing or segmenting audio signals into “auditory events,” each of which tends to be perceived as separate and distinct, and to aspects of generating reduced-information representations of audio signals based on auditory events and, optionally, also based on the characteristics or features of audio signals within such auditory events.
  • Auditory events may be useful as defining the MPEG-7 “Audio Segments” as proposed by the “ISO/IEC JTC 1/SC 29/WG 11.”
  • ASA: auditory scene analysis
  • MPEG: ISO/IEC JTC 1/SC 29/WG 11
  • MPEG-7: ISO/IEC JTC 1/SC 29/WG 11
  • a common shortcoming of such methods is that they ignore auditory scene analysis.
  • Such methods seek to measure, periodically, certain “classical” signal processing parameters such as pitch, amplitude, power, harmonic structure and spectral flatness.
  • Such parameters, while providing useful information, do not analyze and characterize audio signals into elements perceived as separate and distinct according to human cognition.
  • MPEG-7 descriptors may be useful in characterizing an Auditory Event identified in accordance with aspects of the present invention.
  • a computationally efficient process for dividing audio into temporal segments or “auditory events” that tend to be perceived as separate and distinct is provided.
  • the locations of the boundaries of these auditory events provide valuable information that can be used to describe an audio signal.
  • the locations of auditory event boundaries can be assembled to generate a reduced-information representation, “signature,” or “fingerprint” of an audio signal that can be stored for use, for example, in comparative analysis with other similarly generated signatures (as, for example, in a database of known works).
  • Bregman notes that “[w]e hear discrete units when the sound changes abruptly in timbre, pitch, loudness, or (to a lesser extent) location in space.” ( Auditory Scene Analysis—The Perceptual Organization of Sound, supra at page 469). Bregman also discusses the perception of multiple simultaneous sound streams when, for example, they are separated in frequency.
  • the audio event detection process detects changes in spectral composition with respect to time.
  • the process according to an aspect of the present invention also detects auditory events that result from changes in spatial location with respect to time.
  • the process may also detect changes in amplitude with respect to time that would not be detected by detecting changes in spectral composition with respect to time.
  • the process divides audio into time segments by analyzing the entire frequency band (full bandwidth audio) or substantially the entire frequency band (in practical implementations, band limiting filtering at the ends of the spectrum is often employed) and giving the greatest weight to the loudest audio signal components.
  • This approach takes advantage of a psychoacoustic phenomenon in which at smaller time scales (20 milliseconds (ms) and less) the ear may tend to focus on a single auditory event at a given time. This implies that while multiple events may be occurring at the same time, one component tends to be perceptually most prominent and may be processed individually as though it were the only event taking place. Taking advantage of this effect also allows the auditory event detection to scale with the complexity of the audio being processed.
  • the auditory event detection identifies the “most prominent” (i.e., the loudest) audio element at any given moment.
  • the most prominent audio element may be determined by taking hearing threshold and frequency response into consideration.
  • While the locations of the auditory event boundaries computed from full-bandwidth audio provide useful information related to the content of an audio signal, it might be desired to provide additional information further describing the content of an auditory event for use in audio signal analysis.
  • an audio signal could be analyzed across two or more frequency subbands and the location of frequency subband auditory events determined and used to convey more detailed information about the nature of the content of an auditory event. Such detailed information could provide additional information unavailable from wideband analysis.
  • the process may also take into consideration changes in spectral composition with respect to time in discrete frequency subbands (fixed or dynamically determined or both fixed and dynamically determined subbands) rather than the full bandwidth.
  • This alternative approach would take into account more than one audio stream in different frequency subbands rather than assuming that only a single stream is perceptible at a particular time.
  • An auditory event detecting process may be implemented by dividing a time domain audio waveform into time intervals or blocks and then converting the data in each block to the frequency domain, using either a filter bank or a time-frequency transformation, such as the FFT.
  • the amplitude of the spectral content of each block may be normalized in order to eliminate or reduce the effect of amplitude changes.
  • Each resulting frequency domain representation provides an indication of the spectral content (amplitude as a function of frequency) of the audio in the particular block.
  • the spectral content of successive blocks is compared and changes greater than a threshold may be taken to indicate the temporal start or temporal end of an auditory event.
  • FIG. 1 shows an idealized waveform of a single channel of orchestral music illustrating auditory events. The spectral changes that occur as a new note is played trigger the new auditory events 2 and 3 at samples 2048 and 2560, respectively.
  • a single band of frequencies of the time domain audio waveform may be processed, preferably either the entire frequency band of the spectrum (which may be about 50 Hz to 15 kHz in the case of an average quality music system) or substantially the entire frequency band (for example, a band defining filter may exclude the high and low frequency extremes).
  • the frequency domain data is normalized, as is described below.
  • the degree to which the frequency domain data needs to be normalized gives an indication of amplitude. Hence, if a change in this degree exceeds a predetermined threshold, that too may be taken to indicate an event boundary. Event start and end points resulting from spectral changes and from amplitude changes may be ORed together so that event boundaries resulting from either type of change are identified.
  • each channel may be treated independently and the resulting event boundaries for all channels may then be ORed together.
  • an auditory event that abruptly switches directions will likely result in an “end of event” boundary in one channel and a “start of event” boundary in another channel.
  • the auditory event detection process of the present invention is capable of detecting auditory events based on spectral (timbre and pitch), amplitude and directional changes.
  • the spectrum of the time domain waveform prior to frequency domain conversion may be divided into two or more frequency bands.
  • Each of the frequency bands may then be converted to the frequency domain and processed as though it were an independent channel in the manner described above.
  • the resulting event boundaries may then be ORed together to define the event boundaries for that channel.
  • the multiple frequency bands may be fixed, adaptive, or a combination of fixed and adaptive. Tracking filter techniques employed in audio noise reduction and other arts, for example, may be employed to define adaptive frequency bands (e.g., dominant simultaneous sine waves at 800 Hz and 2 kHz could result in two adaptively-determined bands centered on those two frequencies).
  • the event boundary information may be preserved.
  • the frequency domain magnitude of a digital audio signal contains useful frequency information out to a frequency of Fs/2 where Fs is the sampling frequency of the digital audio signal.
  • Fs is the sampling frequency of the digital audio signal.
  • the frequency subbands may be analyzed over time in a manner similar to a full bandwidth auditory event detection method.
  • the subband auditory event information provides additional information about an audio signal that more accurately describes the signal and differentiates it from other audio signals. This enhanced differentiating capability may be useful if the audio signature information is to be used to identify matching audio signals from a large number of audio signatures. For example, as shown in FIG. 2, a frequency subband auditory event analysis (with an auditory event boundary resolution of 512 samples) has found multiple subband auditory events starting, variously, at samples 1024 and 1536 and ending, variously, at samples 2560, 3072 and 3584. It is unlikely that this level of signal detail would be available from a single, wideband auditory scene analysis.
  • the subband auditory event information may be used to derive an auditory event signature for each subband. While this would increase the size of the audio signal's signature and possibly increase the computation time required to compare multiple signatures, it could also greatly reduce the probability of falsely classifying two signatures as being the same. A tradeoff between signature size, computational complexity and signal accuracy could be made depending upon the application.
  • the auditory events may be ORed together to provide a single set of “combined” auditory event boundaries (at samples 1024, 1536, 2560, 3072 and 3584). Although this would result in some loss of information, it provides a single set of event boundaries, representing combined auditory events, that provides more information than the information of a single subband or a wideband analysis.
  • each channel is analyzed independently and the auditory event boundary information of each may either be retained separately or be combined to provide combined auditory event information. This is somewhat analogous to the case of multiple subbands.
  • Combined auditory events may be better understood by reference to FIG. 3 that shows the auditory scene analysis results for a two channel audio signal.
  • FIG. 3 shows time concurrent segments of audio data in two channels.
  • ASA processing of the audio in a first channel, the top waveform of FIG. 3, identifies auditory event boundaries at samples that are multiples of the 512 sample spectral-profile block size, 1024 and 1536 samples in this example.
  • ASA processing results in event boundaries at samples that are also multiples of the spectral-profile block size, at samples 1024, 2048 and 3072 in this example.
  • a combined auditory event analysis for both channels results in combined auditory event segments with boundaries at samples 1024, 1536, 2048 and 3072 (the auditory event boundaries of the channels are “ORed” together).
  • N is 512 samples in this example
  • a block size of 512 samples has been found to determine auditory event boundaries with sufficient accuracy as to provide satisfactory results.
  • FIG. 3A shows three auditory events. These events include the (1) quiet portion of audio before the transient, (2) the transient event, and (3) the echo/sustain portion of the audio transient.
  • a speech signal is represented in FIG. 3B having a predominantly high-frequency sibilance event, and events as the sibilance evolves or “morphs” into the vowel, the first half of the vowel, and the second half of the vowel.
  • FIG. 3 also shows the combined event boundaries when the auditory event data is shared across the time concurrent data blocks of two channels. Such event segmentation provides five combined auditory event regions (the event boundaries are ORed together).
  • FIG. 4 shows an example of a four channel input signal.
  • Channels 1 and 4 each contain three auditory events and channels 2 and 3 each contain two auditory events.
  • the combined auditory event boundaries for the concurrent data blocks across all four channels are located at sample numbers 512, 1024, 1536, 2560 and 3072 as indicated at the bottom of FIG. 4.
  • the processed audio may be digital or analog and need not be divided into blocks.
  • the input signals likely are one or more channels of digital audio represented by samples in which consecutive samples in each channel are divided into blocks of, for example, 4096 samples (as in the examples of FIGS. 1, 3 and 4, above).
  • auditory events are determined by examining blocks of audio sample data preferably representing approximately 20 ms of audio or less, which is believed to be the shortest auditory event recognizable by the human ear.
  • auditory events are likely to be determined by examining blocks of, for example, 512 samples, which corresponds to about 11.6 ms of input audio at a sampling rate of 44.1 kHz, within larger blocks of audio sample data.
  • blocks rather than “subblocks” when referring to the examination of segments of audio data for the purpose of detecting auditory event boundaries. Because the audio sample data is examined in blocks, in practice, the auditory event temporal start and stop point boundaries necessarily will each coincide with block boundaries. There is a trade off between real-time processing requirements (as larger blocks require less processing overhead) and resolution of event location (smaller blocks provide more detailed information on the location of auditory events).
  • an audio processing apparatus includes an audio decoder, a filterbank, and a processor.
  • the audio decoder decodes an encoded audio signal to obtain a time-domain audio signal, the encoded audio signal including a plurality of spectral components.
  • the filterbank splits the time-domain audio signal to obtain a plurality of complex-valued subband samples in a first frequency region.
  • the processor generates a plurality of subband samples in a second frequency region based at least in part on the complex-valued subband samples in the first frequency region, adaptively groups at least some of the plurality of subband samples in the second frequency region with an adaptive time resolution or adaptive frequency resolution, and determines a spectral profile of at least some of the subband samples in the second frequency region based on the adaptive grouping.
  • FIG. 1 is an idealized waveform of a single channel of orchestral music illustrating auditory events.
  • FIG. 2 is an idealized conceptual schematic diagram illustrating the concept of dividing full bandwidth audio into frequency subbands in order to identify subband auditory events.
  • the horizontal scale is samples and the vertical scale is frequency.
  • FIG. 3 is a series of idealized waveforms in two audio channels, showing audio events in each channel and combined audio events across the two channels.
  • FIG. 3A shows three auditory events, including the quiet portion of audio before the transient, the transient event, and the echo/sustain portion of the audio transient.
  • FIG. 3B represents a speech signal having a predominantly high-frequency sibilance event, and events as the sibilance evolves or “morphs” into the vowel, the first half of the vowel, and the second half of the vowel.
  • FIG. 4 is a series of idealized waveforms in four audio channels showing audio events in each channel and combined audio events across the four channels.
  • FIG. 5 is a flow chart showing the extraction of audio event locations and the optional extraction of dominant subbands from an audio signal in accordance with the present invention.
  • FIG. 6 is a conceptual schematic representation depicting spectral analysis in accordance with the present invention.
  • FIGS. 7-9 are flow charts showing more generally three alternative arrangements equivalent to the flow chart of FIG. 5 .
  • auditory scene analysis is composed of three general processing steps as shown in a portion of FIG. 5 .
  • the first step 5-1 (“Perform Spectral Analysis”) takes a time-domain audio signal, divides it into blocks and calculates a spectral profile or spectral content for each of the blocks.
  • Spectral analysis transforms the audio signal into the short-term frequency domain. This can be performed using any filterbank, either based on transforms or banks of bandpass filters, and in either linear or warped frequency space (such as the Bark scale or critical band, which better approximate the characteristics of the human ear). With any filterbank there exists a tradeoff between time and frequency. Greater time resolution, and hence shorter time intervals, leads to lower frequency resolution. Greater frequency resolution, and hence narrower subbands, leads to longer time intervals.
  • the first step calculates the spectral content of successive time segments of the audio signal.
  • the ASA block size is 512 samples of the input audio signal.
  • the differences in spectral content from block to block are determined (“Perform spectral profile difference measurements”).
  • the second step calculates the difference in spectral content between successive time segments of the audio signal.
  • a powerful indicator of the beginning or end of a perceived auditory event is believed to be a change in spectral content.
  • in the third step 5-3 (“Identify location of auditory event boundaries”), when the spectral difference between one spectral-profile block and the next is greater than a threshold, the block boundary is taken to be an auditory event boundary.
  • auditory event boundaries define auditory events having a length that is an integral multiple of spectral profile blocks with a minimum length of one spectral profile block (512 samples in this example).
  • event boundaries need not be so limited.
  • the input block size may vary, for example, so as to be essentially the size of an auditory event.
  • the locations of event boundaries may be stored as a reduced-information characterization or “signature” and formatted as desired, as shown in step 5-4.
  • An optional process step 5-5 (“Identify dominant subband”) uses the spectral analysis of step 5-1 to identify a dominant frequency subband that may also be stored as part of the signature.
  • the dominant subband information may be combined with the auditory event boundary information in order to define a feature of each auditory event.
  • FIG. 6 shows a conceptual representation of non-overlapping 512 sample blocks being windowed and transformed into the frequency domain by the Discrete Fourier Transform (DFT). Each block may be windowed and transformed into the frequency domain, such as by using the DFT, preferably implemented as a Fast Fourier Transform (FFT) for speed.
  • DFT: Discrete Fourier Transform
  • any integer numbers may be used for the variables above.
  • M is set equal to a power of 2 so that standard FFTs may be used for the spectral profile calculations.
  • N, M, and P are chosen such that Q is an integer number, this will avoid under-running or over-running audio at the end of the N samples.
  • the parameters listed may be set to:
  • the above-listed values were determined experimentally and were found generally to identify with sufficient accuracy the location and duration of auditory events. However, setting the value of P to 256 samples (50% overlap) rather than zero samples (no overlap) has been found to be useful in identifying some hard-to-find events. While many different types of windows may be used to minimize spectral artifacts due to windowing, the window used in the spectral profile calculations is an M-point Hanning, Kaiser-Bessel or other suitable, preferably non-rectangular, window. The above-indicated values and a Hanning window type were selected after extensive experimental analysis as they have been shown to provide excellent results across a wide range of audio material. Non-rectangular windowing is preferred for the processing of audio signals with predominantly low frequency content.
  • Rectangular windowing produces spectral artifacts that may cause incorrect detection of events.
  • codec: encoder/decoder
  • the spectrum of each M-sample block may be computed by windowing the data by an M-point Hanning, Kaiser-Bessel or other suitable window, converting to the frequency domain using an M-point Fast Fourier Transform, and calculating the magnitude of the complex FFT coefficients.
  • the resultant data is normalized so that the largest magnitude is set to unity, and the normalized array of M numbers is converted to the log domain.
  • the array need not be converted to the log domain, but the conversion simplifies the calculation of the difference measure in step 5-2.
  • the log domain more closely matches the nature of the human auditory system.
  • the resulting log domain values have a range of minus infinity to zero.
  • a lower limit can be imposed on the range of values; the limit may be fixed, for example ⁇ 60 dB, or be frequency-dependent to reflect the lower audibility of quiet sounds at low and very high frequencies. (Note that it would be possible to reduce the size of the array to M/2 in that the FFT represents negative as well as positive frequencies).
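  • As a rough illustration of the spectral-profile computation just described (step 5-1), the following Python/NumPy sketch windows one M-sample block, takes its FFT, normalizes the magnitude to unity, converts to the log domain and imposes a -60 dB floor. The function name, the use of 20*log10 as the log-domain conversion, and the numerical guard values are illustrative assumptions rather than details taken from this document.

```python
import numpy as np

def spectral_profile(block, floor_db=-60.0):
    """Log-domain spectral profile of one M-sample block (step 5-1 sketch)."""
    m = len(block)                          # M, e.g. 512 samples
    windowed = block * np.hanning(m)        # M-point Hanning window
    mag = np.abs(np.fft.fft(windowed))      # magnitude of the complex FFT coefficients
    mag = mag[: m // 2]                     # optional: keep only positive frequencies (size M/2)
    mag /= max(mag.max(), 1e-12)            # normalize so the largest magnitude is unity
    log_mag = 20.0 * np.log10(np.maximum(mag, 1e-12))   # convert to the log (dB) domain
    return np.maximum(log_mag, floor_db)    # impose a fixed lower limit, e.g. -60 dB
```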
  • Step 5-2 calculates a measure of the difference between the spectra of adjacent blocks. For each block, each of the M (log) spectral coefficients from step 5-1 is subtracted from the corresponding coefficient for the preceding block, and the magnitude of the difference calculated (the sign is ignored). These M differences are then summed to one number. Hence, for a contiguous time segment of audio, containing Q blocks, the result is an array of Q positive numbers, one for each block. The greater the number, the more a block differs in spectrum from the preceding block.
  • This difference measure may also be expressed as an average difference per spectral coefficient by dividing the difference measure by the number of spectral coefficients used in the sum (in this case M coefficients).
  • Step 5-3 identifies the locations of auditory event boundaries by applying a threshold to the array of difference measures from step 5-2.
  • When a difference measure exceeds the threshold, the change in spectrum is deemed sufficient to signal a new event and the block number of the change is recorded as an event boundary.
  • the threshold may be set equal to 2500 if the whole magnitude FFT (including the mirrored part) is compared or 1250 if half the FFT is compared (as noted above, the FFT represents negative as well as positive frequencies—for the magnitude of the FFT, one is the mirror image of the other). This value was chosen experimentally and it provides good auditory event boundary detection. This parameter value may be changed to reduce (increase the threshold) or increase (decrease the threshold) the detection of events.
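  • Steps 5-2 and 5-3 can then be sketched as follows: the absolute coefficient-wise differences between the log-domain profiles of adjacent blocks are summed to a single number per block, and blocks whose difference measure exceeds the threshold are recorded as auditory event boundaries. The function below is an illustrative sketch (its name and array handling are assumptions); the default threshold of 1250 corresponds to the half-FFT value quoted above. Multiplying the returned block numbers by the 512-sample block size gives boundary locations in samples, as in the examples of FIGS. 1-4.

```python
import numpy as np

def event_boundaries(profiles, threshold=1250.0):
    """Steps 5-2 and 5-3 sketch: spectral-profile differences and thresholding.

    `profiles` is a (Q, K) array holding the log-domain spectral profile of
    each of Q contiguous blocks (e.g. rows produced by spectral_profile above).
    Returns the per-block difference measures and the block numbers at which a
    new auditory event is deemed to start.
    """
    profiles = np.asarray(profiles, dtype=float)
    diffs = np.abs(np.diff(profiles, axis=0)).sum(axis=1)    # one positive number per block pair
    boundary_blocks = np.flatnonzero(diffs > threshold) + 1  # blocks where the spectrum changed enough
    return diffs, boundary_blocks
```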
  • with a 512-sample block size at a 44.1 kHz sampling rate, the auditory scene analysis function outputs approximately 86 values a second (44,100/512 ≈ 86).
  • the array B(q) may be stored as a signature, such that, in its basic form, without the optional dominant subband frequency information of step 5-5, the audio signal's signature is an array B(q) representing a string of auditory event boundaries.
  • an optional additional step in the processing of FIG. 5 is to extract information from the audio signal denoting the dominant frequency “subband” of the block (conversion of the data in each block to the frequency domain results in information divided into frequency subbands).
  • This block-based information may be converted to auditory-event based information, so that the dominant frequency subband is identified for every auditory event.
  • Such information for every auditory event provides information regarding the auditory event itself and may be useful in providing a more detailed and unique reduced-information representation of the audio signal.
  • the employment of dominant subband information is more appropriate in the case of determining auditory events of full bandwidth audio rather than cases in which the audio is broken into subbands and auditory events are determined for each subband.
  • the dominant (largest amplitude) subband may be chosen from a plurality of subbands, three or four, for example, that are within the range or band of frequencies where the human ear is most sensitive. Alternatively, other criteria may be used to select the subbands.
  • the spectrum may be divided, for example, into three subbands. Useful frequency ranges for the subbands are (these particular frequencies are not critical):
  • the square of the magnitude spectrum (or the power magnitude spectrum) is summed for each subband. This resulting sum for each subband is calculated and the largest is chosen.
  • the subbands may also be weighted prior to selecting the largest. The weighting may take the form of dividing the sum for each subband by the number of spectral values in the subband, or alternatively may take the form of an addition or multiplication to emphasize the importance of a band over another. This can be useful where some subbands have more energy on average than other subbands but are less perceptually important.
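  • The optional dominant-subband identification of step 5-5 might be sketched as below. The document does not fix the subband frequency ranges (and notes they are not critical), so the three band edges used here are purely illustrative placeholders; dividing each band's summed power by its bin count is one of the weightings suggested above.

```python
import numpy as np

def dominant_subband(block, fs=44100.0,
                     bands=((300.0, 1000.0), (1000.0, 3000.0), (3000.0, 8000.0))):
    """Step 5-5 sketch: pick the dominant (largest-power) frequency subband.

    The squared magnitude spectrum of the block is summed within each subband,
    each sum is weighted by dividing by the number of bins in that band, and
    the index of the largest weighted sum is returned.  The band edges given
    as defaults are illustrative assumptions only.
    """
    m = len(block)
    power = np.abs(np.fft.rfft(block * np.hanning(m))) ** 2   # power spectrum of the block
    freqs = np.fft.rfftfreq(m, d=1.0 / fs)
    scores = []
    for lo, hi in bands:
        sel = (freqs >= lo) & (freqs < hi)
        scores.append(power[sel].sum() / max(sel.sum(), 1))   # weight: divide by bin count
    return int(np.argmax(scores))                             # index of the dominant subband
```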
  • the array DS(q) is formatted and stored in the signature along with the array B(q).
  • the audio signal's signature is two arrays B(q) and DS(q), representing, respectively, a string of auditory event boundaries and a dominant frequency subband within each block, from which the dominant frequency subband for each auditory event may be determined if desired.
  • the two arrays could have the following values (for a case in which there are three possible dominant subbands).
  • the dominant subband remains the same within each auditory event, as shown in this example, or has an average value if it is not uniform for all blocks within the event.
  • a dominant subband may be determined for each auditory event and the array DS(q) may be modified to provide that the same dominant subband is assigned to each block within an event.
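  • One possible reading of the point just above, sketched under the same assumptions as the earlier snippets: once the event boundaries B(q) and the per-block dominant subbands DS(q) are known, each block within an event is reassigned the subband that occurs most often among that event's blocks (used here as a stand-in for the "average value" mentioned above).

```python
import numpy as np

def per_event_dominant_subband(ds, boundary_blocks, num_blocks):
    """Assign one dominant subband to every block of each auditory event (sketch).

    `ds` is the per-block dominant-subband array DS(q); `boundary_blocks` holds
    the block numbers at which events start (e.g. from event_boundaries above).
    """
    ds = np.asarray(ds).copy()
    edges = [0] + sorted(int(b) for b in boundary_blocks) + [num_blocks]
    for start, end in zip(edges[:-1], edges[1:]):
        if end > start:
            values, counts = np.unique(ds[start:end], return_counts=True)
            ds[start:end] = values[np.argmax(counts)]    # most common subband within the event
    return ds
```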
  • FIG. 5 may be represented more generally by the equivalent arrangements of FIGS. 7, 8 and 9.
  • an audio signal is applied in parallel to an “Identify Auditory Events” function or step 7-1 that divides the audio signal into auditory events, each of which tends to be perceived as separate and distinct, and to an optional “Identify Characteristics of Auditory Events” function or step 7-2.
  • the process of FIG. 5 may be employed to divide the audio signal into auditory events or some other suitable process may be employed.
  • the auditory event information, which may be an identification of auditory event boundaries, determined by function or step 7-1 is stored and formatted, as desired, by a “Store and Format” function or step 7-3.
  • the optional “Identify Characteristics” function or step 7-2 also receives the auditory event information.
  • the “Identify Characteristics” function or step 7-2 may characterize some or all of the auditory events by one or more characteristics. Such characteristics may include an identification of the dominant subband of the auditory event, as described in connection with the process of FIG. 5.
  • the characteristics may also include one or more of the MPEG-7 audio descriptors, including, for example, a measure of power of the auditory event, a measure of amplitude of the auditory event, a measure of the spectral flatness of the auditory event, and whether the auditory event is substantially silent.
  • the characteristics may also include other characteristics such as whether the auditory event includes a transient. Characteristics for one or more auditory events are also received by the “Store and Format” function or step 7-3 and stored and formatted along with the auditory event information.
  • Alternatives to the arrangement of FIG. 7 are shown in FIGS. 8 and 9.
  • in FIG. 8, the audio input signal is not applied directly to the “Identify Characteristics” function or step 8-3, but that function does receive information from the “Identify Auditory Events” function or step 8-1.
  • the arrangement of FIG. 5 is a specific example of such an arrangement.
  • in FIG. 9, the functions or steps 9-1, 9-2 and 9-3 are arranged in series.
  • the present invention and its various aspects may be implemented as software functions performed in digital signal processors, programmed general-purpose digital computers, and/or special purpose digital computers. Interfaces between analog and digital signal streams may be performed in appropriate hardware and/or as functions in software and/or firmware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

In one aspect, an audio processing apparatus is disclosed. The apparatus includes an audio decoder, a filterbank, and a processor. The audio decoder decodes an encoded audio signal to obtain a time-domain audio signal, the encoded audio signal including a plurality of spectral components. The filterbank splits the time-domain audio signal to obtain a plurality of complex-valued subband samples in a first frequency region. The processor generates a plurality of subband samples in a second frequency region based at least in part on the complex-valued subband samples in the first frequency region, adaptively groups at least some of the plurality of subband samples in the second frequency region with an adaptive time resolution or an adaptive frequency resolution, and determines a spectral profile of at least some of the subband samples in the second frequency region based on the groups.

Description

    TECHNICAL FIELD
  • The present invention pertains to the field of psychoacoustic processing of audio signals. In particular, the invention relates to aspects of dividing or segmenting audio signals into “auditory events,” each of which tends to be perceived as separate and distinct, and to aspects of generating reduced-information representations of audio signals based on auditory events and, optionally, also based on the characteristics or features of audio signals within such auditory events. Auditory events may be useful as defining the MPEG-7 “Audio Segments” as proposed by the “ISO/IEC JTC 1/SC 29/WG 11.”
  • BACKGROUND ART
  • The division of sounds into units or segments perceived as separate and distinct is sometimes referred to as “auditory event analysis” or “auditory scene analysis” (“ASA”). An extensive discussion of auditory scene analysis is set forth by Albert S. Bregman in his book Auditory Scene Analysis—The Perceptual Organization of Sound (Massachusetts Institute of Technology, 1991; Fourth printing, 2001, Second MIT Press paperback edition). In addition, U.S. Pat. No. 6,002,776 to Bhadkamkar et al., Dec. 14, 1999, cites publications dating back to 1976 as “prior art work related to sound separation by auditory scene analysis.” However, the Bhadkamkar et al. patent discourages the practical use of auditory scene analysis, concluding that “[t]echniques involving auditory scene analysis, although interesting from a scientific point of view as models of human auditory processing, are currently far too computationally demanding and specialized to be considered practical techniques for sound separation until fundamental progress is made.”
  • There are many different methods for extracting characteristics or features from audio. Provided the features or characteristics are suitably defined, their extraction can be performed using automated processes. For example “ISO/IEC JTC 1/SC 29/WG 11” (MPEG) is currently standardizing a variety of audio descriptors as part of the MPEG-7 standard. A common shortcoming of such methods is that they ignore auditory scene analysis. Such methods seek to measure, periodically, certain “classical” signal processing parameters such as pitch, amplitude, power, harmonic structure and spectral flatness. Such parameters, while providing useful information, do not analyze and characterize audio signals into elements perceived as separate and distinct according to human cognition. However, MPEG-7 descriptors may be useful in characterizing an Auditory Event identified in accordance with aspects of the present invention.
  • DISCLOSURE OF THE INVENTION
  • In accordance with aspects of the present invention, a computationally efficient process for dividing audio into temporal segments or “auditory events” that tend to be perceived as separate and distinct is provided. The locations of the boundaries of these auditory events (where they begin and end with respect to time) provide valuable information that can be used to describe an audio signal. The locations of auditory event boundaries can be assembled to generate a reduced-information representation, “signature,” or “fingerprint” of an audio signal that can be stored for use, for example, in comparative analysis with other similarly generated signatures (as, for example, in a database of known works).
  • Bregman notes that “[w]e hear discrete units when the sound changes abruptly in timbre, pitch, loudness, or (to a lesser extent) location in space.” (Auditory Scene Analysis—The Perceptual Organization of Sound, supra at page 469). Bregman also discusses the perception of multiple simultaneous sound streams when, for example, they are separated in frequency.
  • In order to detect changes in timbre and pitch and certain changes in amplitude, the audio event detection process according to an aspect of the present invention detects changes in spectral composition with respect to time. When applied to a multichannel sound arrangement in which the channels represent directions in space, the process according to an aspect of the present invention also detects auditory events that result from changes in spatial location with respect to time. Optionally, according to a further aspect of the present invention, the process may also detect changes in amplitude with respect to time that would not be detected by detecting changes in spectral composition with respect to time.
  • In its least computationally demanding implementation, the process divides audio into time segments by analyzing the entire frequency band (full bandwidth audio) or substantially the entire frequency band (in practical implementations, band limiting filtering at the ends of the spectrum is often employed) and giving the greatest weight to the loudest audio signal components. This approach takes advantage of a psychoacoustic phenomenon in which at smaller time scales (20 milliseconds (ms) and less) the ear may tend to focus on a single auditory event at a given time. This implies that while multiple events may be occurring at the same time, one component tends to be perceptually most prominent and may be processed individually as though it were the only event taking place. Taking advantage of this effect also allows the auditory event detection to scale with the complexity of the audio being processed. For example, if the input audio signal being processed is a solo instrument, the audio events that are identified will likely be the individual notes being played. Similarly for an input voice signal, the individual components of speech, the vowels and consonants for example, will likely be identified as individual audio elements. As the complexity of the audio increases, such as music with a drumbeat or multiple instruments and voice, the auditory event detection identifies the “most prominent” (i.e., the loudest) audio element at any given moment. Alternatively, the most prominent audio element may be determined by taking hearing threshold and frequency response into consideration.
  • While the locations of the auditory event boundaries computed from full-bandwidth audio provide useful information related to the content of an audio signal, it might be desired to provide additional information further describing the content of an auditory event for use in audio signal analysis. For example, an audio signal could be analyzed across two or more frequency subbands and the location of frequency subband auditory events determined and used to convey more detailed information about the nature of the content of an auditory event. Such detailed information could provide additional information unavailable from wideband analysis.
  • Thus, optionally, according to further aspects of the present invention, at the expense of greater computational complexity, the process may also take into consideration changes in spectral composition with respect to time in discrete frequency subbands (fixed or dynamically determined or both fixed and dynamically determined subbands) rather than the full bandwidth. This alternative approach would take into account more than one audio stream in different frequency subbands rather than assuming that only a single stream is perceptible at a particular time.
  • Even a simple and computationally efficient process according to aspects of the present invention has been found usefully to identify auditory events.
  • An auditory event detecting process according to the present invention may be implemented by dividing a time domain audio waveform into time intervals or blocks and then converting the data in each block to the frequency domain, using either a filter bank or a time-frequency transformation, such as the FFT. The amplitude of the spectral content of each block may be normalized in order to eliminate or reduce the effect of amplitude changes. Each resulting frequency domain representation provides an indication of the spectral content (amplitude as a function of frequency) of the audio in the particular block. The spectral content of successive blocks is compared and changes greater than a threshold may be taken to indicate the temporal start or temporal end of an auditory event. FIG. 1 shows an idealized waveform of a single channel of orchestral music illustrating auditory events. The spectral changes that occur as a new note is played trigger the new auditory events 2 and 3 at samples 2048 and 2560, respectively.
  • As mentioned above, in order to minimize the computational complexity, only a single band of frequencies of the time domain audio waveform may be processed, preferably either the entire frequency band of the spectrum (which may be about 50 Hz to 15 kHz in the case of an average quality music system) or substantially the entire frequency band (for example, a band defining filter may exclude the high and low frequency extremes).
  • Preferably, the frequency domain data is normalized, as is described below. The degree to which the frequency domain data needs to be normalized gives an indication of amplitude. Hence, if a change in this degree exceeds a predetermined threshold, that too may be taken to indicate an event boundary. Event start and end points resulting from spectral changes and from amplitude changes may be ORed together so that event boundaries resulting from either type of change are identified.
  • In the case of multiple audio channels, each representing a direction in space, each channel may be treated independently and the resulting event boundaries for all channels may then be ORed together. Thus, for example, an auditory event that abruptly switches directions will likely result in an “end of event” boundary in one channel and a “start of event” boundary in another channel. When ORed together, two events will be identified. Thus, the auditory event detection process of the present invention is capable of detecting auditory events based on spectral (timbre and pitch), amplitude and directional changes.
  • As mentioned above, as a further option, but at the expense of greater computational complexity, instead of processing the spectral content of the time domain waveform in a single band of frequencies, the spectrum of the time domain waveform prior to frequency domain conversion may be divided into two or more frequency bands. Each of the frequency bands may then be converted to the frequency domain and processed as though it were an independent channel in the manner described above. The resulting event boundaries may then be ORed together to define the event boundaries for that channel. The multiple frequency bands may be fixed, adaptive, or a combination of fixed and adaptive. Tracking filter techniques employed in audio noise reduction and other arts, for example, may be employed to define adaptive frequency bands (e.g., dominant simultaneous sine waves at 800 Hz and 2 kHz could result in two adaptively-determined bands centered on those two frequencies). Although filtering the data before conversion to the frequency domain is workable, more optimally the full bandwidth audio is converted to the frequency domain and then only those frequency subband components of interest are processed. In the case of converting the full bandwidth audio using the FFT, only sub-bins corresponding to frequency subbands of interest would be processed together.
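  • As a rough sketch of this option, the full-bandwidth FFT can be computed once per block and the block-to-block difference measurement restricted to the sub-bins of each frequency subband of interest. The band edges, the per-bin threshold and all names below are illustrative assumptions rather than values specified in this document.

```python
import numpy as np

def subband_event_boundaries(samples, fs=44100.0, block=512,
                             bands=((0.0, 1000.0), (1000.0, 4000.0), (4000.0, 12000.0)),
                             threshold_per_bin=5.0):
    """Detect auditory event boundaries separately in each frequency subband (sketch).

    One FFT per 512-sample block is shared by all subbands; only the sub-bins
    falling inside a given band are compared from block to block.  Returns a
    dict mapping each band to the block numbers of its event boundaries.
    """
    num_blocks = len(samples) // block
    window = np.hanning(block)
    profiles = []
    for q in range(num_blocks):                               # full-bandwidth log profile, computed once
        mag = np.abs(np.fft.rfft(samples[q * block:(q + 1) * block] * window))
        mag /= max(mag.max(), 1e-12)
        profiles.append(20.0 * np.log10(np.maximum(mag, 1e-12)))
    profiles = np.array(profiles)
    freqs = np.fft.rfftfreq(block, d=1.0 / fs)

    boundaries = {}
    for lo, hi in bands:
        sel = (freqs >= lo) & (freqs < hi)                    # sub-bins of interest for this band
        diffs = np.abs(np.diff(profiles[:, sel], axis=0)).sum(axis=1)
        boundaries[(lo, hi)] = np.flatnonzero(diffs > threshold_per_bin * sel.sum()) + 1
    return boundaries
```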
  • Alternatively, in the case of multiple subbands or multiple channels, instead of ORing together auditory event boundaries, which results in some loss of information, the event boundary information may be preserved.
  • As shown in FIG. 2, the frequency domain magnitude of a digital audio signal contains useful frequency information out to a frequency of Fs/2 where Fs is the sampling frequency of the digital audio signal. By dividing the frequency spectrum of the audio signal into two or more subbands (not necessarily of the same bandwidth and not necessarily up to a frequency of Fs/2 Hz), the frequency subbands may be analyzed over time in a manner similar to a full bandwidth auditory event detection method.
  • The subband auditory event information provides additional information about an audio signal that more accurately describes the signal and differentiates it from other audio signals. This enhanced differentiating capability may be useful if the audio signature information is to be used to identify matching audio signals from a large number of audio signatures. For example, as shown in FIG. 2, a frequency subband auditory event analysis (with an auditory event boundary resolution of 512 samples) has found multiple subband auditory events starting, variously, at samples 1024 and 1536 and ending, variously, at samples 2560, 3072 and 3584. It is unlikely that this level of signal detail would be available from a single, wideband auditory scene analysis.
  • The subband auditory event information may be used to derive an auditory event signature for each subband. While this would increase the size of the audio signal's signature and possibly increase the computation time required to compare multiple signatures, it could also greatly reduce the probability of falsely classifying two signatures as being the same. A tradeoff between signature size, computational complexity and signal accuracy could be made depending upon the application. Alternatively, rather than providing a signature for each subband, the auditory events may be ORed together to provide a single set of “combined” auditory event boundaries (at samples 1024, 1536, 2560, 3072 and 3584). Although this would result in some loss of information, it provides a single set of event boundaries, representing combined auditory events, that provides more information than the information of a single subband or a wideband analysis.
  • While the frequency subband auditory event information on its own provides useful signal information, the relationship between the locations of subband auditory events may be analyzed and used to provide more insight into the nature of an audio signal. For example, the location and strength of the subband auditory events may be used as an indication of timbre (frequency content) of the audio signal. Auditory events that appear in subbands that are harmonically related to one another would also provide useful insight regarding the harmonic nature of the audio. The presence of auditory events in a single subband may also provide information as to the tone-like nature of an audio signal. Analyzing the relationship of frequency subband auditory events across multiple channels can also provide spatial content information.
  • In the case of analyzing multiple audio channels, each channel is analyzed independently and the auditory event boundary information of each may either be retained separately or be combined to provide combined auditory event information. This is somewhat analogous to the case of multiple subbands. Combined auditory events may be better understood by reference to FIG. 3 that shows the auditory scene analysis results for a two channel audio signal. FIG. 3 shows time concurrent segments of audio data in two channels. ASA processing of the audio in a first channel, the top waveform of FIG. 3, identifies auditory event boundaries at samples that are multiples of the 512 sample spectral-profile block size, 1024 and 1536 samples in this example. The lower waveform of FIG. 3 is a second channel and ASA processing results in event boundaries at samples that are also multiples of the spectral-profile block size, at samples 1024, 2048 and 3072 in this example. A combined auditory event analysis for both channels results in combined auditory event segments with boundaries at samples 1024, 1536, 2048 and 3072 (the auditory event boundaries of the channels are “ORed” together). It will be appreciated that in practice the accuracy of auditory event boundaries depends on the size of the spectral-profile block size (N is 512 samples in this example) because event boundaries can occur only at block boundaries. Nevertheless, a block size of 512 samples has been found to determine auditory event boundaries with sufficient accuracy as to provide satisfactory results.
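  • Combining (“ORing”) event boundaries across channels amounts to taking the set union of the per-channel boundary locations. The short example below reproduces the two-channel case of FIG. 3 described above; the helper name is illustrative.

```python
def combine_event_boundaries(*per_channel_boundaries):
    """OR the auditory event boundaries of several channels (or subbands)."""
    combined = set()
    for boundaries in per_channel_boundaries:
        combined |= set(boundaries)
    return sorted(combined)

# Two-channel example of FIG. 3: the combined boundaries are the union of both channels.
channel_1 = [1024, 1536]          # event boundaries found in the first channel
channel_2 = [1024, 2048, 3072]    # event boundaries found in the second channel
print(combine_event_boundaries(channel_1, channel_2))   # -> [1024, 1536, 2048, 3072]
```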
  • FIG. 3A shows three auditory events. These events include the (1) quiet portion of audio before the transient, (2) the transient event, and (3) the echo/sustain portion of the audio transient. A speech signal is represented in FIG. 3B having a predominantly high-frequency sibilance event, and events as the sibilance evolves or “morphs” into the vowel, the first half of the vowel, and the second half of the vowel.
  • FIG. 3 also shows the combined event boundaries when the auditory event data is shared across the time concurrent data blocks of two channels. Such event segmentation provides five combined auditory event regions (the event boundaries are ORed together).
  • FIG. 4 shows an example of a four channel input signal. Channels 1 and 4 each contain three auditory events and channels 2 and 3 each contain two auditory events. The combined auditory event boundaries for the concurrent data blocks across all four channels are located at sample numbers 512, 1024, 1536, 2560 and 3072 as indicated at the bottom of FIG. 4.
  • In principle, the processed audio may be digital or analog and need not be divided into blocks. However, in practical applications, the input signals likely are one or more channels of digital audio represented by samples in which consecutive samples in each channel are divided into blocks of, for example, 4096 samples (as in the examples of FIGS. 1, 3 and 4, above). In practical embodiments set forth herein, auditory events are determined by examining blocks of audio sample data preferably representing approximately 20 ms of audio or less, which is believed to be the shortest auditory event recognizable by the human ear. Thus, in practice, auditory events are likely to be determined by examining blocks of, for example, 512 samples, which corresponds to about 11.6 ms of input audio at a sampling rate of 44.1 kHz, within larger blocks of audio sample data. However, throughout this document reference is made to “blocks” rather than “subblocks” when referring to the examination of segments of audio data for the purpose of detecting auditory event boundaries. Because the audio sample data is examined in blocks, in practice, the auditory event temporal start and stop point boundaries necessarily will each coincide with block boundaries. There is a trade-off between real-time processing requirements (as larger blocks require less processing overhead) and resolution of event location (smaller blocks provide more detailed information on the location of auditory events).
  • In some aspects, an audio processing apparatus is disclosed. The apparatus includes an audio decoder, a filterbank, and a processor. The audio decoder decodes an encoded audio signal to obtain a time-domain audio signal, the encoded audio signal including a plurality of spectral components. The filterbank splits the time-domain audio signal to obtain a plurality of complex-valued subband samples in a first frequency region. The processor generates a plurality of subband samples in a second frequency region based at least in part on the complex-valued subband samples in the first frequency region, adaptively groups at least some of the plurality of subband samples in the second frequency region with an adaptive time resolution or adaptive frequency resolution, and determines a spectral profile of at least some of the subband samples in the second frequency region based on the adaptive grouping.
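  • The apparatus is described here only at the block-diagram level, so the following is no more than a structural sketch under stated assumptions: the decoder and analysis filterbank are placeholders, the "second frequency region" samples are produced by a simple copy of first-region subband samples, and the grouping rule (finer time resolution when the frame energy changes abruptly, finer frequency resolution otherwise) is one plausible illustration of adaptive time or frequency resolution, not the method claimed in this document.

```python
import numpy as np

def generate_second_region(first_region):
    """Placeholder: derive second-frequency-region subband samples from the
    first region (a plain copy here; the actual generation method is not
    specified at this level of description)."""
    return np.asarray(first_region, dtype=complex).copy()

def adaptive_spectral_profile(subbands, transient_threshold=4.0):
    """Group (frames x bands) complex subband samples with adaptive resolution
    and return one mean-power value per group (illustrative rule only)."""
    power = np.abs(subbands) ** 2                       # per-sample power, shape (frames, bands)
    frame_energy = power.sum(axis=1) + 1e-12
    ratios = frame_energy[1:] / frame_energy[:-1]
    transient = ratios.size > 0 and ratios.max() > transient_threshold

    if transient:   # adaptive time resolution: short groups of 2 frames, all bands pooled
        groups = [power[t:t + 2, :].mean() for t in range(0, power.shape[0], 2)]
    else:           # adaptive frequency resolution: one long time group, one value per band
        groups = [power[:, k].mean() for k in range(power.shape[1])]
    return np.array(groups)

# Hypothetical data flow (placeholders, not a real decoder/filterbank API):
#   first_region = analysis_filterbank(audio_decoder(encoded_bitstream))
#   profile = adaptive_spectral_profile(generate_second_region(first_region))
```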
  • Other aspects of the invention will be appreciated and understood as the detailed description of the invention is read and understood.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is an idealized waveform of a single channel of orchestral music illustrating auditory events.
  • FIG. 2 is an idealized conceptual schematic diagram illustrating the concept of dividing full bandwidth audio into frequency subbands in order to identify subband auditory events. The horizontal scale is samples and the vertical scale is frequency.
  • FIG. 3 is a series of idealized waveforms in two audio channels, showing audio events in each channel and combined audio events across the two channels.
  • FIG. 3A shows three auditory events, including the quiet portion of audio before the transient, the transient event, and the echo/sustain portion of the audio transient.
  • FIG. 3B represents a speech signal having a predominantly high-frequency sibilance event, and events as the sibilance evolves or “morphs” into the vowel, the first half of the vowel, and the second half of the vowel.
  • FIG. 4 is a series of idealized waveforms in four audio channels showing audio events in each channel and combined audio events across the four channels.
  • FIG. 5 is a flow chart showing the extraction of audio event locations and the optional extraction of dominant subbands from an audio signal in accordance with the present invention.
  • FIG. 6 is a conceptual schematic representation depicting spectral analysis in accordance with the present invention.
  • FIGS. 7-9 are flow charts showing more generally three alternative arrangements equivalent to the flow chart of FIG. 5.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • In accordance with an embodiment of one aspect of the present invention, auditory scene analysis is composed of three general processing steps as shown in a portion of FIG. 5. The first step 5-1 (“Perform Spectral Analysis”) takes a time-domain audio signal, divides it into blocks and calculates a spectral profile or spectral content for each of the blocks. Spectral analysis transforms the audio signal into the short-term frequency domain. This can be performed using any filterbank, either based on transforms or banks of bandpass filters, and in either linear or warped frequency space (such as the Bark scale or critical band, which better approximate the characteristics of the human ear). With any filterbank there exists a tradeoff between time and frequency. Greater time resolution, and hence shorter time intervals, leads to lower frequency resolution. Greater frequency resolution, and hence narrower subbands, leads to longer time intervals.
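  • As a rough illustration of this trade-off, the short sketch below prints the approximate frequency bin spacing and block duration for several transform sizes, assuming a 44.1 kHz sampling rate; the specific sizes shown are examples, not values prescribed by the method.

```python
# Time/frequency trade-off for a transform-based filterbank: with an M-point
# transform at sampling rate fs, bin spacing is fs/M and each block spans M/fs seconds.
FS = 44100
for M in (256, 512, 1024, 2048):
    print(f"M={M:5d}: ~{FS / M:7.1f} Hz per bin, ~{M / FS * 1000:5.1f} ms per block")
```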
  • The first step, illustrated conceptually in FIG. 6, calculates the spectral content of successive time segments of the audio signal. In a practical embodiment, the ASA block size is 512 samples of the input audio signal. In the second step 5-2, the differences in spectral content from block to block are determined (“Perform spectral profile difference measurements”). Thus, the second step calculates the difference in spectral content between successive time segments of the audio signal. As discussed above, a powerful indicator of the beginning or end of a perceived auditory event is believed to be a change in spectral content. In the third step 5-3 (“Identify location of auditory event boundaries”), when the spectral difference between one spectral-profile block and the next is greater than a threshold, the block boundary is taken to be an auditory event boundary. The audio segment between consecutive boundaries constitutes an auditory event. Thus, the third step sets an auditory event boundary between successive time segments when the difference in the spectral profile content between such successive time segments exceeds a threshold, thus defining auditory events. In this embodiment, auditory event boundaries define auditory events having a length that is an integral multiple of spectral profile blocks with a minimum length of one spectral profile block (512 samples in this example). In principle, event boundaries need not be so limited. As an alternative to the practical embodiments discussed herein, the input block size may vary, for example, so as to be essentially the size of an auditory event.
  • The locations of event boundaries may be stored as a reduced-information characterization or “signature” and formatted as desired, as shown in step 5-4. An optional process step 5-5 (“Identify dominant subband”) uses the spectral analysis of step 5-1 to identify a dominant frequency subband that may also be stored as part of the signature. The dominant subband information may be combined with the auditory event boundary information in order to define a feature of each auditory event.
  • Either overlapping or non-overlapping segments of the audio may be windowed and used to compute spectral profiles of the input audio. Overlap results in finer resolution as to the location of auditory events and also makes it less likely that an event, such as a transient, will be missed. However, overlap also increases computational complexity. Thus, overlap may be omitted. FIG. 6 shows a conceptual representation of non-overlapping 512-sample blocks being windowed and transformed into the frequency domain by the Discrete Fourier Transform (DFT). Each block may be windowed and transformed into the frequency domain, such as by using the DFT, preferably implemented as a Fast Fourier Transform (FFT) for speed.
  • The following variables may be used to compute the spectral profile of the input block:
      • N=number of samples in the input signal
      • M=number of windowed samples in a block used to compute spectral profile
      • P=number of samples of spectral computation overlap
      • Q=number of spectral windows/regions computed
  • In general, any integer numbers may be used for the variables above. However, the implementation will be more efficient if M is set equal to a power of 2 so that standard FFTs may be used for the spectral profile calculations. In addition, if N, M, and P are chosen such that Q is an integer number, this will avoid under-running or over-running audio at the end of the N samples (a sketch following the parameter list below illustrates this relationship). In a practical embodiment of the auditory scene analysis process, the parameters listed may be set to:
      • M=512 samples (or 11.6 ms at 44.1 kHz)
      • P=0 samples (no overlap)
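  • The relationship among N, M, P and Q is not spelled out above; the sketch below assumes the standard framing relation for contiguous windowed blocks with hop size M−P, which is one plausible reading of the requirement that Q be an integer.

```python
# Hedged sketch: number of spectral windows Q for N samples, block size M and
# overlap P, assuming contiguous blocks with hop size M - P. The assertion mirrors
# the requirement that N, M and P be chosen so that Q comes out as an integer.
def num_blocks(N, M=512, P=0):
    hop = M - P
    assert (N - M) % hop == 0, "choose N, M, P so that Q is an integer"
    return (N - M) // hop + 1

print(num_blocks(N=4096))           # 8 blocks of 512 samples, no overlap
print(num_blocks(N=4096, P=256))    # 15 blocks with 50% overlap
```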
  • The above-listed values were determined experimentally and were found generally to identify with sufficient accuracy the location and duration of auditory events. However, setting the value of P to 256 samples (50% overlap) rather than zero samples (no overlap) has been found to be useful in identifying some hard-to-find events. While many different types of windows may be used to minimize spectral artifacts due to windowing, the window used in the spectral profile calculations is an M-point Hanning, Kaiser-Bessel or other suitable, preferably non-rectangular, window. The above-indicated values and a Hanning window type were selected after extensive experimental analysis as they have been shown to provide excellent results across a wide range of audio material. Non-rectangular windowing is preferred for the processing of audio signals with predominantly low frequency content. Rectangular windowing produces spectral artifacts that may cause incorrect detection of events. Unlike certain encoder/decoder (codec) applications where an overall overlap/add process must provide a constant level, such a constraint does not apply here and the window may be chosen for characteristics such as its time/frequency resolution and stop-band rejection.
  • In step 5-1 (FIG. 5), the spectrum of each M-sample block may be computed by windowing the data by an M-point Hanning, Kaiser-Bessel or other suitable window, converting to the frequency domain using an M-point Fast Fourier Transform, and calculating the magnitude of the complex FFT coefficients. The resultant data is normalized so that the largest magnitude is set to unity, and the normalized array of M numbers is converted to the log domain. The array need not be converted to the log domain, but the conversion simplifies the calculation of the difference measure in step 5-2. Furthermore, the log domain more closely matches the nature of the human auditory system. The resulting log domain values have a range of minus infinity to zero. In a practical embodiment, a lower limit can be imposed on the range of values; the limit may be fixed, for example −60 dB, or be frequency-dependent to reflect the lower audibility of quiet sounds at low and very high frequencies. (Note that it would be possible to reduce the size of the array to M/2 in that the FFT represents negative as well as positive frequencies).
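  • A minimal Python sketch of this spectral-profile computation, assuming a Hanning window and a fixed −60 dB floor, might look as follows; numpy stands in for the windowing and FFT, and the sketch is illustrative rather than the disclosed implementation.

```python
import numpy as np

# Step 5-1 sketch: window an M-sample block, take the FFT magnitude, normalize so
# the largest magnitude is unity, convert to dB and impose a lower limit.
def spectral_profile(block, floor_db=-60.0):
    M = len(block)
    windowed = block * np.hanning(M)             # M-point Hanning window
    mag = np.abs(np.fft.fft(windowed))           # full M-point magnitude spectrum
    mag /= max(mag.max(), 1e-12)                 # largest magnitude set to unity
    log_mag = 20.0 * np.log10(np.maximum(mag, 1e-12))
    return np.maximum(log_mag, floor_db)         # fixed floor; could be frequency-dependent
```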
  • Step 5-2 calculates a measure of the difference between the spectra of adjacent blocks. For each block, each of the M (log) spectral coefficients from step 5-1 is subtracted from the corresponding coefficient for the preceding block, and the magnitude of the difference is calculated (the sign is ignored). These M differences are then summed into a single number. Hence, for a contiguous time segment of audio containing Q blocks, the result is an array of Q positive numbers, one for each block. The greater the number, the more a block differs in spectrum from the preceding block. This difference measure may also be expressed as an average difference per spectral coefficient by dividing the difference measure by the number of spectral coefficients used in the sum (in this case M coefficients).
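  • A hedged sketch of this difference measure, assuming the Q spectral profiles are stacked in a (Q, M) array, could be:

```python
import numpy as np

# Step 5-2 sketch: sum of absolute differences between the log-domain spectral
# profiles of adjacent blocks. The first block has no predecessor, so its
# difference is set to zero here as a simplifying assumption.
def spectral_differences(profiles):
    profiles = np.asarray(profiles)                       # shape (Q, M)
    diffs = np.abs(np.diff(profiles, axis=0)).sum(axis=1) # Q-1 block-to-block sums
    return np.concatenate(([0.0], diffs))                 # Q values, one per block
```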
  • Step 5-3 identifies the locations of auditory event boundaries by comparing the array of difference measures from step 5-2 against a threshold value. When a difference measure exceeds the threshold, the change in spectrum is deemed sufficient to signal a new event and the block number of the change is recorded as an event boundary. For the values of M and P given above and for log domain values (in step 5-1) expressed in units of dB, the threshold may be set equal to 2500 if the whole magnitude FFT (including the mirrored part) is compared, or 1250 if half the FFT is compared (as noted above, the FFT represents negative as well as positive frequencies—for the magnitude of the FFT, one is the mirror image of the other). This value was chosen experimentally and provides good auditory event boundary detection. This parameter value may be changed to reduce (increase the threshold) or increase (decrease the threshold) the detection of events.
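  • Continuing the sketch, boundary detection then reduces to a comparison against the threshold; the 0/1 output anticipates the boundary array B(q) described below.

```python
import numpy as np

# Step 5-3 sketch: mark a boundary wherever the block-to-block difference measure
# exceeds the threshold (2500 for the full magnitude FFT in dB units, 1250 if only
# half the FFT is compared).
def event_boundaries(diffs, threshold=2500.0):
    return (np.asarray(diffs) > threshold).astype(int)   # B(q): 1 at a boundary, else 0
```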
  • For an audio signal consisting of Q blocks (of size M samples), the output of step 5-3 of FIG. 5 may be stored and formatted in step 5-4 as an array B(q) of information representing the location of auditory event boundaries, where q=0, 1, . . . , Q−1. For a block size of M=512 samples, an overlap of P=0 samples and a signal-sampling rate of 44.1 kHz, the auditory scene analysis function outputs approximately 86 values per second. The array B(q) may be stored as a signature, such that, in its basic form, without the optional dominant subband frequency information of step 5-5, the audio signal's signature is an array B(q) representing a string of auditory event boundaries.
  • Identify Dominant Subband (Optional)
  • For each block, an optional additional step in the processing of FIG. 5 is to extract information from the audio signal denoting the dominant frequency “subband” of the block (conversion of the data in each block to the frequency domain results in information divided into frequency subbands). This block-based information may be converted to auditory-event based information, so that the dominant frequency subband is identified for every auditory event. Such information for every auditory event provides information regarding the auditory event itself and may be useful in providing a more detailed and unique reduced-information representation of the audio signal. The employment of dominant subband information is more appropriate in the case of determining auditory events of full bandwidth audio rather than cases in which the audio is broken into subbands and auditory events are determined for each subband.
  • The dominant (largest amplitude) subband may be chosen from a plurality of subbands, three or four, for example, that are within the range or band of frequencies where the human ear is most sensitive. Alternatively, other criteria may be used to select the subbands. The spectrum may be divided, for example, into three subbands. Useful frequency ranges for the subbands are (these particular frequencies are not critical):
      • Subband 1: 300 Hz to 550 Hz
      • Subband 2: 550 Hz to 2000 Hz
      • Subband 3: 2000 Hz to 10,000 Hz
  • To determine the dominant subband, the square of the magnitude spectrum (or the power magnitude spectrum) is summed for each subband, and the subband with the largest sum is chosen. The subbands may also be weighted prior to selecting the largest. The weighting may take the form of dividing the sum for each subband by the number of spectral values in the subband, or alternatively may take the form of an addition or multiplication to emphasize the importance of one band over another. This can be useful where some subbands have more energy on average than other subbands but are less perceptually important.
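  • A possible implementation of this selection, using the example subband edges above and the per-bin-count weighting, is sketched below; the bin-to-frequency mapping and the sampling rate are assumptions made only for illustration.

```python
import numpy as np

# Dominant-subband sketch: sum the power spectrum within each subband, divide by
# the number of bins in the subband (one of the weightings mentioned above), and
# pick the subband with the largest weighted sum.
def dominant_subband(mag_spectrum, fs=44100, edges_hz=(300, 550, 2000, 10000)):
    mag_spectrum = np.asarray(mag_spectrum, dtype=float)
    M = len(mag_spectrum)
    freqs = np.arange(M) * fs / M                 # assumed bin-centre frequencies
    sums = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        band = (freqs >= lo) & (freqs < hi)
        power = np.sum(mag_spectrum[band] ** 2)   # square of the magnitude spectrum
        sums.append(power / max(int(band.sum()), 1))
    return int(np.argmax(sums)) + 1               # subbands numbered 1..3
```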
  • Considering an audio signal consisting of Q blocks, the output of the dominant subband processing is an array DS(q) of information representing the dominant subband in each block (q=0, 1, . . . , Q−1). Preferably, the array DS(q) is formatted and stored in the signature along with the array B(q). Thus, with the optional dominant subband information, the audio signal's signature is two arrays B(q) and DS(q), representing, respectively, a string of auditory event boundaries and a dominant frequency subband within each block, from which the dominant frequency subband for each auditory event may be determined if desired. Thus, in an idealized example, the two arrays could have the following values (for a case in which there are three possible dominant subbands).
  • 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 (Event Boundaries)
    1 1 2 2 2 2 1 1 1 3 3 3 3 3 3 1 1 (Dominant Subbands)
  • In most cases, the dominant subband remains the same within each auditory event, as shown in this example, or has an average value if it is not uniform for all blocks within the event. Thus, a dominant subband may be determined for each auditory event and the array DS(q) may be modified to provide that the same dominant subband is assigned to each block within an event.
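  • One way to carry this out, sketched below, assigns each block the most common dominant subband within its event (using the mode rather than the average mentioned above, purely as a simplifying assumption); B and DS correspond to the arrays of the idealized example.

```python
import numpy as np

# Sketch: replace the block-level dominant subbands DS(q) with one value per
# auditory event, using the boundary array B(q) to delimit events, so every block
# inside an event carries the same dominant subband.
def per_event_dominant_subbands(B, DS):
    B, DS = np.asarray(B), np.asarray(DS)
    event_id = np.cumsum(B)                       # blocks between boundaries share an id
    out = DS.copy()
    for e in np.unique(event_id):
        blocks = DS[event_id == e]
        values, counts = np.unique(blocks, return_counts=True)
        out[event_id == e] = values[np.argmax(counts)]   # most common subband in the event
    return out
```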
  • The process of FIG. 5 may be represented more generally by the equivalent arrangements of FIGS. 7, 8 and 9. In FIG. 7, an audio signal is applied in parallel to an “Identify Auditory Events” function or step 7-1 that divides the audio signal into auditory events, each of which tends to be perceived as separate and distinct, and to an optional “Identify Characteristics of Auditory Events” function or step 7-2. The process of FIG. 5 may be employed to divide the audio signal into auditory events, or some other suitable process may be employed. The auditory event information, which may be an identification of auditory event boundaries, determined by function or step 7-1 is stored and formatted, as desired, by a “Store and Format” function or step 7-3. The optional “Identify Characteristics” function or step 7-2 also receives the auditory event information. The “Identify Characteristics” function or step 7-2 may characterize some or all of the auditory events by one or more characteristics. Such characteristics may include an identification of the dominant subband of the auditory event, as described in connection with the process of FIG. 5. The characteristics may also include one or more of the MPEG-7 audio descriptors, including, for example, a measure of power of the auditory event, a measure of amplitude of the auditory event, a measure of the spectral flatness of the auditory event, and whether the auditory event is substantially silent. The characteristics may also include other characteristics, such as whether the auditory event includes a transient. Characteristics for one or more auditory events are also received by the “Store and Format” function or step 7-3 and stored and formatted along with the auditory event information.
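  • By way of illustration only, two of the characteristics mentioned above (power and spectral flatness) might be computed per event as follows; these are generic formulations chosen for the sketch, not the normative MPEG-7 descriptor definitions.

```python
import numpy as np

# Sketch of an "Identify Characteristics" step: per-event power, spectral flatness
# (geometric mean over arithmetic mean of the magnitude spectrum), and a simple
# silence test against an assumed -60 dB level.
def event_characteristics(samples, mag_spectrum, silence_db=-60.0):
    samples = np.asarray(samples, dtype=float)
    power = float(np.mean(samples ** 2))                            # measure of power
    mag = np.maximum(np.asarray(mag_spectrum, dtype=float), 1e-12)
    flatness = float(np.exp(np.mean(np.log(mag))) / np.mean(mag))   # spectral flatness
    silent = 10.0 * np.log10(max(power, 1e-12)) < silence_db        # substantially silent?
    return {"power": power, "spectral_flatness": flatness, "silent": bool(silent)}
```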
  • Alternatives to the arrangement of FIG. 7 are shown in FIGS. 8 and 9. In FIG. 8, the audio input signal is not applied directly to the “Identify Characteristics” function or step 8-3, but it does receive information from the “Identify Auditory Events” function or step 8-1. The arrangement of FIG. 5 is a specific example of such an arrangement. In FIG. 9, the functions or steps 9-1, 9-2 and 9-3 are arranged in series.
  • The details of this practical embodiment are not critical. Other ways to calculate the spectral content of successive time segments of the audio signal, calculate the differences between successive time segments, and set auditory event boundaries at the respective boundaries between successive time segments when the difference in the spectral profile content between such successive time segments exceeds a threshold may be employed.
  • It should be understood that implementation of other variations and modifications of the invention and its various aspects will be apparent to those skilled in the art, and that the invention is not limited by these specific embodiments described. It is therefore contemplated to cover by the present invention any and all modifications, variations, or equivalents that fall within the true spirit and scope of the basic underlying principles disclosed and claimed herein.
  • The present invention and its various aspects may be implemented as software functions performed in digital signal processors, programmed general-purpose digital computers, and/or special purpose digital computers. Interfaces between analog and digital signal streams may be performed in appropriate hardware and/or as functions in software and/or firmware.

Claims (12)

1. An audio processing apparatus comprising:
an audio decoder that decodes an encoded audio signal to obtain a time-domain audio signal, the encoded audio signal including a plurality of spectral components from at least two channels of audio content;
a filterbank that splits the time-domain audio signal to obtain a plurality of complex-valued subband samples in a first frequency region for each of the at least two channels of audio content; and
one or more processors that for each of the at least two channels of audio content:
generate a plurality of subband samples in a second frequency region based at least in part on the complex-valued subband samples in the first frequency region,
group at least some of the plurality of subband samples in the second frequency region with an adaptive time resolution and an adaptive frequency resolution to obtain an adaptive grouping, and
determine a spectral profile of at least some of the subband samples in the second frequency region based at least in part on the adaptive grouping,
wherein at least one of the audio decoder, the filterbank, and the one or more processors are implemented in hardware, and
wherein a parameter in the encoded audio signal indicates the adaptive frequency resolution for each of the at least two channels of audio content by specifying either a first frequency resolution or a second frequency resolution for each of the at least two channels of audio content and wherein the first frequency resolution is finer than the second frequency resolution.
2. The audio processing apparatus of claim 1 wherein the adaptive grouping is derived from an auditory scene analysis performed in an audio encoder and signaled in the encoded audio signal as one or more parameters.
3. The audio processing apparatus of claim 2 wherein the one or more parameters are used to determine a start time border and an end time border of a time segment.
4. The audio processing apparatus of claim 3 wherein an end time border of a first time segment is a start time border of a second time segment.
5. (canceled)
6. (canceled)
7. The audio processing apparatus of claim 1 wherein the spectral profile includes a spectral envelope.
8. The audio processing apparatus of claim 1 wherein the second frequency region is higher than the first frequency region.
9. (canceled)
10. The audio processing apparatus of claim 1 wherein the number of spectral components varies in time.
11. The audio processing apparatus of claim 1 wherein the audio processing apparatus is implemented as part of an MPEG decoder.
12. The audio processing apparatus of claim 1 wherein the adaptive grouping represents one or more auditory events.
US14/735,635 2001-04-13 2015-06-10 Processing audio signals with adaptive time or frequency resolution Expired - Fee Related US9165562B1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US14/735,635 US9165562B1 (en) 2001-04-13 2015-06-10 Processing audio signals with adaptive time or frequency resolution
US14/842,208 US20150371649A1 (en) 2001-04-13 2015-09-01 Processing Audio Signals with Adaptive Time or Frequency Resolution
US15/264,828 US20170004838A1 (en) 2001-04-13 2016-09-14 Processing Audio Signals with Adaptive Time or Frequency Resolution

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
US83473901A 2001-04-13 2001-04-13
US29382501P 2001-05-25 2001-05-25
US09/922,394 US20020116178A1 (en) 2001-04-13 2001-08-02 High quality time-scaling and pitch-scaling of audio signals
US4564402A 2002-01-11 2002-01-11
US35149802P 2002-01-23 2002-01-23
PCT/US2002/004317 WO2002084645A2 (en) 2001-04-13 2002-02-12 High quality time-scaling and pitch-scaling of audio signals
US10/478,538 US7711123B2 (en) 2001-04-13 2002-02-26 Segmenting audio signals into auditory events
PCT/US2002/005999 WO2002097792A1 (en) 2001-05-25 2002-02-26 Segmenting audio signals into auditory events
US12/724,969 US8488800B2 (en) 2001-04-13 2010-03-16 Segmenting audio signals into auditory events
US13/919,089 US8842844B2 (en) 2001-04-13 2013-06-17 Segmenting audio signals into auditory events
US14/463,812 US10134409B2 (en) 2001-04-13 2014-08-20 Segmenting audio signals into auditory events
US14/735,635 US9165562B1 (en) 2001-04-13 2015-06-10 Processing audio signals with adaptive time or frequency resolution

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/463,812 Continuation US10134409B2 (en) 2001-04-13 2014-08-20 Segmenting audio signals into auditory events

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/842,208 Continuation US20150371649A1 (en) 2001-04-13 2015-09-01 Processing Audio Signals with Adaptive Time or Frequency Resolution

Publications (2)

Publication Number Publication Date
US20150279383A1 true US20150279383A1 (en) 2015-10-01
US9165562B1 US9165562B1 (en) 2015-10-20

Family

ID=32872918

Family Applications (7)

Application Number Title Priority Date Filing Date
US10/478,538 Expired - Lifetime US7711123B2 (en) 2001-04-13 2002-02-26 Segmenting audio signals into auditory events
US12/724,969 Expired - Lifetime US8488800B2 (en) 2001-04-13 2010-03-16 Segmenting audio signals into auditory events
US13/919,089 Expired - Fee Related US8842844B2 (en) 2001-04-13 2013-06-17 Segmenting audio signals into auditory events
US14/463,812 Expired - Fee Related US10134409B2 (en) 2001-04-13 2014-08-20 Segmenting audio signals into auditory events
US14/735,635 Expired - Fee Related US9165562B1 (en) 2001-04-13 2015-06-10 Processing audio signals with adaptive time or frequency resolution
US14/842,208 Abandoned US20150371649A1 (en) 2001-04-13 2015-09-01 Processing Audio Signals with Adaptive Time or Frequency Resolution
US15/264,828 Abandoned US20170004838A1 (en) 2001-04-13 2016-09-14 Processing Audio Signals with Adaptive Time or Frequency Resolution

Family Applications Before (4)

Application Number Title Priority Date Filing Date
US10/478,538 Expired - Lifetime US7711123B2 (en) 2001-04-13 2002-02-26 Segmenting audio signals into auditory events
US12/724,969 Expired - Lifetime US8488800B2 (en) 2001-04-13 2010-03-16 Segmenting audio signals into auditory events
US13/919,089 Expired - Fee Related US8842844B2 (en) 2001-04-13 2013-06-17 Segmenting audio signals into auditory events
US14/463,812 Expired - Fee Related US10134409B2 (en) 2001-04-13 2014-08-20 Segmenting audio signals into auditory events

Family Applications After (2)

Application Number Title Priority Date Filing Date
US14/842,208 Abandoned US20150371649A1 (en) 2001-04-13 2015-09-01 Processing Audio Signals with Adaptive Time or Frequency Resolution
US15/264,828 Abandoned US20170004838A1 (en) 2001-04-13 2016-09-14 Processing Audio Signals with Adaptive Time or Frequency Resolution

Country Status (1)

Country Link
US (7) US7711123B2 (en)

Families Citing this family (82)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7461002B2 (en) * 2001-04-13 2008-12-02 Dolby Laboratories Licensing Corporation Method for time aligning audio signals using characterizations based on auditory events
US7283954B2 (en) * 2001-04-13 2007-10-16 Dolby Laboratories Licensing Corporation Comparing audio using characterizations based on auditory events
US7610205B2 (en) * 2002-02-12 2009-10-27 Dolby Laboratories Licensing Corporation High quality time-scaling and pitch-scaling of audio signals
US7711123B2 (en) 2001-04-13 2010-05-04 Dolby Laboratories Licensing Corporation Segmenting audio signals into auditory events
AU2002248431B2 (en) * 2001-04-13 2008-11-13 Dolby Laboratories Licensing Corporation High quality time-scaling and pitch-scaling of audio signals
EP1386312B1 (en) * 2001-05-10 2008-02-20 Dolby Laboratories Licensing Corporation Improving transient performance of low bit rate audio coding systems by reducing pre-noise
US7277554B2 (en) 2001-08-08 2007-10-02 Gn Resound North America Corporation Dynamic range compression using digital frequency warping
US7421304B2 (en) * 2002-01-21 2008-09-02 Kenwood Corporation Audio signal processing device, signal recovering device, audio signal processing method and signal recovering method
JP3891111B2 (en) * 2002-12-12 2007-03-14 ソニー株式会社 Acoustic signal processing apparatus and method, signal recording apparatus and method, and program
CN1754218A (en) * 2003-02-26 2006-03-29 皇家飞利浦电子股份有限公司 Handling of digital silence in audio fingerprinting
US20070153125A1 (en) * 2003-05-16 2007-07-05 Pixel Instruments, Corp. Method, system, and program product for measuring audio video synchronization
US7499104B2 (en) * 2003-05-16 2009-03-03 Pixel Instruments Corporation Method and apparatus for determining relative timing of image and associated information
KR101164937B1 (en) * 2003-05-28 2012-07-12 돌비 레버러토리즈 라이쎈싱 코오포레이션 Method, apparatus and computer program for calculating and adjusting the perceived loudness of an audio signal
US7179980B2 (en) * 2003-12-12 2007-02-20 Nokia Corporation Automatic extraction of musical portions of an audio stream
US20090299756A1 (en) * 2004-03-01 2009-12-03 Dolby Laboratories Licensing Corporation Ratio of speech to non-speech audio such as for elderly or hearing-impaired listeners
CA2992097C (en) 2004-03-01 2018-09-11 Dolby Laboratories Licensing Corporation Reconstructing audio signals with multiple decorrelation techniques and differentially coded parameters
JP3827317B2 (en) * 2004-06-03 2006-09-27 任天堂株式会社 Command processing unit
US7508947B2 (en) * 2004-08-03 2009-03-24 Dolby Laboratories Licensing Corporation Method for combining audio signals using auditory scene analysis
MX2007002071A (en) * 2004-08-18 2007-04-24 Nielsen Media Res Inc Methods and apparatus for generating signatures.
AU2005299410B2 (en) 2004-10-26 2011-04-07 Dolby Laboratories Licensing Corporation Calculating and adjusting the perceived loudness and/or the perceived spectral balance of an audio signal
US8199933B2 (en) 2004-10-26 2012-06-12 Dolby Laboratories Licensing Corporation Calculating and adjusting the perceived loudness and/or the perceived spectral balance of an audio signal
WO2006097633A1 (en) * 2005-03-15 2006-09-21 France Telecom Method and system for spatializing an audio signal based on its intrinsic qualities
PL1931197T3 (en) * 2005-04-18 2015-09-30 Basf Se Preparation containing at least one conazole fungicide a further fungicide and a stabilising copolymer
MX2007015118A (en) * 2005-06-03 2008-02-14 Dolby Lab Licensing Corp Apparatus and method for encoding audio signals with decoding instructions.
TWI396188B (en) * 2005-08-02 2013-05-11 Dolby Lab Licensing Corp Controlling spatial audio coding parameters as a function of auditory events
TWI517562B (en) 2006-04-04 2016-01-11 杜比實驗室特許公司 Method, apparatus, and computer program for scaling the overall perceived loudness of a multichannel audio signal by a desired amount
CN101410892B (en) * 2006-04-04 2012-08-08 杜比实验室特许公司 Audio signal loudness measurement and modification in the mdct domain
DE602007011594D1 (en) 2006-04-27 2011-02-10 Dolby Lab Licensing Corp SOUND AMPLIFICATION WITH RECORDING OF PUBLIC EVENTS ON THE BASIS OF SPECIFIC VOLUME
JP4665836B2 (en) * 2006-05-31 2011-04-06 日本ビクター株式会社 Music classification device, music classification method, and music classification program
JP4940308B2 (en) 2006-10-20 2012-05-30 ドルビー ラボラトリーズ ライセンシング コーポレイション Audio dynamics processing using reset
US8521314B2 (en) * 2006-11-01 2013-08-27 Dolby Laboratories Licensing Corporation Hierarchical control path with constraints for audio dynamics processing
US20080111887A1 (en) * 2006-11-13 2008-05-15 Pixel Instruments, Corp. Method, system, and program product for measuring audio video synchronization independent of speaker characteristics
US20080130908A1 (en) 2006-12-05 2008-06-05 Searete Llc, A Limited Liability Corporation Of The State Of Delaware Selective audio/sound aspects
FR2911031B1 (en) * 2006-12-28 2009-04-10 Actimagine Soc Par Actions Sim AUDIO CODING METHOD AND DEVICE
JP2008241850A (en) * 2007-03-26 2008-10-09 Sanyo Electric Co Ltd Recording or reproducing device
US8849432B2 (en) * 2007-05-31 2014-09-30 Adobe Systems Incorporated Acoustic pattern identification using spectral characteristics to synchronize audio and/or video
CN101681625B (en) * 2007-06-08 2012-11-07 杜比实验室特许公司 Method and device for obtaining two surround sound audio channels by two inputted sound singals
US8054948B1 (en) * 2007-06-28 2011-11-08 Sprint Communications Company L.P. Audio experience for a communications device user
US8396574B2 (en) * 2007-07-13 2013-03-12 Dolby Laboratories Licensing Corporation Audio processing using auditory scene analysis and spectral skewness
US8515257B2 (en) * 2007-10-17 2013-08-20 International Business Machines Corporation Automatic announcer voice attenuation in a presentation of a televised sporting event
US8346559B2 (en) * 2007-12-20 2013-01-01 Dean Enterprises, Llc Detection of conditions from sound
CN102017402B (en) * 2007-12-21 2015-01-07 Dts有限责任公司 System for adjusting perceived loudness of audio signals
ES2739667T3 (en) * 2008-03-10 2020-02-03 Fraunhofer Ges Forschung Device and method to manipulate an audio signal that has a transient event
EP2329492A1 (en) * 2008-09-19 2011-06-08 Dolby Laboratories Licensing Corporation Upstream quality enhancement signal processing for resource constrained client devices
ES2385293T3 (en) 2008-09-19 2012-07-20 Dolby Laboratories Licensing Corporation Upstream signal processing for client devices in a small cell wireless network
EP2425426B1 (en) 2009-04-30 2013-03-13 Dolby Laboratories Licensing Corporation Low complexity auditory event boundary detection
US8538042B2 (en) 2009-08-11 2013-09-17 Dts Llc System for increasing perceived loudness of speakers
US8670577B2 (en) 2010-10-18 2014-03-11 Convey Technology, Inc. Electronically-simulated live music
US20120121103A1 (en) * 2010-11-12 2012-05-17 Searete Llc, A Limited Liability Corporation Of The State Of Delaware Audio/sound information system and method
KR20120064582A (en) * 2010-12-09 2012-06-19 한국전자통신연구원 Method of searching multi-media contents and apparatus for the same
IT1403658B1 (en) * 2011-01-28 2013-10-31 Universal Multimedia Access S R L PROCEDURE AND MEANS OF SCANDING AND / OR SYNCHRONIZING AUDIO / VIDEO EVENTS
EP2681691A4 (en) * 2011-03-03 2015-06-03 Cypher Llc System for autononous detection and separation of common elements within data, and methods and devices associated therewith
US8462984B2 (en) * 2011-03-03 2013-06-11 Cypher, Llc Data pattern recognition and separation engine
WO2013022426A1 (en) * 2011-08-08 2013-02-14 Hewlett-Packard Development Company, L.P. Method and system for compression of a real-time surveillance signal
WO2013030623A1 (en) * 2011-08-30 2013-03-07 Nokia Corporation An audio scene mapping apparatus
US9471673B1 (en) * 2012-03-12 2016-10-18 Google Inc. Audio matching using time-frequency onsets
US9312829B2 (en) 2012-04-12 2016-04-12 Dts Llc System for adjusting loudness of audio signals in real time
CN103716470B (en) * 2012-09-29 2016-12-07 华为技术有限公司 The method and apparatus of Voice Quality Monitor
US9158760B2 (en) 2012-12-21 2015-10-13 The Nielsen Company (Us), Llc Audio decoding with supplemental semantic audio recognition and report generation
US9183849B2 (en) * 2012-12-21 2015-11-10 The Nielsen Company (Us), Llc Audio matching with semantic audio recognition and report generation
US9805725B2 (en) 2012-12-21 2017-10-31 Dolby Laboratories Licensing Corporation Object clustering for rendering object-based audio content based on perceptual criteria
US9195649B2 (en) 2012-12-21 2015-11-24 The Nielsen Company (Us), Llc Audio processing techniques for semantic audio recognition and report generation
US9812150B2 (en) 2013-08-28 2017-11-07 Accusonus, Inc. Methods and systems for improved signal decomposition
US10468036B2 (en) 2014-04-30 2019-11-05 Accusonus, Inc. Methods and systems for processing and mixing signals using signal decomposition
US10536136B2 (en) 2014-04-02 2020-01-14 Lachlan Paul BARRATT Modified digital filtering with sample zoning
US11308928B2 (en) 2014-09-25 2022-04-19 Sunhouse Technologies, Inc. Systems and methods for capturing and interpreting audio
US9536509B2 (en) 2014-09-25 2017-01-03 Sunhouse Technologies, Inc. Systems and methods for capturing and interpreting audio
JP6501259B2 (en) * 2015-08-04 2019-04-17 本田技研工業株式会社 Speech processing apparatus and speech processing method
US10001978B2 (en) 2015-11-11 2018-06-19 Oracle International Corporation Type inference optimization
US9756281B2 (en) 2016-02-05 2017-09-05 Gopro, Inc. Apparatus and method for audio based video synchronization
US9697849B1 (en) 2016-07-25 2017-07-04 Gopro, Inc. Systems and methods for audio based synchronization using energy vectors
US9640159B1 (en) * 2016-08-25 2017-05-02 Gopro, Inc. Systems and methods for audio based synchronization using sound harmonics
US9653095B1 (en) 2016-08-30 2017-05-16 Gopro, Inc. Systems and methods for determining a repeatogram in a music composition using audio features
US9916822B1 (en) 2016-10-07 2018-03-13 Gopro, Inc. Systems and methods for audio remixing using repeated segments
EP3616197A4 (en) 2017-04-28 2021-01-27 DTS, Inc. Audio coder window sizes and time-frequency transformations
US20190082226A1 (en) * 2017-09-08 2019-03-14 Arris Enterprises Llc System and method for recommendations for smart foreground viewing among multiple tuned channels based on audio content and user profiles
US11418655B2 (en) * 2018-07-18 2022-08-16 Google Llc Echo detection
CN111524536B (en) * 2019-02-01 2023-09-08 富士通株式会社 Signal processing method and information processing apparatus
US12032628B2 (en) 2019-11-26 2024-07-09 Gracenote, Inc. Methods and apparatus to fingerprint an audio signal via exponential normalization
CN113113032B (en) * 2020-01-10 2024-08-09 华为技术有限公司 Audio encoding and decoding method and audio encoding and decoding equipment
US11798577B2 (en) 2021-03-04 2023-10-24 Gracenote, Inc. Methods and apparatus to fingerprint an audio signal
US20230409635A1 (en) * 2022-05-27 2023-12-21 Sling TV L.L.C. Detecting content of interest in streaming media

Family Cites Families (120)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US33535A (en) * 1861-10-22 Improvement in machines for making bricks
US4624009A (en) * 1980-05-02 1986-11-18 Figgie International, Inc. Signal pattern encoder and classifier
US4464784A (en) * 1981-04-30 1984-08-07 Eventide Clockworks, Inc. Pitch changer with glitch minimizer
US4723290A (en) * 1983-05-16 1988-02-02 Kabushiki Kaisha Toshiba Speech recognition apparatus
US4792975A (en) * 1983-06-03 1988-12-20 The Variable Speech Control ("Vsc") Digital speech signal processing for pitch change with jump control in accordance with pitch period
US4700391A (en) * 1983-06-03 1987-10-13 The Variable Speech Control Company ("Vsc") Method and apparatus for pitch controlled voice signal processing
US5202761A (en) * 1984-11-26 1993-04-13 Cooper J Carl Audio synchronization apparatus
US4703355A (en) * 1985-09-16 1987-10-27 Cooper J Carl Audio to video timing equalizer method and apparatus
USRE33535E (en) * 1985-09-16 1991-02-12 Audio to video timing equalizer method and apparatus
US5040081A (en) * 1986-09-23 1991-08-13 Mccutchen David Audiovisual synchronization signal generator using audio signature comparison
US4852170A (en) * 1986-12-18 1989-07-25 R & D Associates Real time computer speech recognition system
JPS63225300A (en) * 1987-03-16 1988-09-20 株式会社東芝 Pattern recognition equipment
US4829872A (en) * 1987-05-11 1989-05-16 Fairlight Instruments Pty. Limited Detection of musical gestures
GB8720527D0 (en) * 1987-09-01 1987-10-07 King R A Voice recognition
JPS6474097A (en) 1987-09-11 1989-03-20 Mitsubishi Electric Corp Control method for ac excitation synchronous machine
US5055939A (en) 1987-12-15 1991-10-08 Karamon John J Method system & apparatus for synchronizing an auxiliary sound source containing multiple language channels with motion picture film video tape or other picture source containing a sound track
IL84902A (en) * 1987-12-21 1991-12-15 D S P Group Israel Ltd Digital autocorrelation system for detecting speech in noisy audio signal
JP2739950B2 (en) * 1988-03-31 1998-04-15 株式会社東芝 Pattern recognition device
US5235646A (en) * 1990-06-15 1993-08-10 Wilde Martin D Method and apparatus for creating de-correlated audio output signals and audio recordings made thereby
AU8053691A (en) 1990-06-15 1992-01-07 Auris Corp. Method for eliminating the precedence effect in stereophonic sound systems and recording made with said method
WO1991019989A1 (en) 1990-06-21 1991-12-26 Reynolds Software, Inc. Method and apparatus for wave analysis and event recognition
US5313531A (en) * 1990-11-05 1994-05-17 International Business Machines Corporation Method and apparatus for speech analysis and speech recognition
WO1992012607A1 (en) * 1991-01-08 1992-07-23 Dolby Laboratories Licensing Corporation Encoder/decoder for multidimensional sound fields
US5216744A (en) * 1991-03-21 1993-06-01 Dictaphone Corporation Time scale modification of speech signals
FR2674710B1 (en) 1991-03-27 1994-11-04 France Telecom METHOD AND SYSTEM FOR PROCESSING PREECHOS OF AN AUDIO-DIGITAL SIGNAL ENCODED BY FREQUENTIAL TRANSFORM.
JP3134338B2 (en) 1991-03-30 2001-02-13 ソニー株式会社 Digital audio signal encoding method
US5175769A (en) 1991-07-23 1992-12-29 Rolm Systems Method for time-scale modification of signals
US5450522A (en) * 1991-08-19 1995-09-12 U S West Advanced Technologies, Inc. Auditory model for parametrization of speech
JP3074046B2 (en) 1991-10-21 2000-08-07 沖電気工業株式会社 Voice / music sound identification circuit
US5621857A (en) * 1991-12-20 1997-04-15 Oregon Graduate Institute Of Science And Technology Method and system for identifying and recognizing speech
JPH05181464A (en) 1991-12-27 1993-07-23 Sony Corp Musical sound recognition device
JP3104400B2 (en) 1992-04-27 2000-10-30 ソニー株式会社 Audio signal encoding apparatus and method
JP3298188B2 (en) 1992-12-09 2002-07-02 富士通株式会社 Voice detection method
US5634020A (en) * 1992-12-31 1997-05-27 Avid Technology, Inc. Apparatus and method for displaying audio data as a discrete waveform
DE69428612T2 (en) 1993-01-25 2002-07-11 Matsushita Electric Industrial Co., Ltd. Method and device for carrying out a time scale modification of speech signals
KR100372208B1 (en) * 1993-09-09 2003-04-07 산요 덴키 가부시키가이샤 Time compression / extension method of audio signal
JP3186412B2 (en) 1994-04-01 2001-07-11 ソニー株式会社 Information encoding method, information decoding method, and information transmission method
JP3307138B2 (en) 1995-02-27 2002-07-24 ソニー株式会社 Signal encoding method and apparatus, and signal decoding method and apparatus
US5920840A (en) 1995-02-28 1999-07-06 Motorola, Inc. Communication system and method using a speaker dependent time-scaling technique
US5842172A (en) 1995-04-21 1998-11-24 Tensortech Corporation Method and apparatus for modifying the play time of digital audio tracks
US5730140A (en) * 1995-04-28 1998-03-24 Fitch; William Tecumseh S. Sonification system using synthesized realistic body sounds modified by other medically-important variables for physiological monitoring
US5699404A (en) 1995-06-26 1997-12-16 Motorola, Inc. Apparatus for time-scaling in communication products
US6002776A (en) * 1995-09-18 1999-12-14 Interval Research Corporation Directional acoustic signal processor and method therefor
US5960390A (en) 1995-10-05 1999-09-28 Sony Corporation Coding method for using multi channel audio signals
FR2739736B1 (en) 1995-10-05 1997-12-05 Jean Laroche PRE-ECHO OR POST-ECHO REDUCTION METHOD AFFECTING AUDIO RECORDINGS
DE69612958T2 (en) * 1995-11-22 2001-11-29 Koninklijke Philips Electronics N.V., Eindhoven METHOD AND DEVICE FOR RESYNTHETIZING A VOICE SIGNAL
US5956674A (en) 1995-12-01 1999-09-21 Digital Theater Systems, Inc. Multi-channel predictive subband audio coder using psychoacoustic adaptive bit allocation in frequency, time and over the multiple channels
US5749073A (en) * 1996-03-15 1998-05-05 Interval Research Corporation System for automatically morphing audio information
US6430533B1 (en) * 1996-05-03 2002-08-06 Lsi Logic Corporation Audio decoder core MPEG-1/MPEG-2/AC-3 functional algorithm partitioning and implementation
US5828994A (en) * 1996-06-05 1998-10-27 Interval Research Corporation Non-uniform time scale modification of recorded audio
JPH1074097A (en) 1996-07-26 1998-03-17 Ind Technol Res Inst Parameter changing method and device for audio signal
US6049766A (en) 1996-11-07 2000-04-11 Creative Technology Ltd. Time-domain time/pitch scaling of speech or audio signals with transient handling
JP3124239B2 (en) 1996-11-13 2001-01-15 沖電気工業株式会社 Video information detection device
US5893062A (en) * 1996-12-05 1999-04-06 Interval Research Corporation Variable rate video playback with synchronized audio
US5862228A (en) * 1997-02-21 1999-01-19 Dolby Laboratories Licensing Corporation Audio matrix encoding
DE19710545C1 (en) 1997-03-14 1997-12-04 Grundig Ag Time scale modification method for speech signals
EP0977172A4 (en) 1997-03-19 2000-12-27 Hitachi Ltd Method and device for detecting starting and ending points of sound section in video
US6211919B1 (en) * 1997-03-28 2001-04-03 Tektronix, Inc. Transparent embedment of data in a video signal
SE512719C2 (en) * 1997-06-10 2000-05-02 Lars Gustaf Liljeryd A method and apparatus for reducing data flow based on harmonic bandwidth expansion
TW357335B (en) * 1997-10-08 1999-05-01 Winbond Electronics Corp Apparatus and method for variation of tone of digital audio signals
US6330672B1 (en) 1997-12-03 2001-12-11 At&T Corp. Method and apparatus for watermarking digital bitstreams
EP0976125B1 (en) 1997-12-19 2004-03-24 Koninklijke Philips Electronics N.V. Removing periodicity from a lengthened audio signal
US6108622A (en) 1998-06-26 2000-08-22 Lsi Logic Corporation Arithmetic logic unit controller for linear PCM scaling and decimation in an audio decoder
GB2340351B (en) * 1998-07-29 2004-06-09 British Broadcasting Corp Data transmission
US6266003B1 (en) 1998-08-28 2001-07-24 Sigma Audio Research Limited Method and apparatus for signal processing for time-scale and/or pitch modification of audio signals
US6266644B1 (en) 1998-09-26 2001-07-24 Liquid Audio, Inc. Audio encoding apparatus and methods
SE9903552D0 (en) 1999-01-27 1999-10-01 Lars Liljeryd Efficient spectral envelope coding using dynamic scalefactor grouping and time / frequency switching
TW477119B (en) 1999-01-28 2002-02-21 Winbond Electronics Corp Byte allocation method and device for speech synthesis
JP3430968B2 (en) 1999-05-06 2003-07-28 ヤマハ株式会社 Method and apparatus for time axis companding of digital signal
JP3465628B2 (en) 1999-05-06 2003-11-10 ヤマハ株式会社 Method and apparatus for time axis companding of audio signal
JP3546755B2 (en) 1999-05-06 2004-07-28 ヤマハ株式会社 Method and apparatus for companding time axis of rhythm sound source signal
JP3430974B2 (en) 1999-06-22 2003-07-28 ヤマハ株式会社 Method and apparatus for time axis companding of stereo signal
US6411724B1 (en) 1999-07-02 2002-06-25 Koninklijke Philips Electronics N.V. Using meta-descriptors to represent multimedia information
JP4300641B2 (en) 1999-08-10 2009-07-22 ヤマハ株式会社 Time axis companding method and apparatus for multitrack sound source signal
US6978236B1 (en) * 1999-10-01 2005-12-20 Coding Technologies Ab Efficient spectral envelope coding using variable time/frequency resolution and time/frequency switching
JP4438144B2 (en) * 1999-11-11 2010-03-24 ソニー株式会社 Signal classification method and apparatus, descriptor generation method and apparatus, signal search method and apparatus
FR2802329B1 (en) * 1999-12-08 2003-03-28 France Telecom PROCESS FOR PROCESSING AT LEAST ONE AUDIO CODE BINARY FLOW ORGANIZED IN THE FORM OF FRAMES
US7092774B1 (en) 2000-02-29 2006-08-15 Prime Image, Inc. Multi-channel audio processing system with real-time program duration alteration
US6505153B1 (en) * 2000-05-22 2003-01-07 Compaq Information Technologies Group, L.P. Efficient method for producing off-line closed captions
US6718309B1 (en) 2000-07-26 2004-04-06 Ssi Corporation Continuously variable time scale modification of digital audio signals
KR100898879B1 (en) 2000-08-16 2009-05-25 돌비 레버러토리즈 라이쎈싱 코오포레이션 Modulating One or More Parameter of An Audio or Video Perceptual Coding System in Response to Supplemental Information
US7457422B2 (en) * 2000-11-29 2008-11-25 Ford Global Technologies, Llc Method and implementation for detecting and characterizing audible transients in noise
WO2004019656A2 (en) 2001-02-07 2004-03-04 Dolby Laboratories Licensing Corporation Audio channel spatial translation
US6680753B2 (en) 2001-03-07 2004-01-20 Matsushita Electric Industrial Co., Ltd. Method and apparatus for skipping and repeating audio frames
US7461002B2 (en) * 2001-04-13 2008-12-02 Dolby Laboratories Licensing Corporation Method for time aligning audio signals using characterizations based on auditory events
US20020116178A1 (en) * 2001-04-13 2002-08-22 Crockett Brett G. High quality time-scaling and pitch-scaling of audio signals
US7283954B2 (en) * 2001-04-13 2007-10-16 Dolby Laboratories Licensing Corporation Comparing audio using characterizations based on auditory events
MXPA03009357A (en) 2001-04-13 2004-02-18 Dolby Lab Licensing Corp High quality time-scaling and pitch-scaling of audio signals.
US7711123B2 (en) 2001-04-13 2010-05-04 Dolby Laboratories Licensing Corporation Segmenting audio signals into auditory events
US7610205B2 (en) * 2002-02-12 2009-10-27 Dolby Laboratories Licensing Corporation High quality time-scaling and pitch-scaling of audio signals
US7292901B2 (en) * 2002-06-24 2007-11-06 Agere Systems Inc. Hybrid multi-channel/cue coding/decoding of audio signals
EP1386312B1 (en) * 2001-05-10 2008-02-20 Dolby Laboratories Licensing Corporation Improving transient performance of low bit rate audio coding systems by reducing pre-noise
JP4272050B2 (en) 2001-05-25 2009-06-03 ドルビー・ラボラトリーズ・ライセンシング・コーポレーション Audio comparison using characterization based on auditory events
MXPA03010751A (en) 2001-05-25 2005-03-07 Dolby Lab Licensing Corp High quality time-scaling and pitch-scaling of audio signals.
US7346667B2 (en) 2001-05-31 2008-03-18 Ubs Ag System for delivering dynamic content
US7171367B2 (en) 2001-12-05 2007-01-30 Ssi Corporation Digital audio with parameters for real-time time scaling
US20040037421A1 (en) * 2001-12-17 2004-02-26 Truman Michael Mead Parital encryption of assembled bitstreams
ES2255678T3 (en) 2002-02-18 2006-07-01 Koninklijke Philips Electronics N.V. PARAMETRIC AUDIO CODING.
DE60225190T2 (en) * 2002-04-05 2009-09-10 International Business Machines Corp. FEATURE-BASED AUDIO CONTENT IDENTIFICATION
EP1500084B1 (en) 2002-04-22 2008-01-23 Koninklijke Philips Electronics N.V. Parametric representation of spatial audio
US7043423B2 (en) * 2002-07-16 2006-05-09 Dolby Laboratories Licensing Corporation Low bit-rate audio coding systems and methods that use expanding quantizers with arithmetic coding
DE10236694A1 (en) * 2002-08-09 2004-02-26 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Equipment for scalable coding and decoding of spectral values of signal containing audio and/or video information by splitting signal binary spectral values into two partial scaling layers
US7454331B2 (en) * 2002-08-30 2008-11-18 Dolby Laboratories Licensing Corporation Controlling loudness of speech in signals that contain speech and other types of audio material
DE602004023917D1 (en) 2003-02-06 2009-12-17 Dolby Lab Licensing Corp CONTINUOUS AUDIO DATA BACKUP
KR101164937B1 (en) 2003-05-28 2012-07-12 돌비 레버러토리즈 라이쎈싱 코오포레이션 Method, apparatus and computer program for calculating and adjusting the perceived loudness of an audio signal
US7398207B2 (en) * 2003-08-25 2008-07-08 Time Warner Interactive Video Group, Inc. Methods and systems for determining audio loudness levels in programming
CA2992097C (en) 2004-03-01 2018-09-11 Dolby Laboratories Licensing Corporation Reconstructing audio signals with multiple decorrelation techniques and differentially coded parameters
US7617109B2 (en) * 2004-07-01 2009-11-10 Dolby Laboratories Licensing Corporation Method for correcting metadata affecting the playback loudness and dynamic range of audio information
US7391870B2 (en) * 2004-07-09 2008-06-24 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E V Apparatus and method for generating a multi-channel output signal
US7508947B2 (en) 2004-08-03 2009-03-24 Dolby Laboratories Licensing Corporation Method for combining audio signals using auditory scene analysis
JP4594681B2 (en) * 2004-09-08 2010-12-08 ソニー株式会社 Audio signal processing apparatus and audio signal processing method
TW200638335A (en) 2005-04-13 2006-11-01 Dolby Lab Licensing Corp Audio metadata verification
TWI397903B (en) 2005-04-13 2013-06-01 Dolby Lab Licensing Corp Economical loudness measurement of coded audio
US7983922B2 (en) * 2005-04-15 2011-07-19 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for generating multi-channel synthesizer control signal and apparatus and method for multi-channel synthesizing
US8160888B2 (en) * 2005-07-19 2012-04-17 Koninklijke Philips Electronics N.V Generation of multi-channel audio signals
DE602007011594D1 (en) 2006-04-27 2011-02-10 Dolby Lab Licensing Corp SOUND AMPLIFICATION WITH RECORDING OF PUBLIC EVENTS ON THE BASIS OF SPECIFIC VOLUME
EP2176862B1 (en) * 2008-07-11 2011-08-31 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for calculating bandwidth extension data using a spectral tilt controlling framing
CN102172047B (en) * 2008-07-31 2014-01-29 弗劳恩霍夫应用研究促进协会 Signal generation for binaural signals
UA100353C2 (en) * 2009-12-07 2012-12-10 Долбі Лабораторіс Лайсензін Корпорейшн Decoding of multichannel audio encoded bit streams using adaptive hybrid transformation
MX346306B (en) * 2012-06-25 2017-03-15 Lg Electronics Inc Apparatus and method for processing an interactive service.

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018084848A1 (en) * 2016-11-04 2018-05-11 Hewlett-Packard Development Company, L.P. Dominant frequency processing of audio signals
US10390137B2 (en) 2016-11-04 2019-08-20 Hewlett-Packard Dvelopment Company, L.P. Dominant frequency processing of audio signals
US20180308497A1 (en) * 2017-04-25 2018-10-25 Dts, Inc. Encoding and decoding of digital audio signals using variable alphabet size
US10699723B2 (en) * 2017-04-25 2020-06-30 Dts, Inc. Encoding and decoding of digital audio signals using variable alphabet size

Also Published As

Publication number Publication date
US7711123B2 (en) 2010-05-04
US9165562B1 (en) 2015-10-20
US20040165730A1 (en) 2004-08-26
US8488800B2 (en) 2013-07-16
US20170004838A1 (en) 2017-01-05
US10134409B2 (en) 2018-11-20
US20140376729A1 (en) 2014-12-25
US20130279704A1 (en) 2013-10-24
US20150371649A1 (en) 2015-12-24
US8842844B2 (en) 2014-09-23
US20100185439A1 (en) 2010-07-22

Similar Documents

Publication Publication Date Title
US9165562B1 (en) Processing audio signals with adaptive time or frequency resolution
CA2448182C (en) Segmenting audio signals into auditory events
EP2549475B1 (en) Segmenting audio signals into auditory events
US7283954B2 (en) Comparing audio using characterizations based on auditory events
AU2002252143A1 (en) Segmenting audio signals into auditory events
US7461002B2 (en) Method for time aligning audio signals using characterizations based on auditory events
AU2002240461A1 (en) Comparing audio using characterizations based on auditory events

Legal Events

Date Code Title Description
AS Assignment

Owner name: DOLBY LABORATORIES LICENSING CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:CROCKETT, BRETT G.;REEL/FRAME:035877/0142

Effective date: 20031112

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20231020