WO2015093668A1 - Audio signal processing device and method - Google Patents

Audio signal processing device and method

Info

Publication number
WO2015093668A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
tempo
audio signal
sub
spectrum
Prior art date
Application number
PCT/KR2013/011935
Other languages
English (en)
Korean (ko)
Inventor
김태홍
Original Assignee
김태홍
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 김태홍 filed Critical 김태홍
Priority to PCT/KR2013/011935 priority Critical patent/WO2015093668A1/fr
Publication of WO2015093668A1 publication Critical patent/WO2015093668A1/fr

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/36 Accompaniment arrangements
    • G10H1/40 Rhythm
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/076 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction of timing, tempo; Beat detection

Definitions

  • The disclosed embodiments relate to an apparatus and a method for processing an audio signal and, more particularly, to an apparatus and a method for extracting a highlight portion of an audio signal.
  • Presenting highlight portions of the various media files included in an electronic device to the user of the device enables the user to easily select which of the media files he or she wishes to enjoy.
  • Existing techniques for extracting highlights from an audio signal represented in a media file entail dividing the audio signal into intervals of an arbitrary time unit (e.g., 1 second) without regard to the unique characteristics of the audio signal. That is, this division applies a common time unit to any audio signal. Therefore, existing highlight extraction techniques are limited in their ability to provide highlights that feel natural to the user.
  • the disclosed embodiments are directed to an apparatus and method for estimating the tempo of an audio signal and extracting highlight portions of the audio signal based on the estimated tempo.
  • a tempo estimator configured to estimate a tempo of the audio signal from a sequence of samples representing an audio signal in a time domain;
  • An energy level calculator configured to derive a plurality of partially overlapping sub-sequences from the sequence and to calculate energy levels respectively corresponding to the plurality of sub-sequences;
  • a highlight extractor configured to extract one of the plurality of sub-sequences as a highlight portion of the audio signal based on the energy levels, each sub-sequence having a duration set according to the estimated tempo.
  • The tempo estimator may also be configured to estimate the tempo in units of beats per minute (BPM), and the duration may be an integer multiple of the beat time according to the estimated tempo.
  • A start sample in each sub-sequence may be spaced apart from a start sample in a sub-sequence adjacent to that sub-sequence by a time difference set according to the estimated tempo, and the duration may be an integer multiple of the time difference.
  • The energy level calculator may further be configured to calculate the energy level corresponding to each sub-sequence by obtaining the energy of the samples enclosed by a sliding window while moving the sliding window by the time difference with respect to the samples in the sequence, and the sliding window may have the same length as the duration.
  • the highlight extractor may also be configured to extract, as the highlight portion, a sub-sequence having the largest corresponding energy level among the sub-sequences.
  • the tempo estimator comprises: a time-to-frequency domain conversion unit configured to convert the sequence into a frequency domain spectrum; a copy generator configured to obtain a plurality of scaled copies of the frequency domain spectrum by scaling the frequency axis of the frequency domain spectrum according to each of a plurality of preset magnifications; a spectrum adder configured to obtain an overlap spectrum by adding the plurality of scaled copies, including the frequency domain spectrum; and a tempo estimator configured to estimate the tempo by detecting one or a plurality of peak frequencies of the overlap spectrum.
  • the tempo estimator may also be configured to detect the one or a plurality of peak frequencies within a preset frequency region of the overlap spectrum.
  • the frequency domain spectrum may be a magnitude spectrum in the frequency domain of the audio signal.
  • The tempo estimator may further comprise a preprocessor configured to update the sequence by performing, before the conversion, at least one of downsampling the sequence and performing an absolute value operation on the value of each of the samples in the sequence.
  • The tempo estimator may further select candidate tempos according to the detected plurality of peak frequencies, divide the sequence into frames having a length set according to each candidate tempo, and estimate one of the candidate tempos as the tempo based on the similarity between adjacent frames.
  • According to another aspect, there is provided a method comprising: estimating a tempo of an audio signal from a sequence of samples representing the audio signal in a time domain; deriving a plurality of partially overlapping sub-sequences from the sequence and calculating energy levels respectively corresponding to the plurality of sub-sequences; and extracting one of the plurality of sub-sequences as a highlight portion of the audio signal based on the energy levels, each sub-sequence having a duration set according to the estimated tempo, and each energy level representing an energy of the samples of the sub-sequence corresponding to that energy level.
  • The estimating may include estimating the tempo in units of beats per minute (BPM), and the duration may be an integer multiple of the beat time according to the estimated tempo.
  • A start sample in each sub-sequence may be spaced apart from a start sample in a sub-sequence adjacent to that sub-sequence by a time difference set according to the estimated tempo, and the duration may be an integer multiple of the time difference.
  • The calculating may comprise calculating the energy level corresponding to each sub-sequence by obtaining the energy of the samples enclosed by a sliding window while moving the sliding window by the time difference with respect to the samples in the sequence, and the sliding window may have the same length as the duration.
  • the extracting may include extracting, as the highlight portion, a sub-sequence having the largest corresponding energy level among the sub-sequences.
  • the estimating comprises: converting the sequence into a frequency domain spectrum; obtaining a plurality of scaled copies of the frequency domain spectrum by scaling the frequency axis of the frequency domain spectrum according to each of a plurality of preset magnifications; obtaining an overlap spectrum by adding the plurality of scaled copies, including the frequency domain spectrum; and estimating the tempo by detecting one or a plurality of peak frequencies in the overlap spectrum.
  • detecting the one or the plurality of peak frequencies may include detecting the one or the plurality of peak frequencies within a preset frequency region of the overlap spectrum.
  • the frequency domain spectrum may be a magnitude spectrum in the frequency domain of the audio signal.
  • updating the sequence by performing at least one of downsampling the sequence and taking an absolute value operation on the value of each of the samples in the sequence before the conversion.
  • Estimating the tempo from the detected one or plurality of peak frequencies may comprise: selecting candidate tempos according to the detected plurality of peak frequencies; dividing the sequence into frames of a length set according to each candidate tempo; and estimating one of the candidate tempos as the tempo based on the similarity between each frame and its adjacent frame.
  • a computer readable storage medium having stored thereon computer executable instructions configured to perform any one of the above methods when executed by a processor.
  • the highlight portion of the audio signal may be extracted based on an estimated value of the tempo of the audio signal.
  • the highlight portion extracted in accordance with certain embodiments may provide an improved experience for the listener.
  • FIG. 1 is a block diagram of an audio signal processing apparatus according to some embodiments.
  • FIG. 3 is a block diagram of a tempo estimator in accordance with certain embodiments.
  • FIG. 4 is a flowchart of an audio signal processing method according to some embodiments.
  • FIG. 1 is a block diagram of an audio signal processing apparatus according to some embodiments.
  • the exemplary audio signal processing apparatus 100 includes a tempo estimator 110, an energy level calculator 120, and a highlight extractor 130.
  • Tempo estimator 110 is configured to estimate the tempo of the audio signal.
  • the energy level calculator 120 is configured to calculate energy levels respectively corresponding to the plurality of time periods of the audio signal. The length of each time period may be set according to the tempo estimated by the tempo estimator 110.
  • the highlight extractor 130 is configured to identify the highlight portion of the audio signal. Based on the energy levels calculated by the energy level calculator 120, one of the time periods of the audio signal may be extracted as the highlight portion.
  • the tempo estimator 110, the energy level calculator 120, and the highlight extractor 130 of the audio signal processing apparatus 100 may be configured to perform the following operations.
  • the tempo estimator 110 may estimate the tempo of the audio signal from a sequence of samples representing a time-domain waveform of the audio signal.
  • The sample sequence may be stored locally in the audio signal processing apparatus 100 (e.g., as data representing a media file comprising the audio signal), or may be provided to the tempo estimator 110 from a media source remote from the audio signal processing apparatus 100.
  • The tempo of a piece of sound refers to the perceived rate at which the successive tones of the piece progress.
  • One of the commonly used tempo units is BPM (beats per minute).
  • At 60 BPM, for example, one beat of the audio signal represents a time unit of one second.
  • A verse of a typical 4/4 pop song consists of four measures, and each measure consists of four beats; such a verse therefore corresponds to 16 seconds if the tempo of the song is 60 BPM.
  • The tempo estimated by the tempo estimator 110 may be used by the audio signal processing apparatus 100 to extract a highlight portion of the audio signal. This is based on the observation that when an audio signal represents a piece of music, a climax or characteristic melody typically appears in units of a verse or a measure of the piece. Exemplary tempo estimation techniques that may be applied to the tempo estimator 110 are described below.
  • The energy level calculator 120 is configured to calculate energy levels respectively corresponding to the plurality of time periods of the audio signal. For example, the energy level calculator 120 may derive a plurality of partially overlapping sub-sequences from a sample sequence representing the audio signal and calculate an energy level representing the energy of the samples of each sub-sequence.
  • Each sub-sequence has a time duration set according to the tempo estimated by the tempo estimator 110.
  • the start sample in each sub-sequence may be spaced a predetermined time difference from the start sample in the sub-sequence adjacent to that sub-sequence. This time difference may be set according to the tempo of the audio signal estimated by the tempo estimator 110. For example, the duration of each sub-sequence may be an integer multiple of that time difference.
  • Suppose the time-domain waveform of the audio signal is given by the dashed curve 210 shown in FIG. 2, and that the waveform is sampled at 5 Hz and represented by a sequence of 500 samples in total.
  • If the tempo of the audio signal estimated by the tempo estimator 110 is 60 BPM, one beat of the audio signal includes 5 samples.
  • the duration of each sub-sequence may be set according to the tempo estimate. This setting may be performed by the audio signal processing apparatus 100 (eg, the energy level calculator 120) or may be performed according to a user's input of the audio signal processing apparatus 100.
  • The duration of each sub-sequence can be set to a time of four beats, which, as described above, corresponds to one measure of 4/4 music.
  • This duration is exemplary only, and another duration (e.g., a time of 16 beats corresponding to one verse of 4/4 music) may be set.
  • Suppose the starting samples of adjacent sub-sequences are spaced five consecutive samples (i.e., one beat) apart. That is, fifteen samples (i.e., three beats) overlap in any two adjacent sub-sequences.
  • The energy level calculator 120 can derive the partially overlapping sub-sequences by moving, over the 500 samples in the sequence, a sliding window whose length equals the duration of each sub-sequence, in steps of the time difference corresponding to five consecutive samples, starting from the first sample, and by selecting the 20 samples enclosed by the sliding window as a sub-sequence at each step.
  • The energy level calculator 120 may calculate the energy level corresponding to each sub-sequence by obtaining the energy of the samples in the sliding window. According to some embodiments, the energy level calculator 120 first obtains the energy of the five samples in each beat, and then uses these per-beat energies to compute the energy of the 20 samples in the sliding window, so that the energy level corresponding to each sub-sequence can be calculated with less complexity.
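The beat-wise energy computation described above can be sketched as follows (an illustrative sketch, not the patent's implementation; the function name, the use of NumPy, and the squared-sample definition of energy are assumptions):

```python
import numpy as np

def extract_highlight(samples, beat_len, beats_per_window=4):
    """Derive sub-sequences stepped by one beat, compute each one's
    energy from precomputed per-beat energies, and return the start
    index and samples of the highest-energy sub-sequence."""
    window_len = beats_per_window * beat_len
    n_beats = len(samples) // beat_len
    # Energy of each beat-sized group of samples, computed once and reused.
    beat_energy = np.array([
        float(np.sum(samples[i * beat_len:(i + 1) * beat_len] ** 2))
        for i in range(n_beats)
    ])
    # Each sub-sequence spans `beats_per_window` consecutive beats.
    levels = np.array([
        beat_energy[i:i + beats_per_window].sum()
        for i in range(n_beats - beats_per_window + 1)
    ])
    best = int(np.argmax(levels))        # sub-sequence with the largest energy
    start = best * beat_len
    return start, samples[start:start + window_len]
```

With the 5 Hz, 500-sample, 60 BPM example above (`beat_len=5`), each window covers 20 samples and adjacent windows overlap by 15 samples.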
  • the energy level calculator 120 is configured to preprocess the original audio signal using a predetermined filter, and calculate energy levels corresponding to a plurality of time periods of the preprocessed audio signal, respectively.
  • The filter used in this preprocessing may be a filter for applying a model of human auditory perception to the frequency domain representation of the audio signal (e.g., a mel-frequency filter, a simplified mel-frequency filter with modified coefficients, or a filter according to an auditory perception model other than the mel-frequency filter).
  • The energy level calculator 120 may preprocess the sample sequence representing the audio signal as above, then derive the partially overlapping sub-sequences by moving the sliding window by the time difference over the preprocessed sample sequence, and calculate the energy level representing the energy of the samples of each sub-sequence by obtaining the energy of the samples enclosed by the window.
  • the highlight extractor 130 is configured to extract the highlight portion of the audio signal based on the energy levels calculated by the energy level calculator 120.
  • The highlight extractor 130 may extract, as the highlight portion, one of the plurality of sub-sequences derived by the energy level calculator 120.
  • the extracted highlight portion may be a sub-sequence having the largest energy level among the plurality of sub-sequences.
  • The duration of each sub-sequence may be an integer multiple of the beat time according to the tempo of the audio signal estimated by the tempo estimator 110 (e.g., a time corresponding to a measure or a verse).
  • The highlight portion provided from the highlight extractor 130 thus also has a duration that is an integer multiple of the beat time.
  • estimating the tempo of the audio signal may increase the likelihood that the extracted highlight portion will be experienced as a natural excerpt for the listener.
  • When highlight portions extracted in this manner from two audio signals are concatenated, an abrupt transition between the two portions is less likely to occur.
  • The tempo estimator 110 of the audio signal processing apparatus 100 may employ techniques for estimating the global tempo of the audio signal or a local tempo of a portion of the audio signal.
  • Some tempo estimation techniques rely on detecting events from an audio signal represented in the time domain. These time-domain analysis techniques involve preprocessing and tempo detection. In the preprocessing, major events with large auditory stimuli are detected from the audio signal (e.g., a music signal) and the occurrence times of those events are obtained. In the tempo detection process, the actual tempo of the audio signal is estimated based on the acquired times. An event of an audio signal representing a sound played on a musical instrument or a human utterance may exhibit a characteristic Attack-Decay-Sustain-Release envelope in the time-domain waveform of the audio signal. Filtering, envelope detection, and onset detection may be used in turn to detect such events.
  • Many of these events can be attributed to percussion. Since the sound components from percussion usually occur over the entire frequency band, a spectrogram and/or a filter bank may be utilized for analysis of each sub-band.
  • Estimation of the actual tempo may include determining the minimum time unit or the period of the repeating pattern in the audio signal by clustering the time differences of adjacent pairs of detected events (e.g., inter-onset interval clustering).
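A minimal sketch of the inter-onset interval clustering step, assuming onset times have already been detected by the preprocessing described above (the function name, the 5% tolerance, and the running-average cluster centers are illustrative assumptions):

```python
import numpy as np

def ioi_period(onset_times, tol=0.05):
    """Cluster the inter-onset intervals (IOIs) of adjacent event pairs
    and return the center of the most populated cluster as the period
    of the repeating pattern."""
    iois = np.diff(np.sort(np.asarray(onset_times, dtype=float)))
    sums, counts, centers = [], [], []
    for ioi in iois:
        for k, c in enumerate(centers):
            if abs(ioi - c) <= tol * c:          # joins an existing cluster
                sums[k] += ioi
                counts[k] += 1
                centers[k] = sums[k] / counts[k]
                break
        else:                                    # starts a new cluster
            sums.append(ioi)
            counts.append(1)
            centers.append(ioi)
    best = max(range(len(centers)), key=lambda k: counts[k])
    return centers[best]                         # seconds; tempo = 60 / period
```

For onsets dominated by a 0.5 s spacing, the largest cluster centers near 0.5 s, i.e. a 120 BPM tempo, even in the presence of a few stray events.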
  • Some other tempo estimation techniques rely on detecting a peak frequency from the spectrum of the audio signal represented in the frequency domain.
  • The conversion into the frequency domain may be performed using a fast Fourier transform (FFT).
  • the peak frequency of the signal transformed into the frequency domain indicates the frequency of occurrence of an event.
  • A typical audio signal (e.g., an audio signal representing music) whose tempo is to be estimated will produce peak frequencies in the low frequency band (e.g., 1 to 3.3 Hz, or 60 to 200 BPM).
  • These frequency domain analysis techniques also involve preprocessing to extract the envelope from a large number of samples and to lower the sample rate appropriately for accurate detection of peak frequency.
  • the number of FFT points is determined according to the sample rate and the desired BPM precision, and an estimated value of the BPM can be obtained from the FFT bin and the BPM precision at which the peak frequency occurs. Meanwhile, techniques using a combination of frequency domain analysis and time domain analysis of an audio signal may also be applied.
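The relationship between the sample rate, the number of FFT points, and the BPM precision can be sketched with two hypothetical helpers (the formula is inferred from the fact that one FFT bin spans fs/N Hz, i.e. 60·fs/N BPM; the names are illustrative):

```python
import numpy as np

def fft_points_for_resolution(fs, bpm_resolution):
    """Smallest FFT length whose bin spacing (fs / N Hz, i.e.
    60 * fs / N BPM) resolves the tempo to `bpm_resolution` BPM."""
    return int(np.ceil(60.0 * fs / bpm_resolution))

def bin_to_bpm(fft_bin, fs, n_fft):
    """Convert the index of the peak FFT bin to a tempo estimate in BPM."""
    return fft_bin * (fs / n_fft) * 60.0
```

With a 1000 Hz sample rate and a 0.1 BPM resolution, this reproduces the 600,000-point figure given in the text.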
  • Still other tempo estimation techniques involve scaling the frequency axis of the spectrum of the audio signal at specific magnifications and adding the spectral copies scaled at the different magnifications to the original spectrum to form a superimposed spectrum.
  • The specific magnifications correspond to the frequencies of events that may occur in accordance with the time signature of the audio signal.
  • For example, the set of magnifications applied to the frequency-axis scaling of the spectrum of an audio signal representing music in 4/4 time may be {1/4, 2/4, 3/4, 4/4, 8/4, 12/4, 16/4}.
  • Magnifications of {1/3, 2/3, 3/3, 6/3, 9/3, 12/3} may be applied to an audio signal representing music in 3/4 time. These magnifications match the frequencies at which major events occur in most musical works.
  • Typical music has relatively large peaks in the low frequency range when expressed in the frequency domain. However, simply estimating the frequency corresponding to the largest magnitude peak among such observed peaks as the tempo of the music may be of low accuracy.
  • Such peak frequencies reflect the frequency of occurrence of events produced by the low-frequency instruments of the music (e.g., bass, bass drum, bassoon, tuba, and other rhythm-section instruments).
  • Instead, the tempo may be estimated based on the ratios of the possible event occurrence frequencies to the tempo. For this purpose, it may be useful to treat the events of a given audio signal in units of a quarter beat or more.
  • Bass instruments that produce the rhythm of music have longer oscillation periods than treble instruments, and events caused by bass instruments take longer to sustain or release; the next bass instrument event is therefore likely to occur before the release interval of the preceding bass instrument event ends.
  • In music in 4/4 time, events thus do not usually occur more than four times per beat.
  • Accordingly, the frequencies of event occurrence possible in one verse or one measure can be regarded as once, twice, three times, four times, eight times, twelve times, and sixteen times per four beats, which correspond to the above-mentioned set of magnifications.
  • The spectrum of the above 4/4 music (e.g., the spectrum representing the magnitude in the frequency domain of the music) is scaled along the frequency axis, and from the overlap spectrum representing the addition of the scaled spectral copies,
  • a predetermined number of peak frequencies may be extracted in descending order of magnitude.
  • The single extracted peak frequency, or one of the extracted peak frequencies, may be estimated as the tempo of the music. This is based on the observation that the superimposed spectrum produces distinct peaks (although not necessarily the largest) at frequencies corresponding to the actual tempo.
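A compact sketch of this overlap-spectrum technique, under stated assumptions (NumPy, linear interpolation for the frequency-axis scaling, the 4/4 magnification set, and a 1 to 3.3 Hz region of interest; names are illustrative):

```python
import numpy as np

MAGNIFICATIONS_4_4 = [1/4, 2/4, 3/4, 4/4, 8/4, 12/4, 16/4]

def estimate_bpm(samples, fs, magnifications=MAGNIFICATIONS_4_4,
                 roi=(1.0, 3.3)):
    """Estimate the tempo by adding frequency-axis-scaled copies of the
    magnitude spectrum and detecting the peak inside the region of interest."""
    spectrum = np.abs(np.fft.rfft(samples))      # magnitude spectrum
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)
    overlap = np.zeros_like(spectrum)
    for m in magnifications:
        # A copy whose frequency axis is stretched by m takes the value of
        # the original spectrum at f / m; realized here by interpolation.
        overlap += np.interp(freqs / m, freqs, spectrum)
    mask = (freqs >= roi[0]) & (freqs <= roi[1])
    peak_freq = freqs[mask][np.argmax(overlap[mask])]
    return peak_freq * 60.0                      # 1 Hz corresponds to 60 BPM
```

Because frequencies related by the magnification ratios reinforce one another, the bin at the true event rate tends to accumulate the largest sum, even when it is not the largest peak of the raw spectrum.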
  • Below, an example tempo estimator 110 is described in which this technique of estimating the tempo using an overlap spectrum, obtained by adding spectral copies scaled along the frequency axis at specific magnifications, can be implemented.
  • FIG. 3 is a block diagram of a tempo estimator in accordance with certain embodiments.
  • The tempo estimator 110 includes a time-to-frequency domain converter 310 that performs an operation of converting the time-domain waveform of the audio signal into a frequency-domain spectrum, a copy generator 315 that performs an operation of scaling the frequency axis of the spectrum at the specific magnifications,
  • a spectrum adder 320 that performs an operation of adding the scaled spectral copies, including the original spectrum, and a tempo estimator 325 that estimates the tempo of the audio signal by detecting one or more peak frequencies within a region of interest of the overlap spectrum (for example, 1 to 3.3 Hz, corresponding to 60 to 200 BPM).
  • the time-frequency domain converter 310 may be configured to convert a sequence that is a time domain representation of the audio signal into a frequency domain spectrum signal.
  • The conversion into the frequency domain can be performed using an FFT. If the region of interest is a very low frequency region, such as the 1 to 3.3 Hz band, the FFT can be applied by converting as many samples as possible into the frequency domain at once, so that a clear peak appears in this region.
  • The number of FFT points may be determined according to N_FFT ≥ (60 × fs) / Δ_BPM, where fs is the sampling frequency and Δ_BPM is the desired tempo resolution in BPM.
  • For example, if the sampling frequency is 1000 Hz and the tempo resolution is 0.1 BPM (i.e., the tempo is extracted to one decimal place),
  • an FFT calculation of 600,000 points or more should be performed.
  • The copy generator 315 may be configured to obtain a plurality of scaled copies of the frequency domain spectrum by scaling the frequency axis of the frequency domain spectrum according to each of a plurality of preset magnifications. This frequency-axis scaling can be performed without changing the magnitudes of the original spectrum. If the audio signal represents music in 1/2, 4/4, or 4/8 time, the set of ratios used for scaling may be {1/4, 2/4, 3/4, 4/4, 8/4, 12/4, 16/4}. If the audio signal represents music in 3/4 or 6/8 time, the set of ratios used for scaling may be {1/3, 2/3, 3/3, 6/3, 9/3, 12/3}.
  • If the region of interest is the region of 1 to 3.3 Hz, the copy generator 315 can generate the spectral copies by scaling only a portion of the low frequency band of the original spectrum (for example, the frequency region between 1/4 × 1 Hz and 4 × 3.3 Hz).
  • the time-frequency domain converter 310 may provide only the low frequency band portion of the original spectrum to the copy generator 315. As such, obtaining copies for a portion of the original spectrum may facilitate high speed computation and memory optimization.
  • The spectrum adder 320 is configured to obtain the overlap spectrum by adding the original spectrum and the scaled copies.
  • the tempo estimator 325 is configured to estimate the tempo of the audio signal by detecting one or a plurality of peak frequencies from the obtained overlap spectrum.
  • the tempo estimator 325 may detect one or more peak frequencies within a preset frequency range (eg, 1 to 3.3 Hz, which is a region of interest).
  • The estimated tempo may be provided from the tempo estimator 325. For example, when the FFT bin having the largest peak in the overlap spectrum is called FFTBin, the tempo estimator 325 may estimate the tempo of the audio signal as BPM = FFTBin × (fs / N_FFT) × 60, where fs is the sampling frequency and N_FFT is the number of FFT points.
  • The tempo estimator 325 may select candidate tempos by detecting a plurality of peak frequencies in descending order of magnitude in the overlap spectrum. The criterion for selecting the final estimated tempo among the candidate tempos may be set in various ways. In certain embodiments, the tempo estimator 325 divides the sequence of the audio signal into frames of a length set according to each selected candidate tempo (e.g., eight times the beat time derived from that candidate tempo) and estimates one of the candidate tempos as the tempo of the audio signal based on the cross-correlation between each frame and its adjacent frame.
  • This estimation method is based on the assumption that when a candidate tempo is the actual tempo or a multiple of the actual tempo, adjacent frames exhibit similarly shaped waveforms, and the cross-correlation between such frames is likely to have a symmetrical shape with a peak at the center.
  • For each candidate tempo, the tempo estimator 325 may determine whether two adjacent frames in the audio signal sequence are similar or dissimilar depending on whether the maximum-magnitude peak of the cross-correlation waveform between the frames occurs within a preset range from the center.
  • The tempo estimator 325 may score each candidate tempo by counting the pairs of adjacent frames determined to be similar, and estimate the candidate tempo with the highest score as the tempo of the audio signal. For example, for a candidate tempo, if the sequence of the audio signal is divided into 90 frames each having a length of 8 beats and these frames form 9 distinct patterns (i.e., a total of 8 pattern changes between adjacent frames), the score for that candidate tempo is 81/89 × 100 ≈ 91. In this way, the tempo estimator 325 divides the sequence of the audio signal into frames having a length set according to each candidate tempo, and estimates one of the candidate tempos as the tempo of the audio signal based on the similarity of each pair of two adjacent frames in the sequence.
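The frame-similarity scoring described above can be sketched as follows (illustrative: the names, the 8-beats-per-frame default, and the center tolerance are assumptions, and `np.correlate` stands in for the cross-correlation):

```python
import numpy as np

def score_candidate(samples, fs, bpm, beats_per_frame=8, center_tol=0.02):
    """Fraction of adjacent frame pairs whose cross-correlation peak lies
    near the center, i.e. whose waveforms are similarly shaped."""
    frame_len = int(round(beats_per_frame * 60.0 / bpm * fs))
    n_frames = len(samples) // frame_len
    similar = 0
    for i in range(n_frames - 1):
        a = samples[i * frame_len:(i + 1) * frame_len]
        b = samples[(i + 1) * frame_len:(i + 2) * frame_len]
        xcorr = np.correlate(a, b, mode='full')  # length 2 * frame_len - 1
        center = frame_len - 1                   # index of the zero-lag term
        if abs(int(np.argmax(xcorr)) - center) <= center_tol * frame_len:
            similar += 1
    return similar / max(n_frames - 1, 1)

def pick_tempo(samples, fs, candidate_bpms):
    """Estimate the tempo as the candidate with the highest similarity score."""
    return max(candidate_bpms, key=lambda bpm: score_candidate(samples, fs, bpm))
```

For a correct candidate tempo the frames line up period-for-period and the correlation peaks at the center; for a wrong candidate the frames drift out of phase and the peak moves away from the center.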
  • the tempo estimator 110 may further include a preprocessor 305.
  • the preprocessor 305 is configured to perform some preprocessing operations prior to converting the sequence of samples representing the audio signal in the time domain into the frequency domain spectrum.
  • Operations performed by the preprocessor 305 may include downsampling the original sequence to a level appropriate to the desired tempo resolution and updating the sequence accordingly. This downsampling is based on the fact that the major events that allow the tempo to be perceived in music-like sounds lie in the low frequency range.
  • the downsampled sequence can be converted to the frequency domain at a faster rate.
  • the preprocessor 305 may perform an operation of updating the sequence by performing an absolute value operation on the value of each sample in the sequence.
  • the method of converting the updated sequence into the frequency domain may provide a clearer peak in the frequency domain than the method of converting the original sequence into the frequency domain.
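The effect of the absolute-value operation can be illustrated with a small experiment (the values are assumptions for illustration: a 100 Hz carrier amplitude-modulated at 2 Hz, i.e. a 120 BPM "beat"). Before rectification the tempo band holds almost no energy; afterwards the envelope frequency appears as a clear spectral peak.

```python
import numpy as np

fs = 1000.0
t = np.arange(0, 20, 1.0 / fs)
# A 100 Hz tone whose amplitude is modulated at 2 Hz.
x = 0.5 * (1.0 + np.sin(2 * np.pi * 2.0 * t)) * np.sin(2 * np.pi * 100.0 * t)

def peak_in_tempo_band(sig, lo=0.5, hi=5.0):
    """Frequency of the largest spectral peak inside the tempo band."""
    spec = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), 1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    return freqs[band][np.argmax(spec[band])]

# The raw signal keeps its energy near 100 Hz; rectification shifts the
# 2 Hz envelope down into the band where the tempo peak is searched for.
rectified_peak = peak_in_tempo_band(np.abs(x))   # ≈ 2.0 Hz
```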
  • FIG. 4 is a flow chart of an audio signal processing method according to some embodiments.
  • In the audio signal processing method 400 shown in FIG. 4, highlight extraction techniques based on the estimated tempo are implemented.
  • the audio signal processing method 400 may be performed by the audio signal processing apparatus 100 described above.
  • a sequence of samples representing an audio signal in the time domain is received (405).
• in steps 410 to 435 described below, the tempo of the audio signal is estimated from the received sequence.
  • the sequence is downsampled.
  • an absolute value operation is performed on the value of each of the samples in the sequence.
  • the preprocessed sequence is converted into a frequency domain spectrum.
  • the converted spectrum may be a spectrum representing the magnitude in the frequency domain of the audio signal.
  • a plurality of copies of the frequency domain spectrum are generated.
  • the plurality of copy spectra may be obtained by scaling the frequency axis of the original spectrum according to each of the plurality of preset magnifications.
  • an overlapping spectrum is obtained by adding the plurality of scaled copy spectra including the original spectrum.
  • the tempo of the audio signal is estimated according to the peak frequency detected in the overlap spectrum.
  • a plurality of peak frequencies may be detected within the overlap spectrum.
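The scaled-copy/overlap procedure above can be sketched as a harmonic sum over the magnitude spectrum: copies of the spectrum are evaluated on frequency axes scaled by each factor and added, so that a tempo fundamental and its harmonics reinforce a single peak. The particular scaling factors and the linear interpolation used here are illustrative assumptions.

```python
import numpy as np

def overlap_spectrum(magnitude, factors=(1.0, 2.0, 3.0, 4.0)):
    """Sum copies of a magnitude spectrum whose frequency axes are
    scaled by each factor (factors are illustrative)."""
    bins = np.arange(len(magnitude))
    total = np.zeros(len(magnitude), dtype=float)
    for f in factors:
        # Sample the original spectrum at f times each bin's frequency;
        # bins that fall past the end of the spectrum contribute zero.
        total += np.interp(bins * f, bins, magnitude, right=0.0)
    return total

def peak_frequency(spectrum, freq_per_bin):
    """Return the frequency of the largest peak in the overlap spectrum."""
    return int(np.argmax(spectrum)) * freq_per_bin
```

Because energy at 2x, 3x, and 4x the fundamental folds back onto the fundamental bin, the peak of the overlap spectrum tends to land on the tempo frequency rather than on one of its harmonics.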
• the tempo of the audio signal may be estimated through the following time-domain analysis. First, candidate tempos according to the plurality of detected peak frequencies are selected. The sequence is then divided into frames whose length is set according to each candidate tempo. One of the candidate tempos may be estimated as the tempo of the audio signal based on the similarity between each frame and its adjacent frame (e.g., similarity or dissimilarity determined according to the cross-correlation between these two frames).
  • a plurality of sub-sequences each having a duration set according to the estimated tempo are derived from the sequence (440).
• the duration may be an integer multiple (e.g., 16) of the beat, and each sub-sequence may be spaced one beat apart from the adjacent sub-sequence.
• An energy level corresponding to each sub-sequence is then calculated (445).
• This energy level calculation may be performed by moving a sliding window having the same length as the above duration over the samples in the sequence and obtaining the energy of the samples within the window at each position.
  • the interval at which the sliding window moves may be equal to the time difference in which each sub-sequence is spaced apart from an adjacent sub-sequence.
• one of the plurality of sub-sequences, e.g., the sub-sequence with the largest energy level, is extracted (450) as the highlight portion of the audio signal.
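Steps 440 to 450 can be sketched as a sliding-window energy search. The 16-beat window length and one-beat hop follow the examples given above; the rest of the scaffolding (names, return values) is assumed for illustration.

```python
import numpy as np

def extract_highlight(samples, beat_len, beats_per_window=16, hop_beats=1):
    """Slide a window of `beats_per_window` beats over the sequence in
    `hop_beats`-beat hops and return the (start, end) sample indices of
    the window with the largest energy (sum of squared samples)."""
    win = beat_len * beats_per_window
    hop = beat_len * hop_beats
    best_start, best_energy = 0, -1.0
    for start in range(0, len(samples) - win + 1, hop):
        window = np.asarray(samples[start:start + win], dtype=float)
        energy = float(np.sum(window ** 2))
        if energy > best_energy:
            best_start, best_energy = start, energy
    return best_start, best_start + win
```

Because adjacent windows overlap by all but one beat, consecutive energies share most of their samples; an optimized version could update the sum incrementally instead of recomputing it per window.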
  • the order in which at least some steps of the audio signal processing method 400 are performed may be changed. According to some embodiments, at least some steps of the audio signal processing method 400 may be performed in combination with each other, omitted, or divided into sub-steps. According to some embodiments, one or more steps not shown in FIG. 4 may be performed in addition to the audio signal processing method 400.
• FIGS. 5-7 show screenshots of an audio player in accordance with certain embodiments.
• An exemplary audio player is executed by a computing device (or processor) that includes the audio signal processing apparatus 100 described above, and is configured to provide the highlight portion extracted by the audio signal processing apparatus 100 and to visually represent the highlight portion.
• Alternatively, the audio player may be implemented on a computing device storing computer-executable instructions configured to perform the audio signal processing method 400 described above, in association with computer-executable instructions configured to provide the highlight portion extracted according to the audio signal processing method 400 and to visually represent the highlight portion.
  • a media file representing time domain samples of the audio signal may be input to the audio player.
  • the audio player can receive a list indicating a collection of media files or a location where the media files are stored and play each media file obtained therefrom.
  • a time domain waveform 510 of an audio signal is provided on a display (eg, a touch sensitive screen) of a computing device.
  • a play button 520, a highlight button 530, a next track button 540, a previous track button 550 and a repeat play button 560 are provided on the display.
• User input selecting the play button 520, the highlight button 530, or the repeat play button 560 (e.g., a left mouse click or a tap with a finger or stylus) may activate or deactivate the corresponding mode.
• User input on the next track button 540 and the previous track button 550 may cause the audio player to play the next and previous audio signals, respectively.
• Other user inputs on the next track button 540 and the previous track button 550 (e.g., the user pressing and holding the next track button 540 or the previous track button 550 provided on a touch-sensitive screen) may activate or deactivate the fast-forward and rewind modes, respectively, and the next or previous audio signal can then be played.
• In the playback mode, the audio signal is played.
• As playback proceeds, the cursor 570 indicating the reproduction position moves. After the entire audio signal has been reproduced, the next audio signal can be reproduced. In the screenshot 500 of FIG. 5, the cursor 570 indicates that the position at 2 minutes 2.88 seconds of an audio signal 7 minutes 11.31 seconds long is being reproduced.
• In the fast-forward mode, audio signals are played back faster than in the playback mode.
• When the cursor 570 reaches the end of the audio signal, the next audio signal can be played.
• When the rewind mode is activated, the audio signal is played back in reverse.
• When the cursor 570 reaches the beginning of the audio signal, the previous audio signal can be played.
• When the repeat play mode is activated, playback of the audio signal is repeated continuously.
• When the highlight mode is activated (e.g., in response to user input on the highlight button 530), the cursor 570 moves to the starting point of the highlight portion resulting from the performance of the audio signal processing apparatus 100 or the audio signal processing method 400.
  • Indications representing such highlight portions may be recorded in tags of the media file representing the audio signal or stored in an external database.
  • the user of the audio player may replace the extracted highlight portion with another highlight portion through user input.
  • an indication indicating a new highlight portion may be stored in a tag of the media file or in an external database.
• Then, the highlight portion is played.
• FIG. 6 shows a screenshot provided on the display when the highlight mode is activated.
• the cursor 570 indicates that the portion of the audio signal at 3 minutes 32.67 seconds is being played back.
• the highlight portion 610 being reproduced is displayed in the time-domain waveform 510 of the audio signal in a highlighted manner, unlike the other portions.
• an interface (e.g., a pop-up menu or a slider) for setting the length of the highlight portion 610 of the audio signal may be provided on the display in response to a different user input (e.g., a right-click or a double tap with a finger or a stylus).
• the user of the audio player can select or adjust the length of the highlight portion 610 through this interface. This selection or adjustment may be performed in units of beats or measures of the audio signal.
• the beat or measure of the audio signal may be derived from the estimated value of the tempo of the audio signal.
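For illustration, beat and measure durations can be derived from an estimated tempo expressed in beats per minute. The 4-beats-per-measure default below is an assumption; the document does not specify a meter.

```python
def beat_and_measure_seconds(tempo_bpm, beats_per_measure=4):
    """Derive beat and measure durations (in seconds) from an estimated
    tempo. The beats-per-measure value is an assumed meter."""
    beat = 60.0 / tempo_bpm
    return beat, beat * beats_per_measure
```

At 120 BPM this gives a 0.5-second beat and, under the assumed 4/4 meter, a 2-second measure, which could then serve as the adjustment granularity for the highlight length.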
• the changed highlight portion 610 may be provided on the display and played from the beginning.
  • the highlight portion of the next audio signal or the previous audio signal may also have the same length as the changed highlight portion 610.
  • FIG. 7 shows a screenshot provided on the display when a user input is received on a portion other than the highlight portion during reproduction of the highlight portion of the audio signal.
  • cursor 570 moves to that portion and the audio signal is played from that portion.
• the time-domain waveform 510 of the audio signal is provided on the display, where the highlight button 530 is still active but the emphasis on the highlight portion has disappeared.
• When the highlight mode is activated and playback of the next audio signal or the previous audio signal is instructed (for example, in response to a user input on the next track button 540 or the previous track button 550), the highlight portion of that audio signal is played.
  • a computer-readable recording medium includes a program for performing the methods described herein (eg, the audio signal processing method 400) on a computer.
  • Such computer-readable recording media may include program instructions, local data files, local data structures, or the like, alone or in combination.
• the computer-readable recording medium may be one specially designed and configured for the present invention. Examples of such computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter.

Abstract

According to an exemplary embodiment, an audio signal processing device comprises: a tempo estimator configured to estimate the tempo of an audio signal from a sequence of samples representing the audio signal in a time domain; an energy level calculator configured to derive from the sequence a plurality of partially overlapping sub-sequences and to calculate the energy levels respectively corresponding to the plurality of sub-sequences; and a highlight extractor configured to extract, based on the energy levels, one sub-sequence of the plurality of sub-sequences as the highlight portion of the audio signal, each of the sub-sequences having a duration set according to the estimated tempo, and each of the energy levels indicating the energy of the samples of the sub-sequence to which it corresponds.
PCT/KR2013/011935 2013-12-20 2013-12-20 Dispositif et procédé de traitement de signal audio WO2015093668A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2013/011935 WO2015093668A1 (fr) 2013-12-20 2013-12-20 Dispositif et procédé de traitement de signal audio

Publications (1)

Publication Number Publication Date
WO2015093668A1 true WO2015093668A1 (fr) 2015-06-25

Family

ID=53402999

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2013/011935 WO2015093668A1 (fr) 2013-12-20 2013-12-20 Dispositif et procédé de traitement de signal audio

Country Status (1)

Country Link
WO (1) WO2015093668A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070113724A1 (en) * 2005-11-24 2007-05-24 Samsung Electronics Co., Ltd. Method, medium, and system summarizing music content
US20080046406A1 (en) * 2006-08-15 2008-02-21 Microsoft Corporation Audio and video thumbnails
KR100852196B1 (ko) * 2007-02-12 2008-08-13 삼성전자주식회사 음악 재생 시스템 및 그 방법
KR100995839B1 (ko) * 2008-08-08 2010-11-22 주식회사 아이토비 멀티미디어 디지털 콘텐츠의 축약정보 추출시스템과 축약 정보를 활용한 다중 멀티미디어 콘텐츠 디스플레이 시스템 및 그 방법
KR20120063528A (ko) * 2009-10-30 2012-06-15 돌비 인터네셔널 에이비 복합 확장 인지 템포 추정

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081271A (zh) * 2019-11-29 2020-04-28 福建星网视易信息系统有限公司 基于频域和时域的音乐节奏检测方法及存储介质
CN111081271B (zh) * 2019-11-29 2022-09-06 福建星网视易信息系统有限公司 基于频域和时域的音乐节奏检测方法及存储介质
CN113709578A (zh) * 2021-09-14 2021-11-26 上海幻电信息科技有限公司 弹幕展示方法及装置
CN113709578B (zh) * 2021-09-14 2023-08-11 上海幻电信息科技有限公司 弹幕展示方法、装置、设备及介质
CN114096047A (zh) * 2022-01-11 2022-02-25 卧安科技(深圳)有限公司 一种音频控制灯效方法、装置、系统及存储介质
CN114096047B (zh) * 2022-01-11 2022-04-26 卧安科技(深圳)有限公司 音频控制灯效方法、装置、系统及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13899471

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30/09/2016)

122 Ep: pct application non-entry in european phase

Ref document number: 13899471

Country of ref document: EP

Kind code of ref document: A1