WO2015093668A1

WO2015093668A1 - Device and method for processing audio signal

Info

Publication number: WO2015093668A1
Application number: PCT/KR2013/011935
Authority: WO
Inventors: 김태홍
Original assignee: 김태홍
Priority date: 2013-12-20
Filing date: 2013-12-20
Publication date: 2015-06-25

Abstract

According to an exemplary embodiment, provided is an audio signal processing device, the device comprising: a tempo estimator configured to estimate the tempo of an audio signal from a sequence of samples which indicate the audio signal in a time domain; an energy level calculator configured to derive a plurality of partially overlapping sub-sequences from the sequence and to calculate energy levels corresponding to the plurality of sub-sequences, respectively; and a highlight extractor configured to extract, on the basis of the energy levels, one of the plurality of sub-sequences as the highlight portion of the audio signal, wherein each of the sub-sequences has a time duration defined according to the estimated tempo, and each of the energy levels indicates the energy of the samples of the sub-sequences corresponding to the energy levels, respectively.

Description

Audio signal processing apparatus and method

The disclosed embodiments relate to an apparatus and a method for processing an audio signal, and more particularly, to an apparatus and a method for extracting a highlight portion of an audio signal.

Background

Recently, many electronic devices have come into use for reproducing an audio signal from a media file that includes a representation of the audio signal (e.g., data representing a sequence of samples). As the capacity of the storage media included in such devices increases, a large number of media files may be stored in a single electronic device.

Presenting highlight portions of various media files included in the electronic device to the user of the electronic device enables the user to easily select among the media files that the user wishes to enjoy.

Typically, existing techniques for extracting highlights from an audio signal represented in a media file involve dividing the audio signal into intervals of an arbitrary time unit (e.g., 1 second) that is not based on the unique characteristics of the audio signal. That is, this division applies a common time unit to any audio signal. Therefore, existing highlight extraction techniques are limited in their ability to provide highlights that the user experiences as natural.

The disclosed embodiments are directed to an apparatus and method for estimating the tempo of an audio signal and extracting highlight portions of the audio signal based on the estimated tempo.

According to an exemplary embodiment, there is provided an audio signal processing apparatus comprising: a tempo estimator configured to estimate a tempo of an audio signal from a sequence of samples representing the audio signal in a time domain; an energy level calculator configured to derive a plurality of partially overlapping sub-sequences from the sequence and to calculate energy levels respectively corresponding to the plurality of sub-sequences; and a highlight extractor configured to extract one of the plurality of sub-sequences as a highlight portion of the audio signal based on the energy levels, wherein each sub-sequence has a time duration set according to the estimated tempo, and each energy level represents the energy of the samples of the sub-sequence corresponding to that energy level.

In the audio signal processing apparatus, the tempo estimator may also be configured to estimate the tempo in units of beats per minute (BPM), and the duration may be an integer multiple of the beat time according to the estimated tempo.

In the audio signal processing apparatus, a start sample of each sub-sequence may be spaced apart from a start sample of an adjacent sub-sequence by a time difference set according to the estimated tempo, and the duration may be an integer multiple of the time difference.

In the audio signal processing apparatus, the energy level calculator may also be configured to calculate the energy level corresponding to each sub-sequence by obtaining the energy of the samples enclosed by a sliding window while moving the sliding window by the time difference over the samples in the sequence, and the sliding window may have the same length as the duration.

In the audio signal processing apparatus, the highlight extractor may also be configured to extract, as the highlight portion, a sub-sequence having the largest corresponding energy level among the sub-sequences.

In the audio signal processing apparatus, the tempo estimator may comprise: a time-to-frequency domain converter configured to convert the sequence into a frequency domain spectrum; a copy generator configured to obtain a plurality of scaled copies of the frequency domain spectrum by scaling the frequency axis of the frequency domain spectrum according to each of a plurality of preset magnifications; a spectrum adder configured to obtain an overlap spectrum by adding the plurality of scaled copies, including the frequency domain spectrum itself; and a tempo estimator configured to estimate the tempo by detecting one or a plurality of peak frequencies of the overlap spectrum.

In the audio signal processing apparatus, the tempo estimator may also be configured to detect the one or a plurality of peak frequencies within a preset frequency region of the overlap spectrum.

In the audio signal processing apparatus, the frequency domain spectrum may be a magnitude spectrum in the frequency domain of the audio signal.

In the audio signal processing apparatus, the tempo estimator may further comprise a preprocessor configured to update the sequence, before the conversion, by performing at least one of downsampling the sequence and applying an absolute value operation to the value of each sample in the sequence.

In the audio signal processing apparatus, the tempo estimator may also be configured to select candidate tempos according to the detected plurality of peak frequencies, divide the sequence into frames having a length set according to each candidate tempo, and estimate one of the candidate tempos as the tempo based on the similarity between each frame and its adjacent frame.

According to another exemplary embodiment, there is provided an audio signal processing method comprising: estimating a tempo of an audio signal from a sequence of samples representing the audio signal in a time domain; deriving a plurality of partially overlapping sub-sequences from the sequence and calculating energy levels respectively corresponding to the plurality of sub-sequences; and extracting one of the plurality of sub-sequences as a highlight portion of the audio signal based on the energy levels, wherein each sub-sequence has a duration set according to the estimated tempo, and each energy level represents the energy of the samples of the sub-sequence corresponding to that energy level.

In the audio signal processing method, the estimating may include estimating the tempo in units of beats per minute (BPM), and the duration may be an integer multiple of the beat time according to the estimated tempo.

In the audio signal processing method, a start sample of each sub-sequence may be spaced apart from a start sample of an adjacent sub-sequence by a time difference set according to the estimated tempo, and the duration may be an integer multiple of the time difference.

In the audio signal processing method, the calculating may include calculating the energy level corresponding to each sub-sequence by obtaining the energy of the samples enclosed by a sliding window while moving the sliding window by the time difference over the samples in the sequence, and the sliding window may have the same length as the duration.

In the audio signal processing method, the extracting may include extracting, as the highlight portion, a sub-sequence having the largest corresponding energy level among the sub-sequences.

In the audio signal processing method, the estimating may comprise: converting the sequence into a frequency domain spectrum; obtaining a plurality of scaled copies of the frequency domain spectrum by scaling the frequency axis of the frequency domain spectrum according to each of a plurality of preset magnifications; obtaining an overlap spectrum by adding the plurality of scaled copies, including the frequency domain spectrum itself; and estimating the tempo by detecting one or a plurality of peak frequencies in the overlap spectrum.

In the audio signal processing method, detecting the one or the plurality of peak frequencies may include detecting the one or the plurality of peak frequencies within a preset frequency region of the overlap spectrum.

In the audio signal processing method, the frequency domain spectrum may be a magnitude spectrum in the frequency domain of the audio signal.

The audio signal processing method may further comprise, before the conversion, updating the sequence by performing at least one of downsampling the sequence and applying an absolute value operation to the value of each sample in the sequence.

In the audio signal processing method, estimating the tempo by detecting the one or the plurality of peak frequencies may comprise: selecting candidate tempos according to the detected plurality of peak frequencies; dividing the sequence into frames of a length set according to each candidate tempo; and estimating one of the candidate tempos as the tempo based on the similarity between each frame and its adjacent frame.

According to yet another exemplary embodiment, there is provided a computer readable storage medium having stored thereon computer executable instructions configured to perform any one of the above methods when executed by a processor.

According to certain embodiments, the highlight portion of the audio signal may be extracted based on an estimated value of the tempo of the audio signal.

The highlight portion extracted in accordance with certain embodiments may provide an improved experience for the listener.

FIG. 1 is a block diagram of an audio signal processing apparatus according to some embodiments;

FIG. 2 illustrates a sequence of samples of an audio signal and a plurality of sub-sequences derived from the sequence;

FIG. 3 is a block diagram of a tempo estimator in accordance with certain embodiments;

FIG. 4 is a flowchart of an audio signal processing method according to some embodiments;

FIGS. 5 to 7 illustrate screen shots of an audio player in accordance with certain embodiments.

Hereinafter, specific embodiments of the present invention will be described with reference to the drawings. The following detailed description is provided to assist in a comprehensive understanding of the methods, devices, and/or systems described herein. However, this is only an example and the present invention is not limited thereto.

In describing the embodiments of the present invention, a detailed description of known technology related to the present invention is omitted where it is determined that it may unnecessarily obscure the gist of the present invention. In addition, the terms described below are defined in consideration of their functions in the present invention, and may vary according to the intention or custom of a user or an operator. Therefore, their definitions should be made based on the contents of the entire specification. The terminology used in the description is for the purpose of describing embodiments of the invention only and should not be limiting. Unless explicitly used otherwise, the singular forms “a,” “an,” and “the” include the plural. In this description, expressions such as "comprises" or "includes" are intended to indicate certain features, numbers, steps, actions, elements, or portions or combinations thereof, and should not be construed to exclude the presence or possibility of one or more other features, numbers, steps, actions, elements, or portions or combinations thereof beyond those described.

Hereinafter, improved techniques for extracting feature portions or highlight portions of an audio signal based on the estimated tempo of the audio signal included in the media signal are presented.

FIG. 1 is a block diagram of an audio signal processing apparatus according to some embodiments.

Highlight extraction techniques based on the estimated tempo may be implemented in the audio signal processing apparatus 100 of FIG. 1. The exemplary audio signal processing apparatus 100 includes a tempo estimator 110, an energy level calculator 120, and a highlight extractor 130. Tempo estimator 110 is configured to estimate the tempo of the audio signal. The energy level calculator 120 is configured to calculate energy levels respectively corresponding to the plurality of time periods of the audio signal. The length of each time period may be set according to the tempo estimated by the tempo estimator 110. The highlight extractor 130 is configured to identify the highlight portion of the audio signal. Based on the energy levels calculated by the energy level calculator 120, one of the time periods of the audio signal may be extracted as the highlight portion.

In some embodiments, the tempo estimator 110, the energy level calculator 120, and the highlight extractor 130 of the audio signal processing apparatus 100 may be configured to perform the following operations.

The tempo estimator 110 may estimate the tempo of the audio signal from a sequence of samples representing a time-domain waveform of the audio signal. The sample sequence may be stored locally in the audio signal processing apparatus 100 (e.g., as data representing a media signal that includes the audio signal), or may be provided to the tempo estimator 110 from a media source remote from the audio signal processing apparatus 100.

The tempo of a piece of sound refers to the perceived rate at which its successive tones progress. One commonly used unit of tempo is BPM (beats per minute). For example, if the audio signal provided to the tempo estimator 110 has a tempo of 60 BPM, one beat of the audio signal lasts one second. A phrase of a typical pop song in 4/4 time consists of four measures and each measure consists of four beats, so a phrase lasts 16 seconds if the tempo of the song is 60 BPM.
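The arithmetic in this example can be checked with a minimal sketch. The function names below are illustrative and do not come from the source; the only relation assumed is that one beat lasts 60 / BPM seconds.

```python
def beat_seconds(bpm: float) -> float:
    """Duration of one beat, in seconds, for a tempo given in BPM."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    return 60.0 / bpm


def span_seconds(bpm: float, beats: int) -> float:
    """Duration, in seconds, of a span that is `beats` beats long."""
    return beats * beat_seconds(bpm)


if __name__ == "__main__":
    print(beat_seconds(60))      # one beat at 60 BPM
    print(span_seconds(60, 4))   # a four-beat span at 60 BPM
    print(span_seconds(60, 16))  # a sixteen-beat span at 60 BPM
```

At 60 BPM the sixteen-beat span comes out at 16 seconds, matching the figure given in the text.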

The tempo estimated by the tempo estimator 110 may be used by the audio signal processing apparatus 100 to extract a highlight portion of the audio signal. This is based on the observation that, when an audio signal represents a piece of music, a climax or characteristic melody typically appears in units of a measure or a phrase of that piece. Exemplary tempo estimation techniques that may be applied to the tempo estimator 110 will be described below.

The energy level calculator 120 is configured to calculate energy levels respectively corresponding to the plurality of time periods of the audio signal. For example, the energy level calculator 120 can derive a plurality of partially overlapping sub-sequences from a sample sequence representing an audio signal and calculate an energy level representing the energy of the samples of each sub-sequence.

Each sub-sequence has a time duration set according to the tempo estimated by the tempo estimator 110. The start sample in each sub-sequence may be spaced a predetermined time difference from the start sample in the sub-sequence adjacent to that sub-sequence. This time difference may be set according to the tempo of the audio signal estimated by the tempo estimator 110. For example, the duration of each sub-sequence may be an integer multiple of that time difference.

For the sake of explanation, assume that the time-domain waveform of the audio signal is given by the dashed curve 210 shown in FIG. 2, and that the waveform is sampled at 5 Hz and represented by a sequence of 500 samples in total. If the tempo of the audio signal estimated by the tempo estimator 110 is 60 BPM, one beat of the audio signal contains 5 samples. The duration of each sub-sequence may be set according to the tempo estimate. This setting may be performed by the audio signal processing apparatus 100 (e.g., by the energy level calculator 120) or according to an input from a user of the audio signal processing apparatus 100. For example, the duration of each sub-sequence can be set to a time of four beats which, as described above, corresponds to one measure of music in 4/4 time. This duration is exemplary only, and another duration (e.g., a time of 16 beats, corresponding to one phrase of music in 4/4 time) may be set. With continued reference to FIG. 2, the start samples of adjacent sub-sequences are spaced apart from each other by five consecutive samples (i.e., one beat). That is, fifteen samples (i.e., three beats) overlap in any two adjacent sub-sequences. The energy level calculator 120 moves a sliding window, whose length equals the duration of each sub-sequence, over the 500 samples in the sequence, starting from the first sample and advancing each time by the time difference corresponding to five consecutive samples. The partially overlapping sub-sequences can be derived by selecting, each time the window moves, the 20 samples enclosed by the sliding window as a sub-sequence. The energy level calculator 120 may calculate the energy level corresponding to each sub-sequence by obtaining the energy of the samples in the sliding window.
According to some embodiments, the energy level calculator 120 first obtains the energy value of the five samples included in each beat, and then uses these per-beat energy values to calculate the energy of the 20 samples in the sliding window. In this way, the energy level corresponding to each sub-sequence can be calculated with less complexity.
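The per-beat shortcut described above can be sketched as follows. This is a minimal illustration, not the implementation of the energy level calculator 120: the function names are hypothetical, and "energy" is assumed here to mean the sum of squared sample values, which the text does not specify.

```python
from typing import List, Sequence


def beat_energies(samples: Sequence[float], samples_per_beat: int) -> List[float]:
    """Energy of each complete beat (assumption: energy = sum of squares)."""
    n_beats = len(samples) // samples_per_beat
    return [
        sum(x * x for x in samples[b * samples_per_beat:(b + 1) * samples_per_beat])
        for b in range(n_beats)
    ]


def window_energies(
    samples: Sequence[float], samples_per_beat: int, beats_per_window: int
) -> List[float]:
    """Energy of every window of `beats_per_window` beats, hopping by one beat."""
    per_beat = beat_energies(samples, samples_per_beat)
    n_windows = len(per_beat) - beats_per_window + 1
    if n_windows <= 0:
        return []
    energies = [sum(per_beat[:beats_per_window])]
    for i in range(1, n_windows):
        # Slide by one beat: drop the beat that left, add the beat that entered.
        leaving = per_beat[i - 1]
        entering = per_beat[i + beats_per_window - 1]
        energies.append(energies[-1] - leaving + entering)
    return energies


if __name__ == "__main__":
    # Figures from the example: 500 samples, 5 samples per beat, 4-beat window.
    demo = [float(i % 7) for i in range(500)]
    levels = window_energies(demo, samples_per_beat=5, beats_per_window=4)
    print(len(levels))  # number of partially overlapping sub-sequences
```

With the figures from the example, the sequence holds 100 beats, so a four-beat window hopping by one beat yields 97 sub-sequences. Each update costs two additions instead of re-summing 20 samples.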

According to certain embodiments, the energy level calculator 120 is configured to preprocess the original audio signal using a predetermined filter and to calculate energy levels respectively corresponding to a plurality of time periods of the preprocessed audio signal. The filter used in this preprocessing may be a filter that applies a model of human auditory perception to the frequency domain representation of the audio signal (e.g., a Mel-frequency filter, or a variant of the Mel-frequency filter with simplified coefficients), or a filter according to an auditory perception model other than the Mel-frequency filter. For example, the energy level calculator 120 may preprocess the sample sequence representing the audio signal as described above, then derive partially overlapping sub-sequences by moving the sliding window by the time difference over the preprocessed sample sequence, and calculate the energy level representing the energy of the samples of each sub-sequence by obtaining the energy of the samples enclosed by the sliding window.

The highlight extractor 130 is configured to extract the highlight portion of the audio signal based on the energy levels calculated by the energy level calculator 120. For example, the highlight extractor 130 may extract one of the plurality of sub-sequences derived from the energy level calculator 120 as the highlight portion. The extracted highlight portion may be a sub-sequence having the largest energy level among the plurality of sub-sequences.
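Selecting the sub-sequence with the largest energy level amounts to an arg-max over the calculated levels. The sketch below assumes the energy levels are already available as a list with one entry per sub-sequence; `extract_highlight`, `hop` and `window` are illustrative names, with `hop` the spacing between start samples and `window` the sub-sequence length in samples.

```python
from typing import List, Sequence, Tuple


def extract_highlight(
    samples: Sequence[float], energies: Sequence[float], hop: int, window: int
) -> Tuple[int, List[float]]:
    """Return (start index, samples) of the sub-sequence with the largest energy.

    If several sub-sequences share the largest energy, the earliest one is
    returned; the text does not say how ties should be handled.
    """
    if not energies:
        raise ValueError("no energy levels to choose from")
    best = max(range(len(energies)), key=lambda i: energies[i])
    start = best * hop
    return start, list(samples[start:start + window])


if __name__ == "__main__":
    start, highlight = extract_highlight(list(range(40)), [1.0, 5.0, 3.0], hop=5, window=20)
    print(start, len(highlight))
```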

As described above, the duration of each sub-sequence may be an integer multiple of the beat time (e.g., the time corresponding to a measure or a phrase) according to the tempo of the audio signal estimated by the tempo estimator 110. Thus, the highlight portion provided by the highlight extractor 130 may also have a duration that is an integer multiple of the beat time. Basing the extraction on the estimated tempo in this way may increase the likelihood that the listener experiences the extracted highlight portion as a natural excerpt. In addition, even when highlight portions extracted in this manner from two audio signals are concatenated, an abrupt transition from one portion to the other may be avoided.

For improved highlight reproduction, the tempo estimator 110 of the audio signal processing apparatus 100 may employ techniques that estimate the global tempo of the audio signal or a local tempo of a portion of the audio signal.

Some tempo estimation techniques rely on detecting events in an audio signal represented in the time domain. These time domain analysis techniques involve a preprocessing step and a tempo detection step. In the preprocessing step, major events with a large auditory stimulus are detected in the audio signal (e.g., a music signal) and the times at which those events occur are obtained. In the tempo detection step, the actual tempo of the audio signal is estimated based on the obtained times. For example, an event in an audio signal representing a sound played on a musical instrument or a human utterance may exhibit a characteristic attack-decay-sustain-release (ADSR) envelope in the time-domain waveform of the audio signal. Filtering, envelope detection, and onset detection may be used in turn to detect events. In popular music, these events can be attributed to percussion; because the sound components from percussion usually span the entire frequency band, a spectrogram and/or filter banks may be used to analyze each sub-band. Estimating the actual tempo may include determining the minimum time unit or the period of the repeating pattern in the audio signal by clustering the time differences between each pair of adjacent events among the detected events (e.g., inter-onset interval clustering).
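As a rough illustration of the inter-onset interval idea, the sketch below takes onset times that are assumed to have been detected already, groups the intervals between adjacent onsets into fixed-width bins, and converts the mean interval of the most populated bin into BPM. The fixed-width binning is a simple stand-in for clustering chosen for this sketch; the bin width and the function name are assumptions, not details from the source.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def tempo_from_onsets(onset_times: Sequence[float], bin_width: float = 0.05) -> float:
    """Estimate BPM from onset times (seconds) by histogramming the
    inter-onset intervals; `bin_width` (seconds) is an illustrative choice."""
    iois = [b - a for a, b in zip(onset_times, onset_times[1:]) if b > a]
    if not iois:
        raise ValueError("need at least two distinct onsets")
    clusters: Dict[int, List[float]] = defaultdict(list)
    for ioi in iois:
        clusters[round(ioi / bin_width)].append(ioi)
    # The most populated bin is taken as the dominant repetition period.
    largest = max(clusters.values(), key=len)
    period = sum(largest) / len(largest)
    return 60.0 / period


if __name__ == "__main__":
    # Onsets every 0.5 s with one extra off-grid event at 1.75 s.
    print(tempo_from_onsets([0.0, 0.5, 1.0, 1.5, 1.75, 2.25]))
```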

Some other tempo estimation techniques rely on detecting a peak frequency in the spectrum of an audio signal represented in the frequency domain. To convert the audio signal from the time domain to the frequency domain, a fast Fourier transform (FFT) may be applied, for example. A peak frequency of the signal transformed into the frequency domain indicates the frequency at which an event occurs. In particular, a typical audio signal whose tempo is to be estimated (e.g., an audio signal representing music) produces peak frequencies in a low frequency band (e.g., 1 to 3.3 Hz, or 60 to 200 BPM). These frequency domain analysis techniques also involve preprocessing that extracts the envelope from a large number of samples and lowers the sample rate appropriately so that the peak frequency can be detected accurately. The number of FFT points is determined according to the sample rate and the desired BPM precision, and an estimate of the BPM can be obtained from the BPM precision and the FFT bin at which the peak frequency occurs. Techniques that combine frequency domain analysis and time domain analysis of an audio signal may also be applied.
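A minimal sketch of the peak-frequency approach follows. To stay dependency-free it evaluates the discrete Fourier transform directly at each integer BPM between 60 and 200 instead of running an FFT, and it uses the absolute value of each sample as a crude envelope. Both choices, and the function name, are simplifications made for this illustration.

```python
import cmath
import math
from typing import Sequence


def estimate_bpm_by_peak(
    samples: Sequence[float], fs: float, bpm_lo: int = 60, bpm_hi: int = 200
) -> int:
    """Return the integer BPM in [bpm_lo, bpm_hi] whose frequency has the
    largest spectral magnitude in the rectified signal."""
    envelope = [abs(x) for x in samples]
    best_bpm, best_mag = bpm_lo, -1.0
    for bpm in range(bpm_lo, bpm_hi + 1):
        freq_hz = bpm / 60.0
        step = -2j * math.pi * freq_hz / fs
        # Direct DFT evaluation at freq_hz; zero samples contribute nothing.
        mag = abs(sum(v * cmath.exp(step * n) for n, v in enumerate(envelope) if v))
        if mag > best_mag:
            best_bpm, best_mag = bpm, mag
    return best_bpm


if __name__ == "__main__":
    fs = 50.0
    # Synthetic click every 0.5 s (25 samples) for 30 s, with alternating sign.
    clicks = [(-1.0) ** (n // 25) if n % 25 == 0 else 0.0 for n in range(1500)]
    print(estimate_bpm_by_peak(clicks, fs))
```

The synthetic clicks alternate in sign, so the raw waveform repeats only every second. Taking absolute values first restores the 0.5-second periodicity, which is why the sketch reports 120 BPM for this input.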

Still other tempo estimation techniques involve scaling the frequency axis of the spectrum of the audio signal by specific magnifications and adding together the original spectrum and the spectral copies scaled by the different magnifications. These magnifications correspond to the frequencies of events that may occur given the time signature of the audio signal. For example, the set of magnifications applied to the frequency axis scaling of the spectrum of an audio signal representing music in 4/4 time is {1/4, 2/4, 3/4, 4/4, 8/4, 12/4, 16/4}. As another example, magnifications of {1/3, 2/3, 3/3, 6/3, 9/3, 12/3} are applied to an audio signal representing music in 3/4 time. These magnifications match the frequencies at which major events occur in most musical works. Of course, other sets of magnifications may be used in addition to those mentioned above. In particular, with these tempo estimation techniques, the processing of the audio signal is simplified, and the tempo of the audio signal can be estimated with considerable accuracy even if the audio signal includes many offset patterns.

Typical music has relatively large peaks in the low frequency range when expressed in the frequency domain. However, simply taking the frequency of the largest of these observed peaks as the tempo of the music may give low accuracy. In fact, the peak frequencies reflect how often events occur because of the low-pitched instruments of the music (e.g., bass, bass drum, bassoon, tuba and other rhythm section instruments). In these example techniques, the tempo is estimated based on the ratios of possible event occurrence frequencies to the tempo. For this purpose, it may be useful to sequence the events of a given audio signal in units of a quarter beat or longer. In general, bass instruments that produce the rhythm of music have a longer period of oscillation than treble instruments, and events caused by bass instruments take longer to sustain or release; the next bass instrument event is therefore unlikely to occur before the release interval of the previously sequenced bass instrument event has ended. In light of this, it makes sense to assume that events in music in 4/4 time do not occur more than four times per beat. From the perspective of the frequency domain spectrum of music in 4/4 time, one, two, three or four events occur per beat. Similarly, the frequency of event occurrences possible in one measure or one phrase can be regarded as once, twice, three times, four times, eight times, twelve times and sixteen times per four beats, which correspond to the magnifications in the above-mentioned set. The spectrum of the above music in 4/4 time (e.g., the spectrum representing the magnitude of the music in the frequency domain) is scaled along the frequency axis according to each of these predetermined magnifications, and a predetermined number of peak frequencies may be extracted, in descending order of magnitude, from the overlap spectrum that represents the sum of the scaled spectral copies.
The single extracted peak frequency, or one of the extracted peak frequencies, may be estimated as the tempo of the music. This is based on the observation that the overlap spectrum produces distinct peaks (although not necessarily the largest ones) at frequencies corresponding to the actual tempo.
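The overlap-spectrum idea can be sketched on a magnitude spectrum that is already available as a list of bin values. The direction of the axis scaling is an assumption made here: the copy for magnification m is evaluated as S(m·f), so that an event rate of m times the tempo adds up at the tempo frequency. The text does not state the direction explicitly. The function names and the linear interpolation between bins are likewise illustrative.

```python
import math
from typing import Dict, Sequence

# Magnification set given in the text for music in 4/4 time.
MAGNIFICATIONS_4_4 = (1 / 4, 2 / 4, 3 / 4, 4 / 4, 8 / 4, 12 / 4, 16 / 4)


def sample_spectrum(spectrum: Sequence[float], pos: float) -> float:
    """Linearly interpolate the magnitude spectrum at fractional bin `pos`."""
    last = len(spectrum) - 1
    if pos < 0 or pos > last:
        return 0.0
    i = int(pos)
    if i == last:
        return float(spectrum[i])
    frac = pos - i
    return (1.0 - frac) * spectrum[i] + frac * spectrum[i + 1]


def overlap_spectrum(
    spectrum: Sequence[float], k_lo: int, k_hi: int,
    magnifications: Sequence[float] = MAGNIFICATIONS_4_4,
) -> Dict[int, float]:
    """Sum of the scaled copies at each bin k in [k_lo, k_hi].
    Assumption: the copy for magnification m at bin k is S(m * k)."""
    return {
        k: sum(sample_spectrum(spectrum, m * k) for m in magnifications)
        for k in range(k_lo, k_hi + 1)
    }


def estimate_bpm(spectrum: Sequence[float], df: float,
                 f_lo: float = 1.0, f_hi: float = 3.3) -> float:
    """BPM at the largest overlap-spectrum peak within [f_lo, f_hi] Hz,
    where `df` is the bin spacing of `spectrum` in Hz."""
    k_lo = math.ceil(f_lo / df - 1e-9)
    k_hi = math.floor(f_hi / df + 1e-9)
    overlap = overlap_spectrum(spectrum, k_lo, k_hi)
    k_peak = max(overlap, key=overlap.get)
    return k_peak * df * 60.0


if __name__ == "__main__":
    # Synthetic spectrum, 0.05 Hz per bin: the largest raw peak sits at 1 Hz,
    # with smaller peaks at 2, 4 and 8 Hz.
    spec = [0.0] * 401
    spec[20], spec[40], spec[80], spec[160] = 10.0, 6.0, 5.0, 4.0
    print(estimate_bpm(spec, df=0.05))
```

In the synthetic input the largest raw peak within 1 to 3.3 Hz is at 1 Hz (60 BPM), yet the overlap spectrum peaks at 2 Hz (120 BPM), because the peaks at 1, 2, 4 and 8 Hz all fold onto that bin. This mirrors the remark that the tempo peak need not be the largest peak in the original spectrum.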

For a more detailed description, an example tempo estimator 110 is described in which techniques for estimating tempo using an overlapping spectrum that adds spectral copies scaled to a frequency axis at specific magnifications can be implemented.

FIG. 3 is a block diagram of a tempo estimator in accordance with certain embodiments.

As shown in FIG. 3, the tempo estimator 110 includes a time-frequency domain converter 310 that converts a time-domain waveform of an audio signal into a frequency-domain spectrum, a copy generator 315 that scales the frequency axis of the spectrum by specific magnifications, a spectrum adder 320 that adds the scaled spectral copies, including the original spectrum, to obtain an overlap spectrum, and a tempo estimator 325 that estimates the tempo of the audio signal by detecting one or more peak frequencies in the overlap spectrum (for example, one or more peak frequencies within a region of interest of 1 to 3.3 Hz, i.e., 60 to 200 BPM).

The time-frequency domain converter 310 may be configured to convert the sequence, which is a time domain representation of the audio signal, into a frequency domain spectrum. In certain embodiments, the conversion to the frequency domain can be performed using an FFT. If the region of interest is a very low frequency region, such as the 1 to 3.3 Hz band, the FFT can be applied to as many samples as possible at once so that a clear peak appears in this region. The number of FFT points may be determined according to the following equation.
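The equation itself is not reproduced in this text. A relation consistent with the worked example that follows, offered as a reconstruction under that assumption and not as the original notation, is:

```latex
N_{\mathrm{FFT}} \;\ge\; \frac{60 \, f_s}{\Delta_{\mathrm{BPM}}}
```

Here the symbols are chosen for this note: N_FFT is the number of FFT points, f_s is the sampling frequency in Hz, and Δ_BPM is the desired tempo resolution in BPM. The relation follows from requiring the bin spacing f_s / N_FFT, expressed in beats per minute as 60 · f_s / N_FFT, to be no coarser than Δ_BPM.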

For example, if the sampling frequency is 1000 Hz and the tempo resolution is 0.1 (i.e., indicating that the tempo is extracted to one decimal place), an FFT calculation of 600000 points or more should be performed.

The copy generator 315 may be configured to obtain a plurality of scaled copies of the frequency domain spectrum by scaling the frequency axis of the frequency domain spectrum according to each of a plurality of preset magnifications. This frequency axis scaling can be performed without changing the magnitudes of the original spectrum. If the audio signal represents music in 1/2, 4/4 or 4/8 time, the set of ratios used for scaling may be {1/4, 2/4, 3/4, 4/4, 8/4, 12/4, 16/4}. If the audio signal represents music in 3/4 or 6/8 time, the set of ratios used for scaling may be {1/3, 2/3, 3/3, 6/3, 9/3, 12/3}. If the region of interest is the region of 1 to 3.3 Hz, the copy generator 315 can generate the spectral copies by scaling only a low frequency portion of the original spectrum (for example, the frequency region from 1/4 × 1 Hz to 4 × 3.3 Hz). Alternatively, the time-frequency domain converter 310 may provide only the low frequency band portion of the original spectrum to the copy generator 315. Obtaining copies for only a portion of the original spectrum in this way may facilitate high speed computation and memory optimization.

The spectrum adder 320 is configured to obtain an overlap spectrum by adding the original spectrum and its copies. The tempo estimator 325 is configured to estimate the tempo of the audio signal by detecting one or a plurality of peak frequencies in the obtained overlap spectrum. The tempo estimator 325 may detect the one or more peak frequencies within a preset frequency range (e.g., 1 to 3.3 Hz, the region of interest). The estimated tempo may be provided from the tempo estimator 325. For example, when the FFT bin having the largest peak in the overlap spectrum is called FFTBin, the tempo estimator 325 may estimate the tempo BPM of the audio signal as shown below.
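The expression itself does not appear in this text. One conversion consistent with the FFT sizing example given earlier (1000 Hz sampling, 0.1 BPM resolution, 600000 points) is to multiply the bin index by the bin spacing in Hz and then by 60. The sketch below is written under that assumption, and `bin_to_bpm` is a hypothetical name.

```python
def bin_to_bpm(fft_bin: int, fs: float, n_fft: int) -> float:
    """Convert an FFT bin index into BPM.

    Assumes bin k corresponds to k * fs / n_fft Hz, i.e. 60 * k * fs / n_fft BPM.
    """
    if n_fft <= 0:
        raise ValueError("n_fft must be positive")
    return fft_bin * fs / n_fft * 60.0


if __name__ == "__main__":
    # With fs = 1000 Hz and a 600000-point FFT, each bin spans 0.1 BPM.
    print(bin_to_bpm(1200, 1000.0, 600000))
```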

As another example, the tempo estimator 325 may select candidate tempos by detecting a plurality of peak frequencies in the overlap spectrum in descending order of magnitude. The criteria for selecting the final estimated tempo from among the candidate tempos may be set in various ways. In certain embodiments, the tempo estimator 325 divides the sequence of the audio signal into frames of a length set according to each selected candidate tempo (e.g., eight times the beat time derived from that candidate tempo), and estimates one of the candidate tempos as the tempo of the audio signal based on the cross-correlation between each frame and its adjacent frame. This estimation method is based on the assumption that, when the candidate tempo is the actual tempo or a multiple of the actual tempo, adjacent frames exhibit similarly shaped waveforms, so that the cross-correlation between them is likely to have a symmetrical shape with a peak at the center. In detail, for each candidate tempo and over the entire audio signal sequence, the tempo estimator 325 may determine whether each frame and its adjacent frame are similar or dissimilar depending on whether the peak of maximum magnitude in the cross-correlation waveform between the two frames occurs within a preset range from the center. Further, the tempo estimator 325 may score each candidate tempo by the number of pairs of adjacent frames determined to be similar, and estimate the candidate tempo with the highest score as the tempo of the audio signal. For example, if, for a candidate tempo, the sequence of the audio signal is divided into 90 frames each eight beats long and these frames fall into 9 different patterns (i.e., there are a total of 8 pattern changes between adjacent frames), the score for that candidate tempo is 81/89 × 100 ≈ 91.
As such, the tempo estimator 325 divides the sequence of the audio signal into frames having a length set according to each candidate tempo, and can estimate one of the candidate tempos as the tempo of the audio signal based on the similarity of each pair of adjacent frames in the sequence.
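The frame-similarity scoring can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the tempo estimator 325: two adjacent frames count as similar when the lag of the largest cross-correlation value lies within `tolerance` samples of the center, the score is the percentage of similar adjacent pairs, and ties between lags are resolved toward the center. The function names and the default tolerance are choices made for this sketch.

```python
from typing import Sequence


def peak_lag(a: Sequence[float], b: Sequence[float]) -> int:
    """Lag at which the cross-correlation of two equal-length frames peaks.
    Ties are resolved in favour of the lag closest to the centre (lag 0)."""
    n = len(a)
    best_lag, best_val = 0, float("-inf")
    for lag in sorted(range(-(n - 1), n), key=abs):
        lo, hi = max(0, -lag), min(n, n - lag)
        val = sum(a[i] * b[i + lag] for i in range(lo, hi))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


def similarity_score(samples: Sequence[float], frame_len: int, tolerance: int) -> float:
    """Percentage of adjacent frame pairs whose correlation peak is near the centre."""
    frames = [samples[s:s + frame_len]
              for s in range(0, len(samples) - frame_len + 1, frame_len)]
    pairs = len(frames) - 1
    if pairs < 1:
        return 0.0
    similar = sum(1 for a, b in zip(frames, frames[1:])
                  if abs(peak_lag(a, b)) <= tolerance)
    return 100.0 * similar / pairs


def choose_tempo(samples: Sequence[float], fs: float, candidates_bpm: Sequence[float],
                 beats_per_frame: int = 8, tolerance: int = 1) -> float:
    """Pick the candidate tempo whose frames are most often similar to their neighbours."""
    def score(bpm: float) -> float:
        frame_len = round(beats_per_frame * 60.0 / bpm * fs)
        return similarity_score(samples, frame_len, tolerance)
    return max(candidates_bpm, key=score)


if __name__ == "__main__":
    # Click every 10 samples at fs = 20 Hz, i.e. every 0.5 s (120 BPM).
    signal = [1.0 if n % 10 == 0 else 0.0 for n in range(960)]
    print(choose_tempo(signal, fs=20.0, candidates_bpm=[100, 120]))
    # Scoring arithmetic from the text: 81 similar pairs out of 89.
    print(round(81 / 89 * 100))
```

For the synthetic click track, frames cut at the 120 BPM candidate line up exactly, while frames cut at the 100 BPM candidate drift by four samples from one frame to the next, so the first candidate scores 100 and the second scores 0.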

According to some embodiments, the tempo estimator 110 may further include a preprocessor 305. The preprocessor 305 is configured to perform certain preprocessing operations before the sequence of samples representing the audio signal in the time domain is converted into the frequency domain spectrum. For example, the operations performed by the preprocessor 305 may include updating the sequence by downsampling the original sequence to an appropriate level based on a desired tempo resolution. This downsampling relies on the fact that, in music-like sounds, the major events by which the tempo is perceived lie in the low frequency range. The downsampled sequence can be converted to the frequency domain more quickly. In addition or alternatively, the preprocessor 305 may update the sequence by performing an absolute value operation on the value of each sample in the sequence. Converting the sequence updated in this way into the frequency domain may provide clearer peaks in the frequency domain than converting the original sequence.
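
The two preprocessing operations can be illustrated with a short sketch. The function name `preprocess`, the integer `factor`, and the use of block averaging as a crude low-pass filter before decimation are assumptions made for this example; the text does not specify how the downsampling is carried out.

```python
from typing import List, Sequence


def preprocess(samples: Sequence[float], factor: int) -> List[float]:
    """Downsample by an integer factor, then take the absolute value.

    Assumption: each output sample is the mean of `factor` consecutive
    input samples; trailing samples that do not fill a block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    n_blocks = len(samples) // factor
    down = [sum(samples[k * factor:(k + 1) * factor]) / factor
            for k in range(n_blocks)]
    # Absolute value operation on each sample of the downsampled sequence.
    return [abs(x) for x in down]
```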

FIG. 4 is a flow chart of an audio signal processing method according to some embodiments.

According to the audio signal processing method 400 shown in FIG. 4, highlight extraction techniques based on the estimated tempo are implemented. For example, the audio signal processing method 400 may be performed by the audio signal processing apparatus 100 described above.

According to the audio signal processing method 400 of FIG. 4, a sequence of samples representing an audio signal in the time domain is received (405). In steps 410 to 435 described below, the tempo of the audio signal is estimated from the received sequence.

In step 410, the sequence is downsampled. In step 415, an absolute value operation is performed on the value of each of the samples in the sequence. In step 420, the preprocessed sequence is converted into a frequency domain spectrum. The converted spectrum may be a spectrum representing the magnitude of the audio signal in the frequency domain. In step 425, a plurality of copies of the frequency domain spectrum are generated. The plurality of copy spectra may be obtained by scaling the frequency axis of the original spectrum according to each of a plurality of preset magnifications. In step 430, an overlap spectrum is obtained by adding the plurality of scaled copy spectra, including the original spectrum. In step 435, the tempo of the audio signal is estimated according to the peak frequency detected in the overlap spectrum. According to some embodiments, a plurality of peak frequencies may be detected in the overlap spectrum in step 435. In these embodiments, the tempo of the audio signal may be estimated through the following time domain analysis. First, candidate tempos are selected according to the plurality of detected peak frequencies. The sequence is then divided into frames of a length set according to each candidate tempo. One of the candidate tempos may be estimated as the tempo of the audio signal based on the similarity between each frame and its adjacent frame (e.g., similarity or dissimilarity determined according to the cross-correlation between the two frames).
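
Steps 420 to 435 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiments: the function names, the magnifications (1, 2, 4), the 60 to 180 BPM search range, the removal of the mean, and the choice to read the copy scaled by m at frequency f as the original magnitude at m·f are all assumptions made for the example. The magnitude is evaluated directly on an integer BPM grid rather than with a fast transform, which is practical only for short, downsampled sequences.

```python
import cmath
import math
from typing import Sequence


def magnitude_at(samples: Sequence[float], rate_hz: float, freq_hz: float) -> float:
    """Magnitude of the discrete-time Fourier transform at a single frequency."""
    w = -2j * math.pi * freq_hz / rate_hz
    return abs(sum(x * cmath.exp(w * n) for n, x in enumerate(samples)))


def estimate_bpm(samples: Sequence[float], rate_hz: float,
                 bpm_min: int = 60, bpm_max: int = 180,
                 magnifications: Sequence[int] = (1, 2, 4)) -> int:
    """Return the BPM at which the overlap spectrum peaks within the search range."""
    if not samples:
        raise ValueError("empty sequence")
    mean = sum(samples) / len(samples)
    centred = [x - mean for x in samples]  # assumption: remove the DC offset
    best_bpm, best_value = bpm_min, -1.0
    for bpm in range(bpm_min, bpm_max + 1):
        f = bpm / 60.0  # beats per second
        # Overlap spectrum: sum of the original spectrum (m = 1) and its
        # frequency-scaled copies, all read at the same frequency f.
        overlap = sum(magnitude_at(centred, rate_hz, m * f) for m in magnifications)
        if overlap > best_value:
            best_bpm, best_value = bpm, overlap
    return best_bpm
```

For instance, a 20-second click train sampled at 50 Hz with one click every 25 samples (two clicks per second) is estimated at 120 BPM by this sketch, because the fundamental and its harmonics at two and four times that frequency add up at the same point of the overlap spectrum.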

After the tempo of the audio signal is estimated, a plurality of sub-sequences, each having a duration set according to the estimated tempo, are derived from the sequence (440). For example, when the tempo is estimated in units of BPM, the duration may be an integer multiple (e.g., 16 times) of the beat, and each sub-sequence may be spaced one beat apart from the adjacent sub-sequence.
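
As a small worked example of step 440, the sketch below converts an estimated tempo into a beat length, a sub-sequence duration, and the start positions of the sub-sequences, all in samples. The function name, the rounding of the beat length to a whole number of samples, and the default of 16 beats per sub-sequence (the example value mentioned above) are assumptions.

```python
from typing import List, Tuple


def subsequence_layout(n_samples: int, rate_hz: float, bpm: float,
                       beats_per_window: int = 16) -> Tuple[int, int, List[int]]:
    """Return (beat length, window length, start indices), all in samples."""
    beat = round(rate_hz * 60.0 / bpm)   # samples per beat (assumed rounding)
    if beat < 1:
        raise ValueError("tempo too fast for this sampling rate")
    window = beats_per_window * beat     # duration: integer multiple of the beat
    # Adjacent sub-sequences start one beat apart, so they partially overlap.
    starts = list(range(0, n_samples - window + 1, beat))
    return beat, window, starts
```

With 1000 samples at 100 Hz and an estimated tempo of 120 BPM, a beat spans 50 samples, each sub-sequence spans 800 samples, and the sub-sequences start at samples 0, 50, 100, 150 and 200.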

An energy level corresponding to each sub-sequence is then calculated (445). This calculation may be performed by obtaining the energy of the samples within a sliding window while moving the sliding window, which has the same length as the above duration, over the samples in the sequence. The interval by which the sliding window moves may be equal to the time difference by which each sub-sequence is spaced apart from the adjacent sub-sequence.

Based on the calculated energy levels, one of the plurality of sub-sequences, e.g., the sub-sequence with the largest energy level, is extracted (450) as the highlight portion of the audio signal.
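
Steps 445 and 450 can be sketched as follows. The function names are hypothetical, and defining the energy of a window as the sum of its squared sample values is an assumption, since the text only refers to the energy of the samples. Prefix sums are used so that each window is evaluated in constant time.

```python
from typing import List, Sequence, Tuple


def window_energies(samples: Sequence[float], window: int, hop: int) -> List[float]:
    """Energy of each sliding window of `window` samples, moved `hop` samples at a time."""
    prefix = [0.0]
    for x in samples:
        prefix.append(prefix[-1] + x * x)  # running sum of squared samples
    return [prefix[s + window] - prefix[s]
            for s in range(0, len(samples) - window + 1, hop)]


def extract_highlight(samples: Sequence[float], window: int, hop: int) -> Tuple[int, int]:
    """Return (start, end) sample indices of the sub-sequence with the largest energy."""
    energies = window_energies(samples, window, hop)
    if not energies:
        raise ValueError("sequence is shorter than one window")
    best = max(range(len(energies)), key=energies.__getitem__)
    start = best * hop
    return start, start + window
```

Here `window` would be the duration of a sub-sequence and `hop` the spacing between adjacent sub-sequences, both expressed in samples.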

According to some embodiments, the order in which at least some steps of the audio signal processing method 400 are performed may be changed. According to some embodiments, at least some steps of the audio signal processing method 400 may be performed in combination with each other, omitted, or divided into sub-steps. According to some embodiments, one or more steps not shown in FIG. 4 may be performed in addition to the audio signal processing method 400.

FIGS. 5 to 7 show screenshots of an audio player in accordance with certain embodiments.

An exemplary audio player may be implemented on a computing device that includes the audio signal processing apparatus 100 described above and a processor configured to provide the highlight portion extracted by the audio signal processing apparatus 100 and to visually represent the highlight portion. Alternatively, the audio player may be implemented on a computing device that stores computer-executable instructions configured to perform the audio signal processing method 400 described above, together with computer-executable instructions configured to provide the highlight portion extracted according to the audio signal processing method 400 and to visually represent the highlight portion. A media file representing time domain samples of the audio signal may be input to the audio player. Furthermore, the audio player can receive a list indicating a collection of media files, or a location where the media files are stored, and play each media file obtained therefrom.

According to the screen shot 500 of FIG. 5, a time domain waveform 510 of an audio signal is provided on a display (e.g., a touch-sensitive screen) of a computing device. In addition, a play button 520, a highlight button 530, a next track button 540, a previous track button 550 and a repeat play button 560 are provided on the display. User input selecting the play button 520, the highlight button 530, or the repeat play button 560 (e.g., a left mouse click or a tap with a finger or stylus) causes the audio player to toggle between activating and deactivating the play mode, the highlight mode, or the repeat play mode, respectively. User input selecting the next track button 540 or the previous track button 550 (e.g., a left mouse click or a tap with a finger or stylus) causes the audio player to play the next or the previous audio signal, respectively. Other user input on the next track button 540 or the previous track button 550 (e.g., the user pressing and holding the next track button 540 or the previous track button 550 provided on a touch-sensitive screen) causes the audio player to activate the fast forward mode or the rewind mode, respectively. Of course, the above modes may be activated or deactivated, and the next or previous audio signal may be played, in other ways as well.

When the play mode is activated, the audio signal is played. As the audio signal is played, the cursor 570 indicating the playback position moves. After the entire audio signal has been played, the next audio signal may be played. In the screen shot 500 of FIG. 5, the cursor 570 indicates that the position at 2 minutes 2.88 seconds is being played in an audio signal 7 minutes 11.31 seconds long. When the fast forward mode is activated, the audio signal is played faster than in the play mode. When the cursor 570 reaches the end of the audio signal, the next audio signal may be played. When the rewind mode is activated, the audio signal is played in reverse. When the cursor 570 reaches the beginning of the audio signal, the previous audio signal may be played. When the repeat play mode is activated, playback of the audio signal is repeated continuously.

When the highlight mode is activated while the audio player is playing the audio signal or while playback of the audio signal is stopped, the cursor 570 moves to the starting point of the highlight portion obtained by the operation of the audio signal processing apparatus 100 or the performance of the audio signal processing method 400. An indication representing such a highlight portion may be recorded in a tag of the media file representing the audio signal or stored in an external database. The user of the audio player may replace the extracted highlight portion with another highlight portion through user input. In that case, an indication representing the new highlight portion may likewise be stored in a tag of the media file or in an external database.

If the play mode is activated together with the highlight mode, the highlight portion is played. FIG. 6 shows a screenshot provided on the display when the highlight mode is activated. In screen shot 600, the cursor 570 indicates that the position at 3 minutes 32.67 seconds of the audio signal is being played. The highlight portion 610 to be played is displayed on the time domain waveform 510 of the audio signal with emphasis, unlike the other portions. While the highlight mode is activated, when playback of the highlight portion 610 of the audio signal finishes, either the highlight portion 610 of the same audio signal is played again or the highlight portion of the next audio signal is played, depending on whether the repeat play mode is activated or deactivated. Therefore, when the highlight mode is activated but the repeat play mode is deactivated, the highlight portions of consecutive audio signals may be played one after another. Meanwhile, an interface (e.g., a pop-up menu or a slider) for receiving user input that sets the length of the highlight portion 610 of the audio signal may be provided on the display in response to another user input (e.g., a right mouse click or a double tap with a finger or stylus). The user of the audio player can select or adjust the length of the highlight portion 610 on this interface. This selection or adjustment may be performed in units of measures (bars) of the audio signal, which may be derived from the estimated tempo of the audio signal. When the length of the highlight portion 610 is changed according to the user's selection or adjustment, the changed highlight portion 610 may be provided on the display and played from its beginning. The highlight portion of the next audio signal or the previous audio signal may also have the same length as the changed highlight portion 610.

FIG. 7 shows a screenshot provided on the display when a user input is received on a portion other than the highlight portion during playback of the highlight portion of the audio signal. In screen shot 700, in response to the user input, the cursor 570 has moved to the portion on which the input was received, and the audio signal is played from that portion. In the screen shot 700, the highlight button 530 is still active, but the time domain waveform 510 of the audio signal is provided on the display without the emphasis on the highlight portion. However, if playback of the next audio signal or the previous audio signal is instructed while the highlight mode is activated (for example, in response to a user input on the next track button 540 or the previous track button 550), the highlight portion of that audio signal is played.

According to certain embodiments, a computer-readable recording medium is provided that includes a program for performing the methods described herein (e.g., the audio signal processing method 400) on a computer. Such a computer-readable recording medium may include program instructions, local data files, local data structures, and the like, alone or in combination. The computer-readable recording medium may be one specially designed and configured for the present invention. Examples of such computer-readable recording media include magnetic media such as hard disks, floppy disks and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter.

While the exemplary embodiments of the present invention have been described in detail above, those skilled in the art will appreciate that various modifications can be made to the above-described embodiments without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims below and equivalents thereof.

[Description of Reference Numerals]

100: audio signal processing device

110: tempo estimator

120: energy level calculator

130: highlight extractor

Claims

An audio signal processing device comprising:

a tempo estimator configured to estimate a tempo of an audio signal from a sequence of samples representing the audio signal in a time domain;

an energy level calculator configured to derive a plurality of partially overlapping sub-sequences from the sequence and to calculate energy levels respectively corresponding to the plurality of sub-sequences; and

a highlight extractor configured to extract one of the plurality of sub-sequences as a highlight portion of the audio signal based on the energy levels,

wherein each sub-sequence has a time duration set according to the estimated tempo, and each energy level represents the energy of the samples of the sub-sequence corresponding to that energy level.
The audio signal processing device according to claim 1,

wherein the tempo estimator is further configured to estimate the tempo in units of beats per minute (BPM), and the duration is an integer multiple of a beat according to the estimated tempo.
The audio signal processing device according to claim 1,

wherein a start sample in each sub-sequence is spaced apart from a start sample in a sub-sequence adjacent to that sub-sequence by a time difference set according to the estimated tempo, and the duration is an integer multiple of the time difference.
The audio signal processing device according to claim 3,

wherein the energy level calculator is further configured to calculate the energy level corresponding to each sub-sequence by obtaining the energy of the samples covered by a sliding window while moving the sliding window by the time difference relative to the samples in the sequence, the sliding window having a length equal to the duration.
The audio signal processing device according to claim 1,

wherein the highlight extractor is further configured to extract, as the highlight portion, the sub-sequence having the largest corresponding energy level among the sub-sequences.
The audio signal processing device according to claim 1,

wherein the tempo estimator comprises:

a time-to-frequency domain conversion unit configured to convert the sequence into a frequency domain spectrum;

a copy generator configured to obtain a plurality of scaled copies of the frequency domain spectrum by scaling the frequency axis of the frequency domain spectrum according to each of a plurality of preset magnifications;

a spectrum adder configured to obtain an overlap spectrum by adding the plurality of scaled copies, including the frequency domain spectrum; and

a tempo estimator configured to estimate the tempo by detecting one or a plurality of peak frequencies in the overlap spectrum.
The audio signal processing device according to claim 6,

wherein the tempo estimator is further configured to detect the one or the plurality of peak frequencies within a preset frequency region of the overlap spectrum.
The audio signal processing device according to claim 6,

wherein the frequency domain spectrum is a magnitude spectrum of the audio signal in the frequency domain.
The audio signal processing device according to claim 6,

wherein the tempo estimator further comprises a preprocessor configured to update the sequence, before the conversion, by performing at least one of downsampling the sequence and applying an absolute value operation to the value of each of the samples in the sequence.
The audio signal processing device according to claim 6,

wherein the tempo estimator is further configured to select candidate tempos according to the detected plurality of peak frequencies, to divide the sequence into frames having a length set according to each candidate tempo, and to estimate one of the candidate tempos as the tempo based on the similarity between each frame and the frame adjacent to it.
An audio signal processing method comprising:

estimating a tempo of an audio signal from a sequence of samples representing the audio signal in a time domain;

deriving a plurality of partially overlapping sub-sequences from the sequence and calculating energy levels respectively corresponding to the plurality of sub-sequences; and

extracting one of the plurality of sub-sequences as a highlight portion of the audio signal based on the energy levels,

wherein each sub-sequence has a duration set according to the estimated tempo, and each energy level represents the energy of the samples of the sub-sequence corresponding to that energy level.
The method according to claim 11,

wherein the estimating comprises estimating the tempo in units of beats per minute (BPM), and the duration is an integer multiple of a beat according to the estimated tempo.
The method according to claim 11,

wherein a start sample in each sub-sequence is spaced apart from a start sample in a sub-sequence adjacent to that sub-sequence by a time difference set according to the estimated tempo, and the duration is an integer multiple of the time difference.
The method according to claim 13,

wherein the calculating comprises calculating the energy level corresponding to each sub-sequence by obtaining the energy of the samples covered by a sliding window while moving the sliding window by the time difference relative to the samples in the sequence, the sliding window having a length equal to the duration.
The method according to claim 11,

wherein the extracting comprises extracting, as the highlight portion, the sub-sequence having the largest corresponding energy level among the sub-sequences.
The method according to claim 11,

wherein the estimating comprises:

converting the sequence into a frequency domain spectrum;

obtaining a plurality of scaled copies of the frequency domain spectrum by scaling the frequency axis of the frequency domain spectrum according to each of a plurality of preset magnifications;

obtaining an overlap spectrum by adding the plurality of scaled copies, including the frequency domain spectrum; and

estimating the tempo by detecting one or a plurality of peak frequencies in the overlap spectrum.
The method according to claim 16,

wherein detecting the one or the plurality of peak frequencies comprises detecting the one or the plurality of peak frequencies within a preset frequency region of the overlap spectrum.
The method according to claim 16,

wherein the frequency domain spectrum is a magnitude spectrum of the audio signal in the frequency domain.
The method according to claim 16,

further comprising updating the sequence, prior to the conversion, by performing at least one of downsampling the sequence and applying an absolute value operation to the value of each of the samples in the sequence.
The method according to claim 16,

wherein estimating the tempo by detecting the one or the plurality of peak frequencies comprises:

selecting candidate tempos according to the detected plurality of peak frequencies;

dividing the sequence into frames of a length set according to each candidate tempo; and

estimating one of the candidate tempos as the tempo based on the similarity between each frame and the frame adjacent to it.
A computer-readable storage medium having stored thereon computer-executable instructions configured to perform the method of any one of claims 11 to 20 when executed by a processor.