WO2010146624A1 - Time-scaling method for voice signal processing device, pitch shift method for voice signal processing device, voice signal processing device, and program - Google Patents

Time-scaling method for voice signal processing device, pitch shift method for voice signal processing device, voice signal processing device, and program

Info

Publication number
WO2010146624A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
time scaling
time
signal processing
processing device
Prior art date
Application number
PCT/JP2009/002711
Other languages
French (fr)
Japanese (ja)
Inventor
古川善久
Original Assignee
パイオニア株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by パイオニア株式会社
Priority to PCT/JP2009/002711
Publication of WO2010146624A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H7/00: Instruments in which the tones are synthesised from a data store, e.g. computer organs
    • G10H7/008: Means for controlling the transition from one tone waveform to another
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04: Time compression or expansion
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/375: Tempo or beat alterations; Music timing control
    • G10H2210/385: Speed change, i.e. variations from preestablished tempo, tempo change, e.g. faster or slower, accelerando or ritardando, without change in pitch
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00: Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/025: Envelope processing of music signals in, e.g. time domain, transform domain or cepstrum domain
    • G10H2250/035: Crossfade, i.e. time domain amplitude envelope control of the transition between musical sounds or melodies, obtained for musical purposes, e.g. for ADSR tone generation, articulations, medley, remix

Definitions

  • the present invention relates to a time scaling method for an audio signal processing device that performs time scaling using a crossfade method, a pitch shift method for an audio signal processing device, an audio signal processing device, and a program.
  • a cross-fade method is known as a time scaling technique for extending and compressing the length on the time axis without changing the pitch of a digital voice waveform.
  • SRC: Sampling Rate Conversion
  • When this SRC processing is performed with a sampling-frequency change that cancels the time scaling amount of the crossfade method, and playback is performed at the sampling frequency of the original digital audio, the time length is restored while the pitch changes, so only the pitch of the original audio waveform is changed.
  • In this way, a pitch shift (key control) that does not change the length on the time axis can be realized.
  • Patent Document 1 describes a method of detecting the period of an input audio signal and shifting the signal by an integral multiple of the period to perform crossfading. By this method, there is an effect that the “beat” of the sound due to the phase shift at the time of crossfading can be reduced.
  • “beat” refers to the degraded sound that occurs when melody-type music containing vibrato, tremolo, echo, or the like is crossfaded.
  • the cross-fade method generally has an advantage that the processing amount is smaller than the FFT method and the pitch shift can be realized relatively easily.
  • however, the sound quality is inferior to that of the FFT method: time stretching causes degradation such as “double hits” and “beat”.
  • the above-mentioned Patent Document 1 reduces the “beat”, which is one of the sound quality deterioration phenomena that occur in the crossfade method, by detecting one period of the audio signal and using it for the crossfade processing. Since only one period is detected, the mitigation effect is insufficient.
  • besides the above “beat”, the crossfade method has other sound quality deterioration phenomena. For example, in rhythmic music (music containing percussion instrument sound), crossfading in an area where percussion instrument sound occurs causes “double hits”, “click noise”, and “rhythm disturbance”.
  • the above-mentioned patent document 1 does not consider these problems.
  • an object of the present invention is to provide a time scaling method for an audio signal processing device that uses the crossfade method while reducing the deterioration of sound quality through time scaling suited to the music, a pitch shift method for an audio signal processing device that uses this time scaling method and SRC processing to realize a pitch shift suited to the music, an audio signal processing device, and a program.
  • the time scaling method of the audio signal processing device of the present invention is a time scaling method in which the audio signal processing device performs time scaling on an acquired audio signal using the crossfade method. The audio signal processing device executes a prohibited area detection step of analyzing the acquired audio signal and detecting a prohibited area affected by percussion instrument sound, and a first time scaling step of performing time scaling by crossfading in areas other than the prohibited area when the prohibited area is detected in the prohibited area detection step.
  • the audio signal processing apparatus of the present invention includes an audio signal acquisition unit that acquires an audio signal, and a time scaling unit that performs time scaling on the acquired audio signal using the crossfade method.
  • the time scaling unit has prohibited area detecting means for analyzing the acquired audio signal and detecting a prohibited area affected by percussion instrument sound, and first time scaling means for performing time scaling by crossfading in areas other than the prohibited area when the prohibited area is detected by the prohibited area detecting means.
  • a prohibited area (crossfade prohibited area) affected by percussion instrument sound is detected, and when the prohibited area is detected, time scaling is performed in areas other than the prohibited area, so it is possible to reduce the deterioration in sound quality that is a drawback of the crossfade method.
  • the prohibited area affected by the percussion instrument sound is a portion where the sound quality is likely to change due to time scaling (a time-axis operation such as expansion or compression) by crossfading. Therefore, in the case of rhythmic music that includes percussion instrument sounds, performing time scaling while avoiding the prohibited areas makes it possible to eliminate sound quality deterioration factors such as “double hits”, “click noise”, and “rhythm disturbance”.
  • “when a prohibited area is detected” may indicate a case where a prohibited area is detected in a part of a song, or a predetermined number or more of prohibited areas are detected in the entire song. It may be a case that indicates.
  • the user may be able to set “partial range of music” in the former case and “predetermined number” in the latter case.
  • time scaling is performed by cross fading in a region sandwiched between two prohibited regions.
  • the position of the prohibited area on the time axis can be made the same in the original music data and in the expanded or compressed data after time scaling. Since the prohibited area is an area including a rhythm sound, keeping this position the same before and after time scaling prevents rhythm disturbance due to time scaling.
  • the “one or more” in “one or more converted signals Bi strongly influenced by percussion instrument sound among the plurality of converted signals Bi” may be a predetermined number or an arbitrary number (any detectable number).
  • the predetermined number when used, the numerical value may be set by the user.
  • when the prohibited area is not detected in the prohibited area detection step, the audio signal processing device further executes a second time scaling step: it performs frequency analysis on the acquired audio signal to detect a plurality of frequencies at which the amplitude peaks and, from among time candidates that are integer multiples of the period of the frequency with the largest amplitude, performs time scaling by crossfading with the time scaling amount corresponding to the candidate whose difference from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes is smallest.
  • when the prohibited area is not detected, frequency analysis is performed to detect a plurality of frequencies at which the amplitude peaks (peak frequencies), and time scaling is performed with a time scaling amount based on those frequencies, so deterioration in sound quality can be reduced. That is, the case where the prohibited area is not detected means that the music is melody-type music that does not include percussion instrument sounds. In general, the “beat” of the sound becomes a problem when melody-type music is crossfaded, so performing time scaling by a time corresponding to the period (or an integer multiple of it) of a peak frequency (for example, the one with the largest amplitude) makes it possible to reduce the “beat”.
  • a plurality of peak frequencies are detected, and time scaling is performed with the time scaling amount corresponding to the candidate, among time candidates that are integer multiples of the period of the frequency with the largest amplitude, whose difference from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes is smallest, so the occurrence of “beat” can be further reduced. In addition, the effect of reducing “beat” is considered to be higher when as many peak frequencies as possible are detected.
  • the number of detections may be a predetermined number or an arbitrary number (detectable number). In addition, when the predetermined number is used, the numerical value may be set by the user.
  • Another audio signal processing apparatus time scaling method of the present invention is an audio signal processing apparatus time scaling method for performing time scaling on an acquired audio signal using a cross-fade method.
  • the percussion instrument sound determination step for analyzing the acquired sound signal and determining the presence or absence of percussion instrument sound
  • the amplitude is obtained by frequency analysis of the acquired sound signal.
  • Another audio signal processing apparatus of the present invention includes an audio signal acquisition unit that acquires an audio signal, and a time scaling unit that performs time scaling on the acquired audio signal using a cross-fade method.
  • the time scaling unit has percussion instrument sound discriminating means for analyzing the acquired audio signal and determining the presence or absence of percussion instrument sound, and second time scaling means that, when the percussion instrument sound discriminating means determines that no percussion instrument sound is present, performs frequency analysis on the acquired audio signal to detect a plurality of frequencies at which the amplitude peaks and, from among time candidates that are integer multiples of the period of the frequency with the largest amplitude, performs time scaling by crossfading with the time scaling amount corresponding to the candidate whose difference from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes is smallest.
  • a time scaling amount is calculated by an evaluation function weighted by an amplitude ratio of a plurality of frequencies.
  • the “beat” of the sound can be reduced more effectively.
  • in the pitch shift method for an audio signal processing device of the present invention, the audio signal processing device executes a sampling rate conversion step of performing sampling rate conversion before or after the steps of the time scaling method of the audio signal processing device described above. The time length change of the audio signal caused by the time scaling and the sampling rate conversion is canceled, and only the pitch is changed.
  • the audio signal processing apparatus described above further includes sampling rate conversion means for performing sampling rate conversion, and the sampling rate conversion means cancels the time length change of the audio signal due to time scaling and sampling rate conversion, and only the pitch is obtained. It is characterized by being changed.
  • the program of the present invention is for causing a computer to execute each step in the time scaling method of the audio signal processing device described above.
  • Another program of the present invention is for causing a computer to execute each step in the pitch shift method of the audio signal processing apparatus described above.
  • FIG. 1 is a block diagram of a playback apparatus according to an embodiment of the present invention and an audio signal processing unit that is a part of the playback apparatus.
  • FIG. It is a flowchart which shows the time scaling process and pitch shift process by an audio
  • FIG. 1A is a simplified block diagram of the playback device 1.
  • the playback device 1 includes a playback unit 2, an audio signal processing unit 3 (audio signal processing device), a buffer memory 4, and an audio signal output unit 5.
  • the reproducing unit 2 reads out and reproduces music from a device such as a CD.
  • the audio signal processing unit 3 is composed of a CPU (Central Processing Unit) or a DSP (Digital Signal Processor). It stores the audio signal reproduced by the reproduction unit 2 in the buffer memory 4 and performs digital signal processing on the audio signal read from the buffer memory 4.
  • the audio signal output unit 5 outputs the audio signal processed by the audio signal processing unit 3 to the outside (such as an output device having an amplifier and a speaker).
  • FIG. 1B is a block diagram showing a functional configuration of the audio signal processing unit 3.
  • the audio signal processing unit 3 includes an audio signal acquisition unit 9 and a time scaling unit 10 as main functional configurations.
  • the audio signal acquisition unit 9 acquires an audio signal to be processed from the buffer memory 4.
  • the time scaling unit 10 includes a prohibited area detection unit 11, a first time scaling unit 12, and a second time scaling unit 13.
  • the time scaling means 10 of this embodiment is characterized by performing time scaling using a cross-fade method. A specific example of the cross fade will be described later with reference to FIGS.
  • the forbidden region N is detected using one or more conversion signals Bi having a large influence of the percussion instrument sound among the plurality of conversion signals Bi.
  • This prohibited area N is a portion where the sound quality is likely to change under time scaling (a time-axis operation such as expansion or compression) by crossfading, that is, where “double hits”, “click noise”, and “rhythm disturbance” are likely to occur. For this reason, it is an area where crossfading is prohibited.
  • when the prohibited area N is detected by the prohibited area detection unit 11, the first time scaling means 12 performs time scaling by crossfading in an area other than the prohibited area N (in this embodiment, a target area O sandwiched between two prohibited areas N; see FIG. 8). Note that “when the prohibited area N is detected by the prohibited area detection unit 11” means that the acquired audio signal is based on a rhythmic musical piece. That is, in the case of rhythm music, the first time scaling means 12 performs time scaling in an appropriate area (target area O) while avoiding the prohibited area N affected by the percussion instrument sound.
  • the prohibited area N may be detected for a part of the musical piece, or may be detected for the entire musical piece. Further, when a predetermined number or more is detected, it may be considered that the prohibited area N has been detected. In addition, the user may be able to set “partial range” when detecting a partial range of music and “predetermined number” serving as a detection reference.
  • when the prohibited area N is not detected by the prohibited area detection unit 11, the second time scaling means 13 performs frequency analysis on the acquired audio signal to detect a plurality of frequencies at which the amplitude peaks (peak frequencies), and performs time scaling by crossfading with a time scaling amount based on those frequencies. Specifically, from among time candidates that are integer multiples of the period of the peak frequency with the largest amplitude, it uses the time scaling amount corresponding to the candidate whose difference from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes is smallest. As a result, it is possible to effectively reduce the “beat” of the sound caused by the phase shift at the time of crossfading.
  • Note that “when the prohibited area N is not detected by the prohibited area detection unit 11” means that the acquired audio signal is based on melody-type music. That is, in the case of melody music, the second time scaling means 13 performs time scaling with an appropriate time scaling amount based on a plurality of peak frequencies obtained by frequency analysis.
  • FIG. 1C is a block diagram when the pitch shift means 20 is realized by adding the SRC means 21 to the time scaling means 10 of FIG.
  • the SRC means 21 performs SRC processing for changing the sampling frequency of the digital speech waveform before or after the time scaling processing by the time scaling means 10.
  • the pitch shift means 20 cancels the time length change of the audio signal between the time scaling means 10 and the SRC means 21, and thereby realizes a pitch shift that changes only the pitch without changing the length on the time axis.
  • when the audio signal processing unit 3 acquires the audio signal (S01), it divides the signal into bands and generates a plurality of converted signals Bi (S02).
  • the forbidden area N is detected using one or more conversion signals Bi having a large influence of the percussion instrument sound among the plurality of conversion signals Bi (S03).
  • S02 and S03 are processing steps performed by the prohibited area detection unit 11.
  • the music type is determined according to whether or not the prohibited area N is detected (S04). As described above, when the prohibited area N is detected, it is determined that the music is a rhythmic music, and time scaling is performed by the first time scaling means 12 (S05). On the other hand, if the prohibited area N is not detected, it is determined that the music is a melody, and time scaling is performed by the second time scaling means 13 (S06). Note that S04 to S06 are processing steps by the first time scaling means 12 and the second time scaling means 13.
  • an SRC process (S22) is added before the time scaling process (S10).
  • the audio signal processing unit 3 acquires the audio signal (S21)
  • the audio signal processing unit 3 changes the pitch by the sampling rate conversion technique according to the pitch shift amount (S22).
  • the time length is then returned to that of the original audio signal by the time scaling process (S10, excluding S01 in FIG. 2A), so that only the pitch is shifted, without time expansion or contraction.
  • the pitch shift amount may be an amount specified by a user operation, or may be an amount automatically calculated according to the pitch of another song.
  • the SRC process (S22) is performed before the time scaling process (S10); however, the SRC process (S22) may instead be performed after the time scaling process (S10).
  • FIG. 3 is a diagram showing a specific example in which original music data is expanded by crossfading.
  • “Crossfade Example 2” shown in FIG. 3B shows a case where the increase is more than just the crossfade portion.
  • reproduction time of signal C reproduction time of signal A.
  • in the expanded data, after signals A and B are reproduced, a crossfade (fade-out of signal C and fade-in of signal A) is performed, and then signals B and C are reproduced in that order.
  • the increase (time scaling amount) in this case is “cross fade portion + reproduction time of signal B”.
  • FIG. 4 is a diagram showing a specific example when the original music data is compressed by cross-fading.
  • the case of crossfading is shown using.
  • “Crossfade Example 4” shown in FIG. 4B shows a case where the decrease is not only the crossfade portion.
  • the decrease is “reproduction time of signal B + reproduction time of signal C”.
  • FIG. 5 is a diagram illustrating an example of band division using DWT (Discrete Wavelet Transform).
  • FIG. 5A shows the original sound.
  • DWT Discrete Wavelet Transform
  • FIG. 7B shows the result of band division by the DWT of about 33 times and the IDWT (inverse discrete wavelet transform) of about 330 times for the original sound shown in FIG.
  • ten converted signals Bi (B1 to B10) are generated by dividing into ten frequency bands (band 1 to band 10).
  • FIG. 6 is a flowchart showing prohibited area detection processing by the audio signal processing unit 3 (prohibited area detection means 11).
  • the audio signal processing unit 3 detects and holds the peak position affected by the percussion instrument sound for the ten converted signals Bi (S11), and performs binarization processing (S12).
  • the range determined as “1” by the binarization process is provisionally determined as the prohibited area N, and then gap filling is performed (S13).
  • the two prohibited areas N are combined into one prohibited area N.
  • the forbidden areas N detected by the respective conversion signals Bi are synthesized (OR operation) (S14), and finally the forbidden area N where the crossfade should be prohibited is determined (S15).
  • FIG. 7 is a diagram showing a concept of a method for detecting the prohibited area N.
  • FIG. 7A shows the original sound, as in FIG. 5A.
  • the dotted frame D1 is a hi-hat hit (kick hit) and is considered to correspond to a beat position.
  • the dotted frame D2 is a hi-hat hit and is considered to correspond to the back beat position. Therefore, the prohibited area N is determined based on the bands in which the sound is loud in the areas (detection areas) corresponding to the dotted frames D1 and D2 (in this embodiment, band 2 and band 7; see FIG. 5B).
  • FIG. 7B shows a state in which the prohibited area N is provisionally determined from the two converted signals Bi that are most influenced by the percussion instrument sound, that is, from the bands in which the sound is loud in the areas corresponding to the dotted frames D1 and D2. FIG. 7C shows a state in which the provisionally determined prohibited areas N of the two bands are synthesized and the prohibited area N is finally determined.
  • the prohibited area N detected from band 7 is completely included in the prohibited area N detected from band 2, so detection in band 7 may seem unnecessary. However, since there may be cases where there is no hi-hat, it is preferable to determine the prohibited area N based on a plurality of bands as described above.
  • the prohibited area N is provisionally determined from two converted signals Bi as the converted signals Bi most affected by the percussion instrument sound, but the number of converted signals Bi is not limited to this.
  • the number of conversion signals Bi to be detected may be a predetermined number or an arbitrary number (detectable number). In addition, when the predetermined number is used, the numerical value may be set by the user.
  • FIG. 8 is a diagram showing a specific example of the prohibited area N detected for a rhythmic musical piece.
  • the forbidden area N is detected as the beat appears.
  • in the first time scaling means 12 of this embodiment, time scaling is performed in the target area O sandwiched between the prohibited areas N, thereby reducing the sound quality deterioration due to the crossfade.
  • the second time scaling means 13 eliminates the “beat” with respect to the peak frequency by time scaling shifted by an integral multiple of the peak period.
  • FIG. 9 is FFT (Fast Fourier transform) data in which the horizontal axis represents frequency and the vertical axis represents intensity, and shows one of a large number of detected samples.
  • the second time scaling unit 13 performs frequency analysis on the acquired audio signal and thereby detects a plurality of frequencies at which the amplitude peaks (peak frequencies). In the present embodiment, it is assumed that three peak frequencies are detected in descending order of amplitude (intensity).
  • the number of peak frequencies is not limited to three; a predetermined number of four or more may be detected, or the number may be arbitrary (any detectable number). In addition, when the predetermined number is used, the numerical value may be set by the user. However, the effect of reducing deterioration in sound quality due to crossfading is considered to be higher when as many peak frequencies as possible are detected.
  • FIG. 10 is a diagram illustrating a concept of a method for determining a time scaling amount.
  • an appropriate time scaling amount for one operation is determined. Specifically, the time that is an integer multiple of the period T1 and has the smallest error from the periods T2 and T3 is taken as the time scaling amount for one operation.
  • the blocks corresponding to the periods T1, T2, and T3 are arranged period by period, and the positions corresponding to integer multiples of the period T1 are listed together with the start points.
  • (time) is indicated by L1 to L9, it is in the range (range in which crossfading is possible) from the upper limit position of crossfade (20 ms in the example in the figure) to the lower limit position (50 ms in the example in the figure).
  • the time scaling amount TCF is calculated using the evaluation function shown below.
  • FIG. 11 is a diagram illustrating the evaluation function for determining the time scaling amount TCF. As shown in the figure, this evaluation function is weighted by the ratio of the peak amplitudes An (the amplitudes of the peak frequencies).
  • the CF upper limit position t1 and the CF lower limit position t2 are values obtained from empirical rules.
  • a prohibited area N (crossfade prohibited area) affected by the percussion instrument sound is detected, and appropriate time scaling can be performed according to the music type determined by the presence or absence of the prohibited area N.
  • since time scaling is performed while avoiding the prohibited area N, it is possible to eliminate problems such as “double hits”, “click noise”, and “rhythm disturbance”.
  • the relative position of the prohibited area N on the time axis is converted so as to maintain the same ratio as before time scaling. If the pitch is changed by SRC before time scaling, and the resulting time expansion or contraction is returned to the length of the original song by time scaling, the positions of the prohibited areas N on the time axis can be made the same in the original song data and in the data after the pitch shift.
  • the forbidden area N when the forbidden area N is not detected, it can be determined as a melody music piece.
  • a plurality of peak frequencies with maximum amplitude are detected, and the amplitude is the maximum among the plurality of peak frequencies.
  • Time scaling with a time scaling amount corresponding to the time in which the difference from the integer multiple of each period based on each frequency after the second frequency is the smallest among the time candidates of the integral multiple of the period based on the frequency
  • since the evaluation function for calculating the time scaling amount is weighted in consideration of the amplitude ratio (intensity) of a plurality of peak frequencies, it is possible to reduce the “beat” of the sound more effectively.
  • the band division is performed by DWT when the prohibited area N is detected, the influence of the time delay associated with the filter processing can be reduced as compared with the case where the band division is performed using the filter.
  • the audio signal processing unit 3 includes both the first time scaling unit 12 and the second time scaling unit 13, but it may include only one of them. Further, when the audio signal processing unit 3 is configured to include only the second time scaling means 13, the presence or absence of percussion instrument sound may simply be determined instead of detecting the prohibited area N (percussion instrument sound determination means). In this case, when it is determined that there is no percussion instrument sound, time scaling by the second time scaling means 13 is performed.
  • the prohibited area detection unit 11 detects the prohibited area N while analyzing the audio signal written to the buffer memory 4 along with the reproduction by the reproduction unit 2, but analyzed in advance. Data may be read to detect the prohibited area N. That is, it is good also as a structure which performs time scaling in real time, reproducing
  • each component of the audio signal processing unit 3 described above can be provided as a program, and the program can be provided stored on various recording media (CD-ROM, flash memory, etc.). That is, a program for causing a computer to function as each component of the audio signal processing unit 3 and a recording medium on which the program is recorded are also included in the scope of the right of the present invention.
  • the audio signal processing unit 3 may be applied to DJ equipment such as a mixer device, various electronic musical instruments, and a computer (PC application).
  • Application to a speech processing device having a function of changing the pitch, such as karaoke, a voice changer, or a speech synthesizer, is also useful.
  • time scaling such as when changing only the audio time axis length without changing the pitch during double-speed playback of a video (DVD) recorder.
  • Other modifications can be made as appropriate without departing from the scope of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

Disclosed is a time-scaling method for a voice signal processing device that performs time scaling suited to a musical composition and thereby reduces the sound-quality deterioration caused by cross-fade time scaling. The method time-scales an acquired voice signal using a cross-fade process. The voice signal processing device executes an inhibition region detecting step (S03) that analyzes the acquired voice signal to detect an inhibition region influenced by a percussion instrument, and a first time-scaling step (S05) that, if an inhibition region is detected in the inhibition region detecting step, performs the cross-fade time scaling in a region other than the inhibition region.

Description

Time scaling method for audio signal processing device, pitch shift method for audio signal processing device, audio signal processing device, and program
The present invention relates to a time scaling method for an audio signal processing device that performs time scaling using the crossfade method, a pitch shift method for an audio signal processing device, an audio signal processing device, and a program.
Conventionally, the crossfade method is known as a time scaling technique that stretches or compresses the length of a digital audio waveform on the time axis without changing its pitch.
SRC (Sampling Rate Convert) processing, which changes the sampling frequency of the digital audio waveform before or after the crossfade time scaling, is also known. If the SRC processing is performed with a sampling frequency change that cancels the time scaling amount of the crossfade method, and the result is played back at the sampling frequency of the original digital audio, the pitch changes at the same time. This realizes a pitch shift (key control) that changes only the pitch of the original audio waveform and not its length on the time axis. Besides the combination of SRC and crossfade time scaling, an FFT (fast Fourier transform) method is also known for pitch shifting.
For example, Patent Document 1 describes a method that detects the period of an input audio signal and crossfades after shifting the signal by an integer multiple of that period. This method reduces the “beat” of the sound caused by the phase shift that arises during crossfading. Here, “beat” refers to degraded sounds, such as vibrato, tremolo, and echo, that occur when melody-based music is crossfaded.
Comparing the two methods of realizing the pitch shift, the crossfade method generally requires less processing than the FFT method and can realize the pitch shift relatively easily.
Patent Document 1: Japanese Patent No. 3395560
However, when the crossfade method is applied to pitch shifting, the time stretching causes sound-quality degradation such as double sounding and beat, so the sound quality is inferior to that of the FFT method. Patent Document 1 reduces the “beat”, one of the degradation phenomena of the crossfade method, by detecting one period of the audio signal and using it in the crossfade processing. Because only one period is detected, however, the reduction is insufficient.
Moreover, crossfading has problems other than the “beat”. For example, in rhythm-based music (music containing percussion instrument sounds), crossfading in a region where a percussion instrument sound occurs causes “double sounding”, “click noise”, and “rhythm disturbance”. Patent Document 1 does not consider these problems.
In view of the above problems, an object of the present invention is to provide a time scaling method for an audio signal processing device that uses the crossfade method with reduced sound-quality degradation, a pitch shift method for an audio signal processing device that combines it with SRC processing to realize a pitch shift suited to music, an audio signal processing device, and a program.
The time scaling method for an audio signal processing device of the present invention performs time scaling on an acquired audio signal using the crossfade method. In this method, the audio signal processing device executes a prohibited area detection step of analyzing the acquired audio signal to detect a prohibited area affected by a percussion instrument sound, and a first time scaling step of, when a prohibited area is detected in the prohibited area detection step, performing time scaling by crossfading in an area other than the prohibited area.
The audio signal processing device of the present invention includes audio signal acquisition means for acquiring an audio signal and time scaling means for performing time scaling on the acquired audio signal using the crossfade method. The time scaling means has prohibited area detection means for analyzing the acquired audio signal to detect a prohibited area affected by a percussion instrument sound, and first time scaling means for, when the prohibited area detection means detects a prohibited area, performing time scaling by crossfading in an area other than the prohibited area.
According to these configurations, when time scaling is performed, a prohibited area (crossfade-prohibited area) affected by a percussion instrument sound is detected, and when such an area is detected, time scaling is performed in an area other than the prohibited area. This reduces the sound-quality degradation that is a drawback of the crossfade method. A prohibited area affected by a percussion instrument sound is a portion where the sound quality is easily changed by crossfade time scaling (time-axis operations such as expansion or compression). Therefore, for rhythm-based music containing percussion instrument sounds, performing time scaling while avoiding the prohibited areas can eliminate degradation factors such as “double sounding”, “click noise”, and “rhythm disturbance”.
Note that “when a prohibited area is detected” may mean that a prohibited area is detected within a partial range of the music, or that a predetermined number or more of prohibited areas are detected in the music as a whole. The “partial range” in the former case and the “predetermined number” in the latter case may be user-settable.
In the time scaling method described above, the first time scaling step performs time scaling by crossfading in an area sandwiched between two prohibited areas.
According to this configuration, the positions of the prohibited areas on the time axis can be made the same in the original music data and in the expanded or compressed data after time scaling. Because a prohibited area contains a rhythm sound, keeping its position the same before and after time scaling prevents the time scaling from disturbing the rhythm.
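A minimal sketch of a crossfade time expansion confined to a target area between two prohibited areas may help here. It is not the patented implementation: the function name `crossfade_stretch`, the linear fade, and the parameters `splice`, `shift`, and `fade_len` are all assumptions made for illustration. The sketch only shows that every sample touched by the crossfade lies inside the target area, so the samples of the prohibited areas are copied unmodified.

```python
def crossfade_stretch(x, region, splice, shift, fade_len):
    """Lengthen x by `shift` samples using one linear crossfade.

    region   -- (start, end) of the target area, end exclusive (hypothetical)
    splice   -- index where the crossfade begins
    shift    -- number of samples repeated, i.e. the time scaling amount
    fade_len -- crossfade length in samples
    """
    start, end = region
    if shift <= 0 or fade_len <= 0:
        raise ValueError("shift and fade_len must be positive")
    # Every sample read by the crossfade must lie inside the target area.
    if splice - shift < start or splice + fade_len > end:
        raise ValueError("crossfade would touch a prohibited area")

    out = list(x[:splice])                      # untouched head
    for k in range(fade_len):
        w = (k + 1) / (fade_len + 1)            # weight of the delayed copy
        out.append((1.0 - w) * x[splice + k] + w * x[splice - shift + k])
    out.extend(x[splice - shift + fade_len:])   # delayed tail, copied as-is
    return out
```

With a 20-sample signal, `crossfade_stretch(x, (5, 15), 10, 3, 4)` returns 23 samples; only the four samples starting at index 10 are blended, and a splice point too close to the area boundary raises `ValueError`.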
In the time scaling method described above, the prohibited area detection step divides the acquired audio signal into bands by wavelet transform to generate a plurality of converted signals Bi (where i = 1, ..., n), and detects the prohibited area using one or more of the converted signals Bi on which the influence of the percussion instrument sound is large.
According to this configuration, band division by wavelet transform reduces the influence of the time delay associated with filter processing, compared with band division using filters.
The number corresponding to “one or more” in “one or more of the converted signals Bi on which the influence of the percussion instrument sound is large” may be a predetermined number or an arbitrary number (as many as can be detected). When a predetermined number is used, the value may be user-settable.
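As an illustration of wavelet band division, the following sketch splits a signal into octave bands with a Haar discrete wavelet transform. The choice of the Haar wavelet, the function name `haar_band_split`, and the band ordering are assumptions; the text does not state which wavelet is used, nor how the bands Bi are numbered.

```python
import math


def haar_band_split(signal, levels):
    """Split `signal` into octave bands with an orthonormal Haar DWT.

    Returns [detail_1, ..., detail_levels, approximation], where detail_1 is
    the highest-frequency band. len(signal) must be a multiple of 2**levels.
    """
    if levels < 1 or len(signal) % (2 ** levels) != 0:
        raise ValueError("len(signal) must be a multiple of 2**levels")
    scale = 1.0 / math.sqrt(2.0)
    approx = [float(v) for v in signal]
    bands = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Differences of neighbours form the high band of this level.
        bands.append([(a - b) * scale for a, b in pairs])
        # Sums form the low band, which is split again at the next level.
        approx = [(a + b) * scale for a, b in pairs]
    bands.append(approx)
    return bands
```

A sharp transient such as a drum hit shows up as a large coefficient in the high bands at the matching time position, which is the kind of cue a detector for percussion-affected areas could use.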
In the time scaling method described above, when no prohibited area is detected in the prohibited area detection step, the audio signal processing device further executes a second time scaling step. This step frequency-analyzes the acquired audio signal to detect a plurality of frequencies at which the amplitude is a local maximum. It then performs time scaling by crossfading with a time scaling amount corresponding to the time, among candidate times that are integer multiples of the period of the frequency with the largest amplitude, whose difference from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes is smallest.
According to this configuration, when no prohibited area affected by a percussion instrument sound is detected, frequency analysis is performed to detect a plurality of frequencies at which the amplitude is a local maximum (peak frequencies), and time scaling is performed with a time scaling amount based on those frequencies, so sound-quality degradation can be reduced. The case where no prohibited area is detected means that the music is melody-based music that contains no percussion instrument sound. In general, the “beat” of the sound is the problem when melody-based music is crossfaded, and it can be reduced by time scaling by a time equal to the period (or an integer multiple of the period) of a peak frequency (for example, the one with the largest amplitude). This configuration goes further: among candidate times that are integer multiples of the period of the frequency with the largest amplitude, it selects the time whose difference from integer multiples of the periods of the remaining peak frequencies is smallest, and time-scales by the corresponding amount, so the occurrence of “beat” can be reduced further.
Detecting as many of the frequencies at which the amplitude is a local maximum as possible is considered to increase the beat-reduction effect. The number detected may be a predetermined number or an arbitrary number (as many as can be detected). When a predetermined number is used, the value may be user-settable.
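The selection rule above can be sketched as follows. This is an illustrative reading, not the patented procedure: the function name `choose_time_shift`, the limit `max_multiple`, and the use of a plain sum to combine the differences for the individual peaks are assumptions.

```python
def choose_time_shift(peaks, max_multiple=20):
    """Pick a time scaling amount (seconds) from spectral peaks.

    peaks -- list of (frequency_hz, amplitude) pairs, in any order
    Candidates are k * T1 for k = 1..max_multiple, where T1 is the period of
    the largest-amplitude peak. The candidate closest to integer multiples
    of the other peaks' periods (summed distance) is returned.
    """
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    base_period = 1.0 / ordered[0][0]
    other_periods = [1.0 / f for f, _ in ordered[1:]]

    best_t, best_cost = None, float("inf")
    for k in range(1, max_multiple + 1):
        t = k * base_period
        # Distance from t to the nearest integer multiple of each period.
        cost = sum(abs(t - round(t / p) * p) for p in other_periods)
        # The small tolerance keeps the shortest candidate on near-ties.
        if cost < best_cost - 1e-12:
            best_t, best_cost = t, cost
    return best_t
```

For peaks at 100 Hz (largest amplitude) and 150 Hz, the candidates are multiples of 10 ms. The 20 ms candidate coincides with three periods of the 150 Hz component, so it is selected over 10 ms.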
Another time scaling method for an audio signal processing device of the present invention performs time scaling on an acquired audio signal using the crossfade method. In this method, the audio signal processing device executes a percussion instrument sound determination step of analyzing the acquired audio signal to determine the presence or absence of percussion instrument sound, and a second time scaling step. When the percussion instrument sound determination step determines that no percussion instrument sound is present, the second time scaling step frequency-analyzes the acquired audio signal to detect a plurality of frequencies at which the amplitude is a local maximum, and performs time scaling by crossfading with a time scaling amount corresponding to the time, among candidate times that are integer multiples of the period of the frequency with the largest amplitude, whose difference from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes is smallest.
Another audio signal processing device of the present invention includes audio signal acquisition means for acquiring an audio signal and time scaling means for performing time scaling on the acquired audio signal using the crossfade method. The time scaling means has percussion instrument sound determination means for analyzing the acquired audio signal to determine the presence or absence of percussion instrument sound, and second time scaling means. When the percussion instrument sound determination means determines that no percussion instrument sound is present, the second time scaling means frequency-analyzes the acquired audio signal to detect a plurality of frequencies at which the amplitude is a local maximum, and performs time scaling by crossfading with a time scaling amount corresponding to the time, among candidate times that are integer multiples of the period of the frequency with the largest amplitude, whose difference from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes is smallest.
According to these configurations, when no percussion instrument sound is present, frequency analysis is performed to detect a plurality of frequencies at which the amplitude is a local maximum, and time scaling is performed with a time scaling amount based on those frequencies. As described above, this effectively reduces the “beat” of the sound that is the problem when melody-based music is crossfaded.
In the time scaling method described above, the second time scaling step calculates the time scaling amount with an evaluation function weighted by the amplitude ratio of the plurality of frequencies.
According to this configuration, the time scaling amount is calculated with the amplitude ratio (intensity) of the plurality of frequencies taken into account, so the “beat” of the sound can be reduced more effectively.
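One possible form of such a weighted evaluation function is written out below. It is only an assumed example: the actual function is defined by the drawings, which are not reproduced here. The symbols are introduced for illustration: $f_1, \dots, f_M$ are the detected peak frequencies ordered by decreasing amplitude $A_1 \ge \dots \ge A_M$, $T_j$ is the period of peak $j$, and $t_k$ is the $k$-th candidate time.

```latex
t_k = k\,T_1, \qquad T_j = \frac{1}{f_j}, \qquad
E(t_k) = \sum_{j=2}^{M} \frac{A_j}{A_1}\,
         \min_{m \in \mathbb{N}} \left| t_k - m\,T_j \right|, \qquad
t^{*} = \operatorname*{arg\,min}_{k}\; E(t_k)
```

Under this assumed form, a mismatch against a strong secondary peak costs more than the same mismatch against a weak one, which matches the stated intent of weighting by amplitude ratio.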
In the pitch shift method for an audio signal processing device of the present invention, the audio signal processing device executes each step of the time scaling method described above and a sampling rate conversion step of performing sampling rate conversion before or after those steps. The sampling rate conversion step cancels the change in the time length of the audio signal caused by the time scaling and the sampling rate conversion, so that only the pitch is changed.
The audio signal processing device described above further includes sampling rate conversion means for performing sampling rate conversion. The sampling rate conversion means cancels the change in the time length of the audio signal caused by the time scaling and the sampling rate conversion, so that only the pitch is changed.
According to these configurations, the audio signal can be pitch-shifted while the sound-quality degradation caused by the time scaling processing is reduced.
The program of the present invention causes a computer to execute each step of the time scaling method for an audio signal processing device described above. Another program of the present invention causes a computer to execute each step of the pitch shift method for an audio signal processing device described above.
Executing these programs realizes time scaling and pitch shifting suited to music using the crossfade method while reducing sound-quality degradation.
FIG. 1 is a block diagram of a playback device according to an embodiment of the present invention and of the audio signal processing unit that is part of it.
FIG. 2 is a flowchart showing the time scaling processing and the pitch shift processing by the audio signal processing unit.
FIG. 3 is a diagram showing a specific example of crossfading for time expansion.
FIG. 4 is a diagram showing a specific example of crossfading for time compression.
FIG. 5 is a diagram showing an example of band division.
FIG. 6 is a flowchart showing the prohibited area detection processing by the audio signal processing unit.
FIG. 7 is a diagram showing the concept of the prohibited area detection method.
FIG. 8 is a diagram showing a specific example of prohibited areas detected for rhythm-based music.
FIG. 9 is a diagram showing an example of frequency conversion.
FIG. 10 is a diagram showing the concept of the method for determining the time scaling amount.
FIG. 11 is a diagram showing the evaluation function for determining the time scaling amount.
Hereinafter, a time scaling method for an audio signal processing device, a pitch shift method for an audio signal processing device, an audio signal processing device, and a program according to an embodiment of the present invention are described in detail with reference to the accompanying drawings. This embodiment illustrates the case where the audio signal processing device of the present invention is applied to a playback device such as a CD player.
FIG. 1(a) is a simplified block diagram of the playback device 1. As shown in the figure, the playback device 1 includes a playback unit 2, an audio signal processing unit 3 (audio signal processing device), a buffer memory 4, and an audio signal output unit 5. The playback unit 2 reads music from a medium such as a CD and plays it back. The main part of the audio signal processing unit 3 is a CPU (Central Processing Unit) or a DSP (Digital Signal Processor). It stores the audio signal reproduced by the playback unit 2 in the buffer memory 4 and applies digital signal processing to the audio signal read from the buffer memory 4. The audio signal output unit 5 outputs the audio signal processed by the audio signal processing unit 3 to the outside (for example, an output device having an amplifier and speakers).
FIG. 1(b) is a block diagram showing the functional configuration of the audio signal processing unit 3. As its main functional components, the audio signal processing unit 3 includes audio signal acquisition means 9 and time scaling means 10. The audio signal acquisition means 9 acquires the audio signal to be processed from the buffer memory 4. The time scaling means 10 consists of prohibited area detection means 11, first time scaling means 12, and second time scaling means 13. The time scaling means 10 of this embodiment performs time scaling using the crossfade method. Specific examples of crossfading are described later with reference to FIGS. 3 and 4.
The prohibited area detection means 11 analyzes the audio signal acquired by the audio signal acquisition means 9 and detects a prohibited area N (see FIG. 7 and others) affected by a percussion instrument sound (drum sound). Specifically, it divides the acquired audio signal into bands by wavelet transform to generate a plurality of converted signals Bi (where i = 1, ..., n) (a plurality of bands, see FIG. 5), and detects the prohibited area N using one or more of the converted signals Bi on which the influence of the percussion instrument sound is large. The prohibited area N is a portion where the sound quality is easily changed by crossfade time scaling (time-axis operations such as expansion or compression), that is, where “double sounding”, “click noise”, and “rhythm disturbance” are likely to occur, so crossfading is prohibited there.
When the prohibited area detection means 11 detects a prohibited area N, the first time scaling means 12 performs time scaling by crossfading in an area other than the prohibited area N (in this embodiment, a target area O sandwiched between two prohibited areas N, see FIG. 8). Note that “when the prohibited area detection means 11 detects a prohibited area N” means that the acquired audio signal is based on rhythm-based music. In other words, for rhythm-based music, the first time scaling means 12 avoids the prohibited areas N affected by percussion instrument sounds and performs time scaling in an appropriate area (target area O).
The prohibited area N may be detected over a partial range of the music or over the music as a whole. A prohibited area N may also be regarded as detected only when a predetermined number or more are found. The “partial range” used when detecting over part of the music and the “predetermined number” used as the detection criterion may be user-settable.
On the other hand, when the prohibited area detection means 11 does not detect a prohibited area N, the second time scaling means 13 frequency-analyzes the acquired audio signal to detect a plurality of frequencies at which the amplitude is a local maximum (peak frequencies), and performs time scaling by crossfading with a time scaling amount based on those frequencies. Specifically, among candidate times that are integer multiples of the period of the peak frequency with the largest amplitude, it selects the time whose difference from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes is smallest, and time-scales by the corresponding amount. This effectively reduces the “beat” of the sound caused by the phase shift that arises during crossfading.
Note that “when the prohibited area detection means 11 does not detect a prohibited area N” means that the acquired audio signal is based on melody-based music. In other words, for melody-based music, the second time scaling means 13 performs time scaling with an appropriate time scaling amount based on the plurality of peak frequencies obtained by frequency analysis.
FIG. 1(c) is a block diagram of the case where SRC means 21 is added to the time scaling means 10 of FIG. 1(b) to realize pitch shift means 20. The SRC means 21 performs SRC processing, which changes the sampling frequency of the digital audio waveform, before or after the time scaling processing by the time scaling means 10. The pitch shift means 20 thereby cancels the change in the time length of the audio signal caused by the time scaling means 10 and the SRC means 21, and realizes a pitch shift that changes only the pitch without changing the length on the time axis.
Next, the time scaling processing by the audio signal processing unit 3 is described with reference to the flowchart of FIG. 2(a). When the audio signal processing unit 3 acquires an audio signal (S01), it divides the signal into bands and generates a plurality of converted signals Bi (S02).
It then detects the prohibited area N using one or more of the converted signals Bi on which the influence of the percussion instrument sound is large (S03). Steps S02 and S03 are performed by the prohibited area detection means 11.
Next, the music type is determined according to whether a prohibited area N was detected (S04). As described above, when a prohibited area N is detected, the music is determined to be rhythm-based and time scaling is performed by the first time scaling means 12 (S05). When no prohibited area N is detected, the music is determined to be melody-based and time scaling is performed by the second time scaling means 13 (S06). Steps S04 to S06 are performed by the first time scaling means 12 and the second time scaling means 13.
Next, the pitch shift process performed by the audio signal processing unit 3 is described with reference to the flowchart of FIG. 2(b). This flowchart adds SRC processing (S22) before the time scaling process (S10). On acquiring an audio signal (S21), the audio signal processing unit 3 changes the pitch by sampling rate conversion in accordance with the pitch shift amount (S22). Because this step stretches or shrinks the duration along with the pitch change, the time scaling process (S10, excluding S01 of FIG. 2(a)) restores the duration of the original audio signal, realizing a pitch shift that changes only the pitch, with no change in duration.
The pitch shift amount may be specified by a user operation, or it may be calculated automatically, for example to match the pitch of another piece of music. Although the flowchart of FIG. 2(b) performs the SRC processing (S22) before the time scaling process (S10), the SRC processing (S22) may instead be performed after the time scaling process (S10).
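The relationship between the SRC step (S22) and the time scaling step (S10) can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names `resample_linear` and `required_stretch` are hypothetical, linear interpolation stands in for whatever sampling rate converter is actually used, and the resampled signal is assumed to be played back at the original sampling rate.

```python
def resample_linear(samples, ratio):
    """Resample by linear interpolation (stand-in for SRC, an assumption).

    ratio > 1 raises the pitch and shortens the signal;
    ratio < 1 lowers the pitch and lengthens it.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    n_out = max(1, int(round(len(samples) / ratio)))
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * ratio                 # read position in the input
        i = int(pos)
        frac = pos - i
        a = samples[min(i, last)]
        b = samples[min(i + 1, last)]
        out.append(a + (b - a) * frac)
    return out


def required_stretch(original_len, resampled_len):
    """Factor by which time scaling must lengthen (>1) or shorten (<1)
    the resampled signal to restore the original duration."""
    return original_len / resampled_len
```

For example, raising the pitch by a factor of 2 halves the number of samples, so the subsequent time scaling has to double the duration to get back to the original length.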
Next, specific examples of crossfading are described with reference to FIGS. 3 and 4. FIG. 3 shows specific examples of stretching original music data by crossfading.
"Crossfade Example 1" in FIG. 3(a) shows crossfading that uses signal A, reproduced immediately before the crossfade position CFP in the original music data, and signal B, reproduced immediately after the crossfade position CFP (where the reproduction time of signal B equals that of signal A). In the stretched data, signal A is reproduced and then the crossfade takes place: signal B is faded out (gradually reduced from its original volume to 0) while signal A is faded in (gradually raised from 0 to its original volume). This crossfade portion is the increase (the time scaling amount). Signal B is then reproduced.
"Crossfade Example 2" in FIG. 3(b) shows a case where the increase is not limited to the crossfade portion. For example, as shown in the figure, when the original music data is reproduced in the order signal A, signal B, signal C (where the reproduction time of signal C equals that of signal A) and the crossfade position CFP lies between signal B and signal C, the stretched data reproduces signals A and B, then the crossfade (signal C fading out and signal A fading in), and then signals B and C in that order. The increase (time scaling amount) in this case is "crossfade portion + reproduction time of signal B".
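The two stretching examples of FIG. 3 can be sketched in a few lines. This is an illustration under stated assumptions: signals are plain lists of samples, the fade curves are linear (the description does not specify the curve), and the function names are hypothetical.

```python
def crossfade(fade_out, fade_in):
    """Linear crossfade of two equal-length segments.

    The two gains always sum to 1 (linear curve is an assumption).
    """
    if len(fade_out) != len(fade_in):
        raise ValueError("segments must have equal length")
    n = len(fade_out)
    mixed = []
    for k in range(n):
        g_in = (k + 0.5) / n            # fade-in gain rises from ~0 to ~1
        mixed.append(fade_out[k] * (1.0 - g_in) + fade_in[k] * g_in)
    return mixed


def stretch_example1(a, b):
    """Example 1: A, then (B fading out + A fading in), then B.
    The output is longer than A + B by the crossfade length."""
    return a + crossfade(b, a) + b


def stretch_example2(a, b, c):
    """Example 2: A, B, then (C fading out + A fading in), then B, C.
    The output is longer than A + B + C by crossfade length + len(B)."""
    return a + b + crossfade(c, a) + b + c
```

In both cases the crossfade starts with material that naturally follows what was just played and ends with material that naturally precedes what comes next, which is what keeps the splice continuous.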
FIG. 4, on the other hand, shows specific examples of compressing original music data by crossfading. "Crossfade Example 3" in FIG. 4(a) shows crossfading that uses signal A and signal B, which are reproduced immediately after the crossfade position CFP in the original music data (where the reproduction time of signal B equals that of signal A). In the compressed data, only the crossfade portion, formed by fading out signal A and fading in signal B, is reproduced. This crossfade portion (= the reproduction time of signal B) is therefore the decrease (the time scaling amount).
"Crossfade Example 4" in FIG. 4(b) shows a case where the decrease is not limited to the crossfade portion. For example, as shown in the figure, when the original music data is reproduced in the order signal A, signal B, signal C immediately after the crossfade position CFP (where the reproduction time of signal C equals that of signal A), the compressed data reproduces only the crossfade portion formed by fading out signal A and fading in signal C. The decrease (time scaling amount) in this case is "reproduction time of signal B + reproduction time of signal C".
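The compression examples of FIG. 4 follow the same pattern in reverse. The sketch below is illustrative only: it assumes list-of-samples signals and a linear fade curve, and all function names are hypothetical.

```python
def crossfade(fade_out, fade_in):
    """Linear crossfade of two equal-length segments (curve is an assumption)."""
    if len(fade_out) != len(fade_in):
        raise ValueError("segments must have equal length")
    n = len(fade_out)
    return [
        fade_out[k] * (1.0 - (k + 0.5) / n) + fade_in[k] * ((k + 0.5) / n)
        for k in range(n)
    ]


def compress_example3(a, b):
    """Example 3: A and B collapse into one crossfade (A out, B in).
    The output is shorter than A + B by len(B)."""
    return crossfade(a, b)


def compress_example4(a, b, c):
    """Example 4: A, B, C collapse into one crossfade (A out, C in).
    B is dropped entirely, so the output is shorter by len(B) + len(C)."""
    return crossfade(a, c)
```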
With SRC, whether the pitch is raised (tempo up) or lowered (tempo down), the reproduction time stretches or shrinks in proportion to the change in pitch. Restoring it by crossfade-based time scaling realizes a pitch shift in which the playing time equals that of the original music and only the pitch changes. The time scaling process of this embodiment is applicable to every example shown in FIGS. 3 and 4.
That is, by having the first time scaling means 12 set the crossfade position CFP so as to avoid the prohibited area N, the occurrence of "double sounding", "click noise", "rhythm disturbance" and the like can be reduced, particularly for rhythm-oriented music. Likewise, by having the second time scaling means 13 set the time scaling amount, which corresponds to the "increase" and "decrease" above, to an appropriate amount (time), the occurrence of "beating" in the sound can be reduced, particularly for melody-oriented music.
Next, detection of the prohibited area N is described in detail with reference to FIGS. 5 to 8. FIG. 5 shows an example of band division using the DWT (discrete wavelet transform). FIG. 5(a) shows the original sound. The example is a drum sound in which a bass drum and a hi-hat are struck simultaneously, followed by the hi-hat alone.
FIG. 5(b) shows the result of dividing the original sound of FIG. 5(a) into bands by approximately 33 DWT operations and approximately 330 IDWT (inverse discrete wavelet transform) operations. As shown in FIG. 5(b), this embodiment divides the signal into 10 frequency bands (band 1 to band 10) and generates 10 converted signals Bi (B1 to B10).
FIG. 6 is a flowchart of the prohibited area detection process performed by the audio signal processing unit 3 (prohibited area detection means 11). For the 10 converted signals Bi, the audio signal processing unit 3 detects and holds the peak positions influenced by percussion sounds (S11) and then binarizes the result (S12). The ranges judged to be "1" by the binarization are provisionally determined as prohibited areas N, after which gap filling is performed (S13). In this step, when the area between two prohibited areas N (an area where crossfading is possible) is too small (no larger than a predetermined amount), the two prohibited areas N are joined into a single prohibited area N. The prohibited areas N detected in the individual converted signals Bi are then combined by an OR operation (S14), and the prohibited area N in which crossfading is ultimately to be prohibited is finalized (S15).
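Steps S11 to S14 can be sketched as below. This is a minimal illustration under assumptions, not the detector of the embodiment: the peak hold is modeled as a decaying maximum, the threshold and decay values are arbitrary parameters, the masks are per-sample lists of 0 and 1, and every function name is hypothetical.

```python
def peak_hold(samples, decay):
    """S11 (assumed form): track |x| and let the held peak decay per sample."""
    held = 0.0
    envelope = []
    for x in samples:
        held = max(abs(x), held * decay)
        envelope.append(held)
    return envelope


def binarize(envelope, threshold):
    """S12: 1 marks a provisional prohibited sample, 0 a crossfadable one."""
    return [1 if v >= threshold else 0 for v in envelope]


def fill_gaps(mask, min_gap):
    """S13: join two prohibited runs when the gap between them is
    min_gap samples or shorter. Leading and trailing gaps are left alone."""
    out = list(mask)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1
            # The gap is [i, j); fill it only if prohibited on both sides.
            if i > 0 and j < n and (j - i) <= min_gap:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out


def merge_bands(masks):
    """S14: OR the per-band masks sample by sample."""
    return [1 if any(column) else 0 for column in zip(*masks)]
```

The runs of 0 that remain in the merged mask are the areas where a crossfade position may be placed.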
FIG. 7 illustrates the concept of the method for detecting the prohibited area N. FIG. 7(a) shows the original sound, as in FIG. 5(a). The dotted frame D1 is a hi-hat strike (kick strike) and is considered to correspond to a beat position. The dotted frame D2 is a hi-hat strike and is considered to correspond to an off-beat position. The prohibited area N is therefore determined based on the bands that are loud in the areas (detection areas) corresponding to these dotted frames D1 and D2 (band 2 and band 7 in this embodiment; see FIG. 5(b)).
FIG. 7(b) shows the prohibited area N being provisionally determined from the bands that are loud in the areas corresponding to the dotted frames D1 and D2, that is, from the two converted signals Bi most strongly influenced by percussion sounds. FIG. 7(c) shows the provisionally determined prohibited areas N of the two bands being combined to finalize the prohibited area N.
In the example of FIG. 7, the prohibited area N detected from band 7 is entirely contained in the prohibited area N detected from band 2, so detection on band 7 may seem unnecessary. However, because there may be cases where no hi-hat is present, it is preferable to determine the prohibited area N based on a plurality of bands as described above.
In the above example, the prohibited area N was provisionally determined from two converted signals Bi as those most strongly influenced by percussion sounds, but the number of such converted signals Bi is not limited to two. The number of converted signals Bi to be detected may be a predetermined number or an arbitrary number (as many as can be detected). When a predetermined number is used, the value may be made user-configurable.
FIG. 8 shows a specific example of the prohibited areas N detected for rhythm-oriented music. Because drum sounds repeat in rhythm-oriented music, a prohibited area N is detected each time a beat appears. The first time scaling means 12 of this embodiment therefore performs time scaling in the target area O sandwiched between prohibited areas N, reducing the degradation in sound quality caused by crossfading.
Next, the second time scaling means 13 is described in detail with reference to FIGS. 9 to 11. The second time scaling means 13 eliminates "beating" at the peak frequency by time scaling that shifts the signal by an integer multiple of the peak period.
FIG. 9 shows FFT (fast Fourier transform) data with frequency on the horizontal axis and intensity on the vertical axis, for one of the many samples detected. As described above, when no prohibited area N is detected in the prohibited area detection process, the second time scaling means 13 performs frequency analysis on the acquired audio signal and detects a plurality of frequencies at which the amplitude is a local maximum (peak frequencies). In this embodiment, three such frequencies are detected, in descending order of amplitude (intensity). In the example of FIG. 9, "514.1 Hz" with an intensity of "-20.1 dB", "1468.3 Hz" with an intensity of "-28.9 dB", and "6461.3 Hz" with an intensity of "-27.8 dB" are detected as peak frequency candidates.
The number of "frequencies at which the amplitude is a local maximum" is not limited to three. A predetermined number of four or more may be detected, or an arbitrary number (as many as can be detected). When a predetermined number is used, the value may be made user-configurable. Detecting as many such frequencies as possible is, however, considered to be more effective in reducing the degradation in sound quality caused by crossfading.
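Selecting the strongest local maxima from a magnitude spectrum can be sketched as follows. This is an illustration, not the detector of the embodiment: the spectrum is assumed to be already available as parallel lists of bin frequencies and magnitudes (linear or dB, since only the ordering matters), and `top_peaks` is a hypothetical name.

```python
def top_peaks(freqs, magnitudes, count=3):
    """Return up to `count` (frequency, magnitude) local maxima,
    strongest first. The first and last bins are never treated as peaks."""
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        if magnitudes[k] > magnitudes[k - 1] and magnitudes[k] >= magnitudes[k + 1]:
            peaks.append((freqs[k], magnitudes[k]))
    peaks.sort(key=lambda p: p[1], reverse=True)
    return peaks[:count]
```

The `count` parameter corresponds to the predetermined number of peaks discussed above; passing a large value returns every local maximum found.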
FIG. 10 illustrates the concept of the method for determining the time scaling amount. As shown in FIG. 10(a), when $\omega_1$, $\omega_2$ and $\omega_3$ are detected as peak frequencies, the peak periods $T_1 = 1/\omega_1$, $T_2 = 1/\omega_2$ and $T_3 = 1/\omega_3$ are obtained from them.
Next, an appropriate time scaling amount for a single operation is determined from these three peak periods $T_1$, $T_2$ and $T_3$. Specifically, the time that is an integer multiple of the period $T_1$ and has the smallest error with respect to the periods $T_2$ and $T_3$ is taken as the time scaling amount for a single operation. Let the time scaling amount for a single operation determined in this way be $T_{CF}$, and let the stretch ratio of the time scaling be $\alpha$. A time scaling operation is then performed once for every $T = T_{CF}/|1-\alpha|$ of original music time. Because the time scaling amount changes each time the peak frequencies are computed, the time scaling period $T$ is also variable.
When a pitch shift is performed in combination with SRC, the stretch ratio $\alpha$ of the time scaling is inversely proportional to the target pitch change ratio $\beta$ of the pitch shift: $\alpha = 1/\beta$.
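The period formula follows from simple bookkeeping: if each operation adds (or removes) $T_{CF}$ of audio and one operation is performed per $T$ of original music time, the overall ratio is $(T \pm T_{CF})/T$, and setting this equal to $\alpha$ gives $T = T_{CF}/|1-\alpha|$. A small sketch of the arithmetic, with a hypothetical function name and illustrative values that are not taken from the embodiment:

```python
def scaling_period(t_cf, alpha):
    """Original-music time between time scaling operations, in seconds.

    t_cf  : time added or removed by one operation (seconds)
    alpha : overall stretch ratio (alpha == 1 means no scaling at all)
    """
    if alpha == 1.0:
        raise ValueError("alpha == 1 requires no time scaling")
    return t_cf / abs(1.0 - alpha)


# Illustrative values only: one 30 ms operation at a ratio of 1.25
period = scaling_period(0.030, 1.25)
```

With these illustrative values, one operation is needed for roughly every 0.12 s of original music.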
That is, as shown in FIG. 10(b), blocks representing the periods $T_1$, $T_2$ and $T_3$ are laid out period by period with their starting points aligned, and the positions (times) corresponding to integer multiples of the period $T_1$ are labeled L1 to L9. Among L4 to L9, which lie in the range where crossfading is possible, from the crossfade upper limit position (20 ms in the example of the figure) to the lower limit position (50 ms in the example of the figure), the one with the smallest error from the positions corresponding to integer multiples of the period $T_2$ and of the period $T_3$ is "L6". Since "L6" corresponds to six times the period $T_1$, the appropriate time scaling amount can be calculated as $T_{CF} = 6T_1 = 6/\omega_1$. In practice, however, the time scaling amount $T_{CF}$ is calculated using the evaluation function described below.
FIG. 11 shows the evaluation function used to determine the time scaling amount $T_{CF}$. As shown in the figure, this evaluation function is weighted by the ratio of the peak amplitudes $A_n$ (the amplitudes at the peak frequencies). The CF upper limit position $t_1$ and the CF lower limit position $t_2$ are values obtained empirically. The second time scaling means 13 of this embodiment finds the value of $i$ that minimizes this evaluation function and obtains the time scaling amount $T_{CF}$ from that value of $i$. That is, the time scaling amount can be calculated as $T_{CF} = iT_1 = i/\omega_1$.
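The search for $T_{CF}$ can be sketched as below. The exact evaluation function of FIG. 11 is not reproduced in the text, so the cost used here is an assumption: for each candidate $iT_1$ inside the crossfade window, it sums the distance to the nearest integer multiple of each other peak period, weighted by that peak's amplitude relative to the strongest peak. The function name `choose_tcf` and its parameters are hypothetical.

```python
def choose_tcf(peak_freqs, peak_amps, t_lo, t_hi):
    """Pick T_CF = i * T1 within [t_lo, t_hi] (seconds).

    peak_freqs[0] must be the strongest peak; peak_amps are linear
    amplitudes in the same order. Returns None if no multiple of T1
    falls inside the window. The cost function is an assumed form.
    """
    periods = [1.0 / f for f in peak_freqs]
    t1 = periods[0]
    best_t, best_cost = None, None
    i = 1
    while i * t1 <= t_hi:
        t = i * t1
        if t >= t_lo:
            cost = 0.0
            for tn, an in zip(periods[1:], peak_amps[1:]):
                nearest = round(t / tn) * tn        # closest multiple of Tn
                cost += (an / peak_amps[0]) * abs(t - nearest)
            if best_cost is None or cost < best_cost:
                best_t, best_cost = t, cost
        i += 1
    return best_t
```

A possible call with the peak values quoted for FIG. 9 would convert the dB intensities to linear amplitudes first, for example `10 ** (-20.1 / 20)`, and pass `0.020` and `0.050` as the window limits.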
As described above, according to this embodiment, when time scaling is performed using the crossfade method, the prohibited area N (crossfade prohibited area) affected by percussion sounds is detected, and appropriate time scaling can be performed according to the music type determined from the presence or absence of the prohibited area N.
That is, when a prohibited area N is detected, the music can be judged to be rhythm-oriented. In this case, time scaling is performed so as to avoid the prohibited area N, which eliminates the "double sounding", "click noise", "rhythm disturbance" and similar problems that arise when crossfading is applied to rhythm-oriented music. In particular, because this embodiment performs time scaling in the target area O sandwiched between two prohibited areas N, the conversion preserves the relative positions of the prohibited areas N on the time axis in the same ratio as before time scaling. When this is applied to a pitch shift in which the pitch is changed by SRC before time scaling and the resulting change in duration is restored to that of the original music by time scaling, the positions of the prohibited areas N on the time axis can be made the same in the original music data and in the pitch-shifted data.
When no prohibited area N is detected, on the other hand, the music can be judged to be melody-oriented. In this case, a plurality of peak frequencies at which the amplitude is a local maximum are detected, and time scaling is performed with a time scaling amount corresponding to the time that, among candidate times that are integer multiples of the period based on the frequency of largest amplitude, differs least from integer multiples of the periods based on the frequencies of second-largest and lower amplitude. The "beating" caused by phase shifts at the frequencies of second-largest and lower amplitude can therefore also be reduced. Moreover, because the evaluation function used to calculate the time scaling amount is weighted according to the amplitude ratio (intensity) of the plurality of peak frequencies, the "beating" can be reduced more effectively.
Furthermore, because band division for detecting the prohibited area N is performed by DWT, the influence of the time delay that accompanies filter processing can be reduced compared with band division using filters.
In the above embodiment, the audio signal processing unit 3 includes both the first time scaling means 12 and the second time scaling means 13, but it may include only one of them. When the audio signal processing unit 3 includes only the second time scaling means 13, the presence or absence of percussion sounds may simply be determined instead of detecting the prohibited area N (percussion sound determination means). In this case, time scaling by the second time scaling means 13 is performed when it is determined that no percussion sound is present.
In the above embodiment, the prohibited area detection means 11 detects the prohibited area N while analyzing the audio signal written to the buffer memory 4 as the reproduction unit 2 reproduces it, but the prohibited area N may instead be detected by reading out data analyzed in advance. That is, time scaling may be performed in real time while the music is being reproduced, or all or part of a piece of music may be time-scaled using data analyzed in advance.
Each component of the audio signal processing unit 3 described above can be provided as a program. The program can also be provided stored on various recording media (CD-ROM, flash memory, and so on). That is, a program for causing a computer to function as each component of the audio signal processing unit 3, and a recording medium on which it is recorded, are also included in the scope of rights of the present invention.
The above embodiment illustrates the case where the audio signal processing unit 3 is applied to the playback device 1, but it may also be applied to DJ equipment such as mixers, to various electronic musical instruments, and to computers (PC applications). Application to audio processing devices that have a pitch-changing function, such as karaoke machines, voice changers and speech synthesizers, is also useful. Time scaling alone can also be applied, for example when only the time-axis length of the audio is changed without changing the pitch during double-speed playback on a video (DVD) recorder. Other modifications may be made as appropriate without departing from the gist of the present invention.
DESCRIPTION OF SYMBOLS
1 ... playback device
2 ... reproduction unit
3 ... audio signal processing unit
4 ... buffer memory
5 ... audio signal output unit
9 ... audio signal acquisition means
10 ... time scaling means
11 ... prohibited area detection means
12 ... first time scaling means
13 ... second time scaling means
20 ... pitch shift means
21 ... SRC means
CFP ... crossfade position
N ... prohibited area
O ... target area

Claims (12)

  1.  A time scaling method for an audio signal processing device that performs time scaling on an acquired audio signal using a crossfade method, wherein
     the audio signal processing device executes:
     a prohibited area detection step of analyzing the acquired audio signal and detecting a prohibited area affected by a percussion sound; and
     a first time scaling step of, when the prohibited area is detected in the prohibited area detection step, performing time scaling by crossfading in an area other than the prohibited area.
  2.  The time scaling method for an audio signal processing device according to claim 1, wherein, in the first time scaling step, time scaling is performed by crossfading in an area sandwiched between two of the prohibited areas.
  3.  The time scaling method for an audio signal processing device according to claim 1, wherein, in the prohibited area detection step, the acquired audio signal is divided into bands by wavelet transform to generate a plurality of converted signals Bi (where i = 1, ..., n), and the prohibited area is detected using one or more converted signals Bi, among the plurality of converted signals Bi, that are strongly influenced by the percussion sound.
  4.  The time scaling method for an audio signal processing device according to claim 1, wherein
     the audio signal processing device further executes a second time scaling step of, when the prohibited area is not detected in the prohibited area detection step, detecting a plurality of frequencies at which the amplitude is a local maximum by performing frequency analysis on the acquired audio signal, and performing time scaling by crossfading with a time scaling amount corresponding to the time that, among candidate times that are integer multiples of the period based on the frequency of largest amplitude among the plurality of frequencies, differs least from integer multiples of the periods based on the frequencies of second-largest and lower amplitude.
  5.  A time scaling method for an audio signal processing device that performs time scaling on an acquired audio signal using a crossfade method, wherein
     the audio signal processing device executes:
     a percussion sound determination step of analyzing the acquired audio signal and determining the presence or absence of a percussion sound; and
     a second time scaling step of, when it is determined in the percussion sound determination step that no percussion sound is present, detecting a plurality of frequencies at which the amplitude is a local maximum by performing frequency analysis on the acquired audio signal, and performing time scaling by crossfading with a time scaling amount corresponding to the time that, among candidate times that are integer multiples of the period based on the frequency of largest amplitude among the plurality of frequencies, differs least from integer multiples of the periods based on the frequencies of second-largest and lower amplitude.
  6.  The time scaling method for an audio signal processing device according to claim 5, wherein, in the second time scaling step, the time scaling amount is calculated by an evaluation function weighted by the amplitude ratio of the plurality of frequencies.
  7.  A pitch shift method for an audio signal processing device, wherein
     the audio signal processing device executes:
     each step of the time scaling method for an audio signal processing device according to any one of claims 1 to 6; and
     a sampling rate conversion step of performing sampling rate conversion before or after said steps,
     wherein, in the sampling rate conversion step, the changes in the time length of the audio signal caused by the time scaling and by the sampling rate conversion cancel each other out, so that only the pitch is changed.
  8.  An audio signal processing device comprising:
     audio signal acquisition means for acquiring an audio signal; and
     time scaling means for performing time scaling on the acquired audio signal using a cross-fade method,
     wherein the time scaling means includes:
     prohibited region detection means for analyzing the acquired audio signal and detecting a prohibited region affected by a percussion instrument sound; and
     first time scaling means for, when the prohibited region detection means detects the prohibited region, performing time scaling by cross-fading in a region other than the prohibited region.
  9.  An audio signal processing device comprising:
     audio signal acquisition means for acquiring an audio signal; and
     time scaling means for performing time scaling on the acquired audio signal using a cross-fade method,
     wherein the time scaling means includes:
     percussion instrument sound determination means for analyzing the acquired audio signal and determining whether a percussion instrument sound is present; and
     second time scaling means for, when the percussion instrument sound determination means determines that no percussion instrument sound is present, performing frequency analysis on the acquired audio signal to detect a plurality of frequencies at which the amplitude is a local maximum, and performing time scaling by cross-fading with a time scaling amount corresponding to the time that, among time candidates that are integer multiples of the period based on the frequency with the largest amplitude among the plurality of frequencies, has the smallest difference from an integer multiple of each period based on each frequency with the second-largest or smaller amplitude.
  10.  The audio signal processing device according to claim 8 or 9, further comprising sampling rate conversion means for performing sampling rate conversion,
     wherein the sampling rate conversion means cancels out the changes in the time length of the audio signal caused by the time scaling and by the sampling rate conversion, so that only the pitch is changed.
  11.  A program for causing a computer to execute each step in the time scaling method for an audio signal processing device according to any one of claims 1 to 6.
  12.  A program for causing a computer to execute each step in the pitch shift method for an audio signal processing device according to claim 7.
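
The selection rule in claims 5, 6 and 9 can be illustrated with a short sketch. It is not the applicant's implementation: the function name `choose_time_scaling_amount`, the `max_time` bound on the candidates, the use of hertz and seconds, and the particular weighting (each difference multiplied by that peak's amplitude divided by the largest amplitude) are all assumptions made for illustration. The claims only state that the evaluation function is weighted by the amplitude ratio of the frequencies.

```python
from typing import List, Optional, Tuple


def choose_time_scaling_amount(
    peaks: List[Tuple[float, float]],
    max_time: float,
    weighted: bool = True,
) -> Optional[float]:
    """Pick a time-scaling amount (seconds) from spectral peaks.

    peaks    -- (frequency_hz, amplitude) pairs for the local maxima of the
                spectrum, with positive amplitudes; peak detection itself is
                outside this sketch.
    max_time -- hypothetical upper bound on the candidate times, in seconds.
    weighted -- if True, weight each difference by amplitude / largest
                amplitude (an assumed reading of claim 6).
    """
    if not peaks:
        raise ValueError("at least one spectral peak is required")

    # Sort so that the peak with the largest amplitude comes first.
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    f0, a0 = ordered[0]
    period0 = 1.0 / f0

    best_time: Optional[float] = None
    best_cost = float("inf")
    n = 1
    while n * period0 <= max_time:
        candidate = n * period0  # integer multiple of the dominant period
        cost = 0.0
        for freq, amp in ordered[1:]:
            period = 1.0 / freq
            # Distance from the candidate to the nearest integer multiple
            # of this secondary period.
            nearest = round(candidate / period) * period
            weight = amp / a0 if weighted else 1.0
            cost += weight * abs(candidate - nearest)
        # The small tolerance keeps the earliest candidate on numerical ties.
        if cost < best_cost - 1e-12:
            best_time, best_cost = candidate, cost
        n += 1
    return best_time  # None if no candidate fits within max_time
```

For peaks at 100 Hz and 150 Hz with a 0.05 s bound, the sketch selects 0.02 s, which is two periods of the 100 Hz component and three periods of the 150 Hz component, so both waveforms line up at the cross-fade point.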
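
The cancellation described in claims 7 and 10 comes down to simple arithmetic, sketched below under stated assumptions. The helper names `pitch_shift_plan` and `resample_linear`, the semitone parameterisation, and the linear-interpolation resampler are illustrative choices and do not come from the application. Here the time scaling is assumed to run first; the claims allow the sampling rate conversion before or after.

```python
from typing import List, Tuple


def pitch_shift_plan(semitones: float) -> Tuple[float, float, float]:
    """Return (pitch_ratio, time_scale_factor, resample_factor).

    The time scaling lengthens the signal by pitch_ratio without changing
    pitch; the sampling rate conversion then changes the sample count by
    1 / pitch_ratio, so the two duration changes cancel.
    """
    pitch_ratio = 2.0 ** (semitones / 12.0)
    time_scale_factor = pitch_ratio
    resample_factor = 1.0 / pitch_ratio
    return pitch_ratio, time_scale_factor, resample_factor


def resample_linear(samples: List[float], factor: float) -> List[float]:
    """Resample to about len(samples) * factor samples by linear interpolation.

    Played back at the original rate, the result sounds 1 / factor times
    higher in pitch and lasts factor times as long.
    """
    if not samples:
        return []
    out_len = max(1, round(len(samples) * factor))
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i / factor            # position in the input signal
        j = int(pos)
        frac = pos - j
        a = samples[min(j, last)]
        b = samples[min(j + 1, last)]
        out.append(a + (b - a) * frac)
    return out


# Usage: one octave up. A 1000-sample signal is first time-scaled to
# 2000 samples (placeholder data here), then resampled back to 1000.
ratio, stretch, shrink = pitch_shift_plan(12)
stretched = [0.0] * int(1000 * stretch)
shifted = resample_linear(stretched, shrink)
```

Because the time scale factor and the resample factor multiply to one, the output has the same length as the original input and only the pitch changes.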
PCT/JP2009/002711 2009-06-15 2009-06-15 Time-scaling method for voice signal processing device, pitch shift method for voice signal processing device, voice signal processing device, and program WO2010146624A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2009/002711 WO2010146624A1 (en) 2009-06-15 2009-06-15 Time-scaling method for voice signal processing device, pitch shift method for voice signal processing device, voice signal processing device, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2009/002711 WO2010146624A1 (en) 2009-06-15 2009-06-15 Time-scaling method for voice signal processing device, pitch shift method for voice signal processing device, voice signal processing device, and program

Publications (1)

Publication Number Publication Date
WO2010146624A1 (en) 2010-12-23

Family

ID=43355969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/002711 WO2010146624A1 (en) 2009-06-15 2009-06-15 Time-scaling method for voice signal processing device, pitch shift method for voice signal processing device, voice signal processing device, and program

Country Status (1)

Country Link
WO (1) WO2010146624A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016208000A1 (en) * 2015-06-24 2016-12-29 Pioneer DJ株式会社 Display control device, display control method, and display control program
WO2016208002A1 (en) * 2015-06-24 2016-12-29 Pioneer DJ株式会社 Display control device, display control method, and display control program
TWI574252B (en) * 2014-02-26 2017-03-11 蘇州樂聚一堂電子科技有限公司 Synchronous beat effect system and method for processing synchronous beat effect

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001051700A (en) * 1999-08-10 2001-02-23 Yamaha Corp Method and device for companding time base of multi- track voice source signal
JP2005173423A (en) * 2003-12-15 2005-06-30 Roland Corp Waveform reproducing device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI574252B (en) * 2014-02-26 2017-03-11 蘇州樂聚一堂電子科技有限公司 Synchronous beat effect system and method for processing synchronous beat effect
WO2016208000A1 (en) * 2015-06-24 2016-12-29 Pioneer DJ株式会社 Display control device, display control method, and display control program
WO2016208002A1 (en) * 2015-06-24 2016-12-29 Pioneer DJ株式会社 Display control device, display control method, and display control program
JPWO2016208002A1 (en) * 2015-06-24 2018-03-15 Pioneer DJ株式会社 Display control apparatus, display control method, and display control program

Similar Documents

Publication Publication Date Title
JP4823804B2 (en) Chord name detection device and chord name detection program
US8415549B2 (en) Time compression/expansion of selected audio segments in an audio file
KR100677622B1 (en) Method for equalizer setting of audio file and method for reproducing audio file using thereof
JP4645241B2 (en) Voice processing apparatus and program
US20070191976A1 (en) Method and system for modification of audio signals
JPH0997091A (en) Method for pitch change of prerecorded background music and karaoke system
WO2014003072A1 (en) Automated performance technology using audio waveform data
JP2009300707A (en) Information processing device and method, and program
JP6751810B2 (en) Voice processing device and voice processing method
JP2012002858A (en) Time scaling method, pitch shift method, audio data processing apparatus and program
US5196639A (en) Method and apparatus for producing an electronic representation of a musical sound using coerced harmonics
WO2010146624A1 (en) Time-scaling method for voice signal processing device, pitch shift method for voice signal processing device, voice signal processing device, and program
JP6118522B2 (en) Time scaling method, pitch shift method, audio data processing apparatus and program
KR20080036518A (en) Apparatus and method for expanding/compressing audio signal
JP3659489B2 (en) Digital audio processing apparatus and computer program recording medium
JP3008922B2 (en) Music sound generating apparatus and music sound generating method
JP4542805B2 (en) Variable speed reproduction method and apparatus, and program
JP3780857B2 (en) Waveform editing method and waveform editing apparatus
JP6616099B2 (en) Audio processing device
US20030165326A1 (en) Audio frequency shifting during video trick modes
JP6803494B2 (en) Voice processing device and voice processing method
JP5552794B2 (en) Method and apparatus for encoding acoustic signal
JP2005309464A (en) Method and device to eliminate noise and program
JP4152502B2 (en) Sound signal encoding device and code data editing device
WO2024034115A1 (en) Audio signal processing device, audio signal processing method, and program

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 09846119; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 09846119; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)