WO2010146624A1

WO2010146624A1 - Time-scaling method for voice signal processing device, pitch shift method for voice signal processing device, voice signal processing device, and program

Info

Publication number: WO2010146624A1
Application number: PCT/JP2009/002711
Authority: WO
Inventors: 古川善久
Original assignee: パイオニア株式会社
Priority date: 2009-06-15
Filing date: 2009-06-15
Publication date: 2010-12-23

Abstract

Disclosed is a time-scaling method for a voice signal processing device that performs time-scaling suited to the musical composition, thereby reducing the sound-quality deterioration caused by cross-fade time-scaling. The method time-scales an acquired voice signal by using a cross-fade process. The voice signal processing device executes an inhibition region detecting step (S03) of analyzing the acquired voice signal to detect an inhibition region influenced by a percussion instrument sound, and a first time-scaling step (S05) of performing, if an inhibition region is detected in the inhibition region detecting step, the time-scaling by cross-fade in a region other than the inhibition region.

Description

Audio signal processor time scaling method, audio signal processor pitch shift method, audio signal processor and program

The present invention relates to a time scaling method for an audio signal processing device that performs time scaling using a crossfade method, a pitch shift method for an audio signal processing device, an audio signal processing device, and a program.

Conventionally, a cross-fade method is known as a time scaling technique for extending and compressing the length on the time axis without changing the pitch of a digital voice waveform.
Furthermore, SRC (sampling rate conversion) processing, which changes the sampling frequency of a digital audio waveform before or after crossfade time scaling, is known. When SRC is performed with a sampling frequency change that cancels the time scaling amount of the crossfade method, and the result is played back at the sampling frequency of the original digital audio, only the pitch of the original audio waveform changes. A pitch shift (key control) that does not change the length on the time axis can thus be realized. Besides combining SRC with crossfade time scaling, an FFT (fast Fourier transform) method is also known for pitch shifting.
For example, Patent Document 1 describes a method of detecting the period of an input audio signal and shifting the signal by an integral multiple of the period to perform crossfading. This method reduces the “beat” of the sound caused by the phase shift at the time of crossfading. Note that “beat” refers to the degraded sound that occurs when melody-type music containing vibrato, tremolo, echo, or the like is crossfaded.
Further, comparing the above two methods for realizing the pitch shift, the cross-fade method generally has an advantage that the processing amount is smaller than the FFT method and the pitch shift can be realized relatively easily.

Patent Document 1: Japanese Patent No. 3395560

However, when the crossfade method is applied to pitch shift, the sound quality is inferior to that of the FFT method: time stretching causes degradation such as “sounding twice” and “beat”. Patent Document 1 reduces the “beat”, one of the sound quality deterioration phenomena of the crossfade method, by detecting one period of the audio signal and using it for the crossfade processing. Because only one period is detected, however, the mitigation effect is insufficient.
Furthermore, problems other than “beat” also arise in crossfading. For example, in rhythm-type music (music containing percussion instrument sounds), crossfading in an area where a percussion instrument sound occurs causes “sounding twice”, “petit noise”, and “rhythm disturbance”. Patent Document 1 does not consider these problems.

In view of the above problems, an object of the present invention is to provide a time scaling method for an audio signal processing device that uses the crossfade method while reducing sound quality deterioration and performs time scaling suited to the music, a pitch shift method for an audio signal processing device that uses SRC processing to realize a pitch shift suited to the music, an audio signal processing device, and a program.

The time scaling method of the audio signal processing device of the present invention is a method in which the audio signal processing device performs time scaling on an acquired audio signal using the crossfade method. The audio signal processing device executes a prohibited area detection step of analyzing the acquired audio signal to detect a prohibited area affected by percussion instrument sound, and a first time scaling step of performing, when the prohibited area is detected in the prohibited area detection step, time scaling by crossfade in an area other than the prohibited area.

The audio signal processing apparatus of the present invention includes an audio signal acquisition unit that acquires an audio signal, and a time scaling unit that performs time scaling on the acquired audio signal using the crossfade method. The time scaling unit includes prohibited area detecting means for analyzing the acquired audio signal to detect a prohibited area affected by percussion instrument sound, and first time scaling means for performing, when the prohibited area is detected by the prohibited area detecting means, time scaling by crossfade in an area other than the prohibited area.

According to these configurations, when time scaling is performed, a prohibited area (crossfade prohibited area) affected by percussion instrument sound is detected, and when the prohibited area is detected, time scaling is performed in an area other than the prohibited area. This reduces the deterioration in sound quality that is a drawback of the crossfade method. The prohibited area affected by the percussion instrument sound is a portion where the sound quality is likely to change due to time scaling (a time axis operation such as expansion or compression) by crossfading. Therefore, for rhythm-type music that includes percussion instrument sounds, performing time scaling while avoiding the prohibited areas eliminates sound quality deterioration factors such as “sounding twice”, “petit noise”, and “rhythm disturbance”.
Note that “when a prohibited area is detected” may refer to a case where a prohibited area is detected in a partial range of a song, or to a case where a predetermined number or more of prohibited areas are detected in the entire song. The user may be able to set the “partial range of music” in the former case and the “predetermined number” in the latter case.

In the time scaling method of the audio signal processing device described above, in the first time scaling step, time scaling is performed by cross fading in a region sandwiched between two prohibited regions.

According to this configuration, the position of the prohibited area on the time axis can be made the same in the original music data and in the expanded or compressed data after time scaling. Since the prohibited area is an area including a rhythm sound, keeping this position the same before and after time scaling prevents rhythm disturbance due to time scaling.

In the time scaling method of the audio signal processing device described above, in the prohibited region detection step, the acquired audio signal is band-divided by wavelet transform to generate a plurality of converted signals Bi (where i = 1, ..., N), and the prohibited region is detected using one or more converted signals Bi that are strongly influenced by the percussion instrument sound among the plurality of converted signals Bi.

According to this configuration, by performing the band division by the wavelet transform, it is possible to reduce the influence of the time delay associated with the filter processing compared to the case of performing the band division using the filter.
The number of converted signals Bi used for this detection (“one or more”) may be a predetermined number or an arbitrary number (any detectable number). When a predetermined number is used, the value may be set by the user.

In the time scaling method of the audio signal processing device described above, the audio signal processing device further executes a second time scaling step. In this step, when the prohibited region is not detected in the prohibited region detection step, the acquired audio signal is frequency-analyzed to detect a plurality of frequencies at which the amplitude is maximum, and time scaling is performed by crossfade with a time scaling amount corresponding to the time, among time candidates that are integer multiples of the period based on the frequency with the largest amplitude among the plurality of frequencies, at which the difference from an integer multiple of the period based on each of the second and subsequent frequencies is smallest.

According to this configuration, when no prohibited region affected by percussion instrument sound is detected, frequency analysis is performed to detect a plurality of frequencies (peak frequencies) at which the amplitude is maximum, and time scaling is performed with a time scaling amount based on the plurality of frequencies, so deterioration in sound quality can be reduced. The case where no prohibited area is detected means that the music is melody-type music that does not include percussion instrument sounds. In general, when crossfading is performed on melody-type music, the “beat” of the sound becomes a problem. Performing time scaling by a time corresponding to a period (or an integer multiple thereof) based on a peak frequency (for example, the one with the largest amplitude) reduces the “beat”. In this configuration, a plurality of frequencies at which the amplitude is maximum are detected, and the time scaling amount is the time, among time candidates that are integer multiples of the period based on the frequency with the largest amplitude, at which the difference from an integer multiple of the period based on each of the second and subsequent frequencies is smallest. The occurrence of “beat” can therefore be reduced further.
The effect of reducing “beat” is considered to be higher when as many “frequencies at which the amplitude is maximum” as possible are detected. The number of detections may be a predetermined number or an arbitrary number (any detectable number). When a predetermined number is used, the value may be set by the user.

Another time scaling method of the audio signal processing device of the present invention is a method in which the audio signal processing device performs time scaling on an acquired audio signal using the crossfade method. The audio signal processing device executes a percussion instrument sound determination step of analyzing the acquired audio signal to determine the presence or absence of percussion instrument sound, and a second time scaling step. In the second time scaling step, when it is determined in the percussion instrument sound determination step that no percussion instrument sound exists, a plurality of frequencies at which the amplitude is maximum are detected by frequency analysis of the acquired audio signal, and time scaling is performed by crossfade with a time scaling amount corresponding to the time, among time candidates that are integer multiples of the period based on the frequency with the largest amplitude among the plurality of frequencies, at which the difference from an integer multiple of the period based on each of the second and subsequent frequencies is smallest.

Another audio signal processing apparatus of the present invention includes an audio signal acquisition unit that acquires an audio signal, and a time scaling unit that performs time scaling on the acquired audio signal using the crossfade method. The time scaling unit includes percussion instrument sound discriminating means for analyzing the acquired audio signal to determine the presence or absence of percussion instrument sound, and second time scaling means. When the percussion instrument sound discriminating means determines that no percussion instrument sound exists, the second time scaling means detects a plurality of frequencies at which the amplitude is maximum by frequency analysis of the acquired audio signal, and performs time scaling by crossfade with a time scaling amount corresponding to the time, among time candidates that are integer multiples of the period based on the frequency with the largest amplitude among the plurality of frequencies, at which the difference from an integer multiple of the period based on each of the second and subsequent frequencies is smallest.

According to these configurations, when there is no percussion instrument sound, frequency analysis is performed to detect a plurality of frequencies at which the amplitude is maximum, and time scaling is performed with a time scaling amount based on the plurality of frequencies. This effectively reduces the “beat” of the sound, which becomes a problem when crossfading is performed on melody-type music.

In the time scaling method of the audio signal processing device described above, in the second time scaling step, a time scaling amount is calculated by an evaluation function weighted by an amplitude ratio of a plurality of frequencies.

According to this configuration, since the time scaling amount is calculated in consideration of the amplitude ratio (intensity) of a plurality of frequencies, the “beat” of the sound can be reduced more effectively.
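One way such an amplitude-weighted evaluation function could be written is shown below, purely as an illustration. The exact form is not given in this passage, so the symbols (peak amplitudes A_k, periods T_k = 1/f_k, candidate index m) and the choice of a weighted sum are assumptions.

```latex
E(m) = \sum_{k=2}^{K} \frac{A_k}{A_1} \, \min_{n \in \mathbb{Z}} \left| m T_1 - n T_k \right|,
\qquad
m^{*} = \arg\min_{m} E(m),
\qquad
\Delta t = m^{*} T_1
```

Here k = 1 denotes the frequency with the largest amplitude, so stronger secondary peaks contribute more to the choice of the time scaling amount Δt.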

In the pitch shift method for an audio signal processing device according to the present invention, the audio signal processing device executes a sampling rate conversion step of performing sampling rate conversion before or after the steps of the time scaling method of the audio signal processing device described above. The time length changes of the audio signal caused by the time scaling and by the sampling rate conversion cancel each other, so that only the pitch is changed.

The audio signal processing apparatus described above further includes sampling rate conversion means for performing sampling rate conversion. The time length changes of the audio signal caused by the time scaling and by the sampling rate conversion cancel each other, so that only the pitch is changed.

According to these configurations, sound quality deterioration due to time scaling processing can be reduced, and the audio signal can be pitch-shifted.

The program of the present invention is for causing a computer to execute each step in the time scaling method of the audio signal processing device described above. Another program of the present invention is for causing each step in the pitch shift method of the audio signal processing apparatus described above to be executed.

By executing these programs, it is possible to realize time scaling and pitch shift suitable for music while reducing deterioration in sound quality by using a crossfade method.

FIG. 1 is a block diagram of a playback apparatus according to an embodiment of the present invention and of an audio signal processing unit that is a part of the playback apparatus.
FIG. 2 is a flowchart showing the time scaling process and the pitch shift process by the audio signal processing unit.
FIG. 3 is a diagram showing a specific example of crossfade in the case of time expansion.
FIG. 4 is a diagram showing a specific example of crossfade in the case of time compression.
FIG. 5 is a diagram showing an example of band division.
FIG. 6 is a flowchart showing the prohibited area detection process by the audio signal processing unit.
FIG. 7 is a diagram showing the concept of the method for detecting a prohibited area.
FIG. 8 is a diagram showing a specific example of the prohibited areas detected for rhythm-type music.
FIG. 9 is a diagram showing an example of frequency conversion.
FIG. 10 is a diagram showing the concept of the method for determining the time scaling amount.
FIG. 11 is a diagram showing the evaluation function for determining the time scaling amount.

Hereinafter, a time scaling method for an audio signal processing device, a pitch shift method for an audio signal processing device, an audio signal processing device, and a program according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings. In the present embodiment, the case where the audio signal processing device of the present invention is applied to a playback device such as a CD player is exemplified.

FIG. 1A is a simplified block diagram of the playback device 1. As shown in the figure, the playback device 1 includes a playback unit 2, an audio signal processing unit 3 (audio signal processing device), a buffer memory 4, and an audio signal output unit 5. The playback unit 2 reads out and reproduces music from a medium such as a CD. The audio signal processing unit 3 is composed of a CPU (Central Processing Unit) or a DSP (Digital Signal Processor). It stores the audio signal reproduced by the playback unit 2 in the buffer memory 4 and performs digital signal processing on the audio signal read from the buffer memory 4. The audio signal output unit 5 outputs the audio signal processed by the audio signal processing unit 3 to the outside (such as an output device having an amplifier and a speaker).

FIG. 1B is a block diagram showing the functional configuration of the audio signal processing unit 3. The audio signal processing unit 3 includes audio signal acquisition means 9 and time scaling means 10 as its main functional components. The audio signal acquisition means 9 acquires an audio signal to be processed from the buffer memory 4. The time scaling means 10 includes prohibited area detection means 11, first time scaling means 12, and second time scaling means 13, and performs time scaling using the crossfade method. Specific examples of the crossfade will be described later with reference to FIGS. 3 and 4.

The prohibited area detection means 11 analyzes the audio signal acquired by the audio signal acquisition means 9 and detects the prohibited area N (see FIG. 7 and the like) affected by the percussion instrument sound (drum sound). Specifically, the acquired audio signal is band-divided by wavelet transform to generate a plurality of converted signals Bi (where i = 1,..., N) (a plurality of bands, see FIG. 5). The forbidden region N is detected using one or more conversion signals Bi having a large influence of the percussion instrument sound among the plurality of conversion signals Bi. This forbidden area N is a part where the sound quality is likely to change due to time scaling (time axis operation such as expansion or compression) due to crossfade (“sounding twice”, “petit noise” and “rhythm disturbance” are likely to occur). For this reason, it becomes an area where cross-fading is prohibited.

When the prohibited area N is detected by the prohibited area detection means 11, the first time scaling means 12 performs time scaling by crossfading in an area other than the prohibited area N (in this embodiment, a target area O sandwiched between two prohibited areas N; see FIG. 8). Note that “when the prohibited area N is detected by the prohibited area detection means 11” means that the acquired audio signal is rhythm-type music. That is, for rhythm-type music, the first time scaling means 12 performs time scaling in an appropriate area (target area O) while avoiding the prohibited area N affected by the percussion instrument sound.

It should be noted that the prohibited area N may be detected for a part of the musical piece, or may be detected for the entire musical piece. Further, when a predetermined number or more is detected, it may be considered that the prohibited area N has been detected. In addition, the user may be able to set “partial range” when detecting a partial range of music and “predetermined number” serving as a detection reference.

On the other hand, when the prohibited area N is not detected by the prohibited area detection means 11, the second time scaling means 13 performs frequency analysis on the acquired audio signal to detect a plurality of frequencies (peak frequencies) at which the amplitude is maximum, and performs time scaling by crossfade with a time scaling amount based on the plurality of frequencies. Specifically, among time candidates that are integer multiples of the period based on the frequency with the largest amplitude among the detected frequencies, it selects the time at which the difference from an integer multiple of the period based on each of the second and subsequent frequencies is smallest, and performs time scaling by the time scaling amount corresponding to that time. As a result, the “beat” of the sound caused by the phase shift at the time of crossfading can be reduced effectively.
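The selection described above can be sketched as follows. This is a minimal illustration and not the embodiment's implementation: the search window `t_min`/`t_max`, the function names, and the use of a plain sum of the differences across the secondary frequencies are all assumptions.

```python
import math


def nearest_multiple_error(t, period):
    """Distance from time t to the nearest integer multiple of period."""
    r = t % period
    return min(r, period - r)


def choose_time_scaling_amount(freqs_hz, t_min, t_max):
    """freqs_hz[0] is the peak with the largest amplitude; the rest follow.

    Candidates are integer multiples of the strongest peak's period inside
    the (assumed) window [t_min, t_max]. The candidate whose summed distance
    to multiples of the other periods is smallest is returned.
    """
    base_period = 1.0 / freqs_hz[0]
    other_periods = [1.0 / f for f in freqs_hz[1:]]
    best_t, best_err = None, float("inf")
    m = max(1, math.ceil(t_min / base_period))
    while m * base_period <= t_max:
        t = m * base_period
        err = sum(nearest_multiple_error(t, p) for p in other_periods)
        if err < best_err:
            best_t, best_err = t, err
        m += 1
    return best_t, best_err


# Peak frequencies quoted for FIG. 9, ordered by intensity (strongest first).
amount, error = choose_time_scaling_amount([514.1, 6461.3, 1468.3], 0.02, 0.05)
```

Restricting candidates to multiples of the strongest period keeps that component phase-aligned across the crossfade; the error term then measures how far the weaker components are from alignment.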

It should be noted that “when the prohibited area N is not detected by the prohibited area detection unit 11” means that the acquired audio signal is based on a melody-type music. That is, the second time scaling means 13 performs time scaling with an appropriate time scaling amount based on a plurality of peak frequencies obtained by frequency analysis in the case of melody music.

FIG. 1C is a block diagram of the pitch shift means 20, which is realized by adding the SRC means 21 to the time scaling means 10 of FIG. 1B. The SRC means 21 performs SRC processing that changes the sampling frequency of the digital audio waveform before or after the time scaling processing by the time scaling means 10. The time length changes of the audio signal caused by the time scaling means 10 and the SRC means 21 thus cancel each other, and the pitch shift means 20 realizes a pitch shift that changes only the pitch without changing the length on the time axis.

Next, the time scaling process by the audio signal processing unit 3 will be described with reference to the flowchart of FIG. 2A. When the audio signal processing unit 3 acquires the audio signal (S01), it divides the signal into bands and generates a plurality of converted signals Bi (S02).

Subsequently, the forbidden area N is detected using one or more conversion signals Bi having a large influence of the percussion instrument sound among the plurality of conversion signals Bi (S03). Note that S02 and S03 are processing steps performed by the prohibited area detection unit 11.

Subsequently, the music type is determined according to whether or not the prohibited area N is detected (S04). As described above, when the prohibited area N is detected, it is determined that the music is a rhythmic music, and time scaling is performed by the first time scaling means 12 (S05). On the other hand, if the prohibited area N is not detected, it is determined that the music is a melody, and time scaling is performed by the second time scaling means 13 (S06). Note that S04 to S06 are processing steps by the first time scaling means 12 and the second time scaling means 13.
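The branching of steps S02 to S06 can be summarized in a short sketch. The four callables passed in are hypothetical placeholders for the band division, prohibited area detection, and the two time scaling means; only the control flow is taken from the description.

```python
def time_scaling_process(signal, split_bands, detect_prohibited,
                         first_time_scaling, second_time_scaling):
    """Control flow of S02 to S06; the processing steps are supplied by the caller."""
    bands = split_bands(signal)                    # S02: band division
    prohibited = detect_prohibited(bands)          # S03: prohibited area mask
    if any(prohibited):                            # S04: rhythm-type music
        return first_time_scaling(signal, prohibited)   # S05
    return second_time_scaling(signal)             # S06: melody-type music
```

Passing the steps in as functions keeps the sketch independent of any particular wavelet or crossfade implementation.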

Next, the pitch shift processing by the audio signal processing unit 3 will be described with reference to the flowchart of FIG. 2B. In this flowchart, an SRC process (S22) is added before the time scaling process (S10). When the audio signal processing unit 3 acquires the audio signal (S21), it changes the pitch by sampling rate conversion according to the pitch shift amount (S22). Because this process expands or contracts the duration along with the pitch change, the time scaling process (S10, that is, the steps of FIG. 2A other than S01) returns the duration to that of the original audio signal. As a result, only the pitch is shifted, without any expansion or contraction in time.

Note that the pitch shift amount may be an amount specified by a user operation, or may be an amount automatically calculated according to the pitch of another song. In the flowchart of FIG. 2B, the SRC process (S22) is performed before the time scaling process (S10), but the SRC process (S22) may instead be performed after the time scaling process (S10).
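As a rough numerical illustration of how the two stages cancel, the sketch below computes the length change caused by SRC and the compensating time scaling ratio. Expressing the pitch shift amount in equal-tempered semitones, and the function name, are assumptions made for this example.

```python
def pitch_shift_ratios(semitones):
    """Return (src_length_ratio, time_scale_ratio) for a pitch shift.

    Assumption: the pitch shift amount is given in equal-tempered semitones.
    """
    pitch_ratio = 2.0 ** (semitones / 12.0)
    # Resampling so that playback at the original rate raises the pitch by
    # pitch_ratio changes the signal length by a factor of 1 / pitch_ratio.
    src_length_ratio = 1.0 / pitch_ratio
    # Crossfade time scaling must change the length by pitch_ratio to cancel it.
    time_scale_ratio = pitch_ratio
    return src_length_ratio, time_scale_ratio


src_ratio, ts_ratio = pitch_shift_ratios(12.0)  # one octave up
```

The product of the two ratios is 1, so the order of the two stages does not affect the final duration.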

Next, specific examples of crossfade will be described with reference to FIGS. 3 and 4. FIG. 3 is a diagram showing specific examples in which the original music data is expanded by crossfading.

“Crossfade Example 1” shown in FIG. 3A is an example in which crossfading uses, in the original music data, a signal A reproduced immediately before the crossfade position CFP and a signal B reproduced immediately after it (where the reproduction time of signal B = the reproduction time of signal A). In this case, in the expanded data, the crossfade is performed after the reproduction of the signal A. That is, the signal B is faded out (gradually reduced from the original volume to 0) while the signal A is faded in (gradually increased from 0 to the original volume). This crossfade portion is the increase (time scaling amount). Thereafter, the signal B is reproduced.

“Crossfade Example 2” shown in FIG. 3B shows a case where the increase is more than the crossfade portion. As shown in the figure, the original music data is reproduced in the order of signal A, signal B, and signal C (where the reproduction time of signal C = the reproduction time of signal A). In the expanded data, after the signals A and B are reproduced and the crossfade position CFP is reached, the crossfade (fade-out of the signal C and fade-in of the signal A) is performed, and then the signals B and C are reproduced in that order. The increase (time scaling amount) in this case is “crossfade portion + reproduction time of signal B”.

On the other hand, FIG. 4 is a diagram showing specific examples in which the original music data is compressed by crossfading. “Crossfade Example 3” shown in FIG. 4A is an example in which crossfading uses, in the original music data, the signal A and the signal B reproduced immediately after the crossfade position CFP (where the reproduction time of signal B = the reproduction time of signal A). In this case, in the compressed data, only the crossfade portion formed by the fade-out of the signal A and the fade-in of the signal B is reproduced. That is, this crossfade portion (= reproduction time of signal B) is the decrease (time scaling amount).

Also, “Crossfade Example 4” shown in FIG. 4B shows a case where the decrease is not only the crossfade portion. For example, as shown in the figure, when the original music data is reproduced in the order of signal A, signal B, signal C (where reproduction time of signal C = reproduction time of signal A) immediately after the crossfade position CFP. The compressed data is reproduced only in the crossfade portion by the fade-out of the signal A and the fade-in of the signal C. The decrease (time scaling amount) in this case is “reproduction time of signal B + reproduction time of signal C”.
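The four crossfade examples can be sketched on plain sample lists as follows. The linear gain ramp and the function names are assumptions; the text only states that the volume changes gradually.

```python
def crossfade(fade_out, fade_in):
    """Mix two equal-length segments with linear gains (assumed ramp shape)."""
    if len(fade_out) != len(fade_in):
        raise ValueError("segments must have the same length")
    n = len(fade_out)
    return [(1 - k / n) * fade_out[k] + (k / n) * fade_in[k] for k in range(n)]


def expand_example1(a, b):
    # A, then (B fading out + A fading in), then B: grows by len(a).
    return a + crossfade(b, a) + b


def expand_example2(a, b, c):
    # A, B, then (C fading out + A fading in), then B, C: grows by len(a) + len(b).
    return a + b + crossfade(c, a) + b + c


def compress_example3(a, b):
    # Only (A fading out + B fading in) remains: shrinks by len(b).
    return crossfade(a, b)


def compress_example4(a, b, c):
    # Only (A fading out + C fading in) remains: shrinks by len(b) + len(c).
    return crossfade(a, c)


a, b, c = [1.0] * 4, [0.5] * 6, [0.0] * 4
growth2 = len(expand_example2(a, b, c)) - (len(a) + len(b) + len(c))
shrink4 = (len(a) + len(b) + len(c)) - len(compress_example4(a, b, c))
```

In Examples 2 and 4 the length of signal B adds to the time scaling amount, which is how the amount can be tuned beyond the length of the crossfade portion itself.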

Note that, both when the pitch is raised (when the tempo is raised) and when the pitch is lowered (when the tempo is lowered) by SRC, the playback time expands or contracts in proportion to the pitch change. Returning the playback time to the original by crossfade time scaling realizes a pitch shift in which the performance time is the same as the original music and only the pitch changes. The time scaling process of this embodiment can be applied to any of the examples shown in FIGS. 3 and 4.

In other words, when the first time scaling means 12 sets the crossfade position CFP so as to avoid the prohibited area N, the occurrence of “sounding twice”, “petit noise”, “rhythm disturbance”, and the like can be reduced, particularly for rhythm-type music. Further, when the second time scaling means 13 sets the time scaling amount corresponding to the above “increase” or “decrease” to an appropriate amount (time), the occurrence of “beat” can be reduced, particularly for melody-type music.

Next, the detection of the prohibited area N will be described in detail with reference to FIGS. 5 to 8. FIG. 5 is a diagram illustrating an example of band division using DWT (Discrete Wavelet Transform). FIG. 5A shows the original sound. Here, a drum sound in which a bass drum and a hi-hat are played at the same time and then only the hi-hat is played is illustrated.

FIG. 5B shows the result of band division of the original sound shown in FIG. 5A by the DWT of about 33 times and the IDWT (inverse discrete wavelet transform) of about 330 times. As shown in FIG. 5B, in the present embodiment, ten converted signals Bi (B1 to B10) are generated by dividing into ten frequency bands (band 1 to band 10).
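For readers unfamiliar with wavelet band division, the following sketch splits a signal into octave bands with a Haar wavelet. The Haar wavelet and the function names are assumptions, since the text does not name the wavelet used. Unlike the embodiment, which also applies the inverse transform, the sketch stops at the coefficients.

```python
def haar_step(x):
    """One Haar analysis step: (approximation, detail), each half the length.

    A trailing odd sample, if any, is dropped.
    """
    s = 2 ** -0.5
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x) - 1, 2)]
    return approx, detail


def haar_bands(x, levels):
    """Return [detail_1, ..., detail_levels, approx].

    detail_1 covers the highest band; approx covers the lowest.
    """
    bands = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands


bands = haar_bands([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0], levels=2)
```

With nine levels this scheme yields ten bands, which would match the count B1 to B10, although whether the embodiment divides the bands this way is not stated.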

FIG. 6 is a flowchart showing the prohibited area detection processing by the audio signal processing unit 3 (prohibited area detection means 11). The audio signal processing unit 3 detects and holds the peak positions affected by the percussion instrument sound for the ten converted signals Bi (S11), and performs binarization processing (S12). The range determined as “1” by the binarization is provisionally determined as the prohibited area N, and then gap filling is performed (S13). In gap filling, when the area sandwiched between two prohibited areas N (the area where crossfading is possible) is too small (equal to or less than a predetermined amount), the two prohibited areas N are combined into one prohibited area N. Further, the prohibited areas N detected from the respective converted signals Bi are combined (OR operation) (S14), and the prohibited area N in which crossfading should be prohibited is finally determined (S15).
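Steps S11 to S15 can be sketched on per-sample band signals as follows. The sliding-maximum form of the peak hold, the fixed threshold, and all parameter names are assumptions made for illustration.

```python
def peak_hold(band, hold):
    """S11 (assumed form): keep the largest magnitude of the last `hold` samples."""
    mags = [abs(x) for x in band]
    return [max(mags[max(0, i - hold + 1): i + 1]) for i in range(len(mags))]


def binarize(envelope, threshold):
    """S12: 1 where the held envelope reaches the threshold, else 0."""
    return [1 if v >= threshold else 0 for v in envelope]


def fill_gaps(mask, min_gap):
    """S13: merge two prohibited runs when the gap between them is <= min_gap."""
    out = list(mask)
    last_one = None
    for i, v in enumerate(mask):
        if v:
            if last_one is not None and 0 < i - last_one - 1 <= min_gap:
                for j in range(last_one + 1, i):
                    out[j] = 1
            last_one = i
    return out


def detect_prohibited(bands, threshold, hold, min_gap):
    """S14/S15: OR the per-band masks into the final prohibited area mask."""
    masks = [fill_gaps(binarize(peak_hold(b, hold), threshold), min_gap)
             for b in bands]
    return [1 if any(col) else 0 for col in zip(*masks)]


band2 = [0, 0.9, 0.1, 0, 0, 0.8, 0, 0, 0, 0, 0, 0]
band7 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0.7, 0, 0]
mask = detect_prohibited([band2, band7], threshold=0.5, hold=2, min_gap=2)
```

Gap filling is applied per band before the OR, following the order of the flowchart, so a short gap in one band is closed even when the other band is silent there.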

FIG. 7 is a diagram showing a concept of a method for detecting the prohibited area N. The same figure (a) has shown the original sound like Fig.5 (a). Here, the dotted line frame D1 is a hitting sound of a hyatt (kick hitting sound) and is considered to correspond to a beat position. The dotted frame D2 is a hitting sound of Hyatt and is considered to correspond to the back beat position. Therefore, the prohibited area N is determined based on the band (in this embodiment, band 2 and band 7, see FIG. 5B) where the sound is loud in the areas (detection areas) corresponding to the dotted frames D1 and D2. I will do it.

FIG. 5B shows a state in which the forbidden area N is provisionally determined from two converted signals Bi that are most influenced by the percussion instrument sound, that is, the band where the sound is loud in the areas corresponding to the dotted frames D1 and D2. Show. Further, FIG. 5C shows a state in which the prohibition area N of the two bands that have been temporarily determined is synthesized and the prohibition area N is finally determined.

In the example shown in FIG. 7, the forbidden area N detected from the band 7 is completely included in the forbidden area N detected from the band 2, so detection of the band 7 seems unnecessary, but there is a hyatt. Since there may be a case where it is not possible, it is preferable to determine the prohibited region N based on a plurality of bands as described above.

In the above example, the prohibited area N is provisionally determined from two converted signals Bi as the converted signals Bi most strongly affected by the percussion instrument sound, but the number of converted signals Bi is not limited to two. The number of converted signals Bi used for detection may be a predetermined number or an arbitrary number (as many as can be detected). When a predetermined number is used, its value may be set by the user.

FIG. 8 is a diagram showing a specific example of the prohibited areas N detected for a rhythmic musical piece. In rhythmic music the drum sounds repeat, so a prohibited area N is detected each time a beat appears. For this reason, the first time scaling means 12 of this embodiment performs time scaling in the target area O sandwiched between prohibited areas N, thereby reducing the sound quality deterioration caused by the crossfade.
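As a rough illustration of time scaling confined to a target area O, the sketch below first extracts the areas sandwiched between prohibited areas from a 0/1 mask and then shortens the signal with a linear crossfade inside one of them. The function names, the half-open `(start, end)` convention, and the linear fade shape are assumptions made for this example; the text does not prescribe them.

```python
from typing import List, Tuple


def target_areas(mask: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs of 0 that lie between two runs of 1."""
    areas = []
    start = None             # start of the current run of 0s, if any
    seen_prohibited = False  # becomes True after the first prohibited sample
    for i, v in enumerate(mask):
        if v == 1:
            if start is not None and seen_prohibited:
                areas.append((start, i))
            start = None
            seen_prohibited = True
        elif start is None:
            start = i
    return areas


def crossfade_compress(x: List[float], start: int, end: int,
                       shift: int) -> List[float]:
    """Shorten x by `shift` samples with a linear crossfade inside [start, end)."""
    if not 0 < shift < end - start:
        raise ValueError("shift must be positive and smaller than the target area")
    length = end - start - shift
    head = x[start:start + length]   # fades out
    tail = x[start + shift:end]      # fades in, same length as head
    faded = []
    for k in range(length):
        w = k / (length - 1) if length > 1 else 1.0
        faded.append((1.0 - w) * head[k] + w * tail[k])
    # Samples outside the target area, including prohibited areas, are untouched.
    return x[:start] + faded + x[end:]


mask = [0, 1, 1, 0, 0, 0, 0, 1, 0]
areas = target_areas(mask)
ramp = [float(n) for n in range(9)]
shortened = crossfade_compress(ramp, areas[0][0], areas[0][1], shift=2)
```

Because only samples inside the target area are mixed, the attack of each percussion hit is passed through unchanged, which is the property the first time scaling means relies on.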

Next, details of the second time scaling means 13 will be described with reference to FIGS. 9 to 11. The second time scaling means 13 eliminates the “beat” with respect to the peak frequency by time scaling shifted by an integral multiple of the peak period.

FIG. 9 shows FFT (fast Fourier transform) data, with frequency on the horizontal axis and intensity on the vertical axis, for one of a large number of detected samples. As described above, when no prohibited area N is detected in the prohibited area detection processing, the second time scaling means 13 performs frequency analysis on the acquired audio signal and thereby detects a plurality of frequencies at which the amplitude is maximal (peak frequencies). In the present embodiment, the three such frequencies with the largest amplitudes (highest intensities) are detected. In the illustrated example, 514.1 Hz with an intensity of −20.1 dB, 1468.3 Hz with an intensity of −28.9 dB, and 6461.3 Hz with an intensity of −27.8 dB are detected as peak frequency candidates.
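Peak frequency detection of this kind can be sketched as below. To keep the example self-contained it uses a naive discrete Fourier transform from the standard library rather than an FFT, and it treats a peak as a bin whose magnitude exceeds both neighbours; both choices, along with the function names and the synthetic test signal, are assumptions for illustration only.

```python
import cmath
import math
from typing import List, Tuple


def dft_magnitudes(x: List[float]) -> List[float]:
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) transform, not an FFT)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def peak_frequencies(x: List[float], sample_rate: float,
                     count: int = 3) -> List[Tuple[float, float]]:
    """Return up to `count` (frequency, magnitude) pairs, largest magnitude first."""
    mags = dft_magnitudes(x)
    n = len(x)
    # A bin is a local maximum when it exceeds both neighbouring bins.
    peaks = [
        (mags[k], k)
        for k in range(1, len(mags) - 1)
        if mags[k] > mags[k - 1] and mags[k] > mags[k + 1]
    ]
    peaks.sort(reverse=True)
    return [(k * sample_rate / n, m) for m, k in peaks[:count]]


# Synthetic signal with three sinusoids of decreasing amplitude.
N = 64
SAMPLE_RATE = 64.0
signal = [
    math.sin(2 * math.pi * 5 * t / N)
    + 0.5 * math.sin(2 * math.pi * 12 * t / N)
    + 0.25 * math.sin(2 * math.pi * 20 * t / N)
    for t in range(N)
]
peaks = peak_frequencies(signal, SAMPLE_RATE)
```

The first entry of `peaks` plays the role of ω₁, the frequency with the largest amplitude, and the remaining entries play the roles of ω₂ and ω₃.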

Note that the number of “frequencies at which the amplitude is maximal” is not limited to three; a predetermined number of four or more may be detected, or an arbitrary number (as many as can be detected). When a predetermined number is used, its value may be set by the user. It is considered, however, that the more such frequencies are detected, the greater the effect of reducing the sound quality deterioration caused by crossfading.

FIG. 10 is a diagram illustrating the concept of the method for determining the time scaling amount. As shown in FIG. 10A, when ω₁, ω₂, and ω₃ are detected as peak frequencies, the peak periods T₁ = 1/ω₁, T₂ = 1/ω₂, and T₃ = 1/ω₃ are obtained from them.

Subsequently, an appropriate time scaling amount for one operation is determined based on these three peak periods T₁, T₂, and T₃. Specifically, the time that is an integer multiple of the period T₁ and has the smallest error from integer multiples of the periods T₂ and T₃ is taken as the time scaling amount for one operation. Letting T_CF denote the time scaling amount for one operation determined in this way and α denote the expansion/contraction ratio of the time scaling, one time scaling operation is performed per time T = T_CF / |1 − α| of the original music. Since T_CF changes every time the peak frequencies are calculated, the time scaling period T is also variable.
In addition, when a pitch shift is performed in combination with SRC, the expansion/contraction ratio α of the time scaling is set to α = 1/β, the reciprocal of the pitch change rate β of the target pitch shift.
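The two relations in this passage, T = T_CF / |1 − α| and α = 1/β, can be checked with a short calculation. The function names and the numeric values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def stretch_ratio_for_pitch(beta: float) -> float:
    """Expansion/contraction ratio alpha = 1 / beta for a pitch change rate beta."""
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    return 1.0 / beta


def scaling_interval(t_cf: float, alpha: float) -> float:
    """Original-music time T per time scaling operation: T = T_CF / |1 - alpha|."""
    if alpha == 1.0:
        raise ValueError("alpha = 1 means no time scaling is required")
    return t_cf / abs(1.0 - alpha)


# Hypothetical values: a 30 ms time scaling amount and a pitch change rate of 0.8.
alpha = stretch_ratio_for_pitch(0.8)     # 1 / 0.8 = 1.25
interval = scaling_interval(0.030, 1.25)  # 0.030 / 0.25 = 0.12 s
```

Under these assumed values, one crossfade of 30 ms would be needed for roughly every 120 ms of the original music; a ratio closer to 1 lengthens that interval.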

That is, as shown in FIG. 10B, blocks corresponding to the periods T₁, T₂, and T₃ are laid out period by period, and the positions (times) corresponding to integer multiples of the period T₁, counted from the start point, are denoted L1 to L9. Of these, L4 to L9 lie in the range in which crossfading is possible, from the crossfade upper limit position (20 ms in the illustrated example) to the lower limit position (50 ms in the illustrated example). Among L4 to L9, the position with the smallest error from the positions corresponding to integer multiples of the period T₂ and of the period T₃ is found to be L6. Since L6 corresponds to six times the period T₁, the appropriate time scaling amount T_CF can be calculated as 6T₁ = 6/ω₁. In practice, however, the time scaling amount T_CF is calculated using the evaluation function described below.

FIG. 11 is a diagram illustrating the evaluation function for determining the time scaling amount T_CF. As shown in the figure, this evaluation function is weighted by the ratio of the peak amplitudes Aₙ (the amplitudes at the peak frequencies). The CF upper limit position t₁ and the CF lower limit position t₂ are values obtained empirically. The second time scaling means 13 of the present embodiment thus finds the value of i that minimizes the evaluation function and obtains the time scaling amount T_CF from that value; that is, T_CF = iT₁ = i/ω₁.
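The evaluation function itself appears only in FIG. 11 and is not reproduced in the text, so the following is a sketch under an assumed form: for each candidate i with iT₁ inside the crossfade range, it sums the distance from iT₁ to the nearest integer multiple of each remaining period, weighted by that peak's amplitude relative to the largest peak. The names `choose_t_cf`, `t_min`, and `t_max` and the example frequencies are hypothetical.

```python
import math
from typing import Sequence, Tuple


def distance_to_multiple(t: float, period: float) -> float:
    """Distance from time t to the nearest integer multiple of `period`."""
    return abs(t - round(t / period) * period)


def choose_t_cf(freqs: Sequence[float], amps: Sequence[float],
                t_min: float, t_max: float) -> Tuple[int, float]:
    """Pick i so that i * T1 in [t_min, t_max] minimises the assumed cost.

    freqs[0] and amps[0] belong to the peak with the largest amplitude.
    """
    t1 = 1.0 / freqs[0]
    # Small tolerances keep boundary candidates despite floating-point rounding.
    i_lo = max(1, math.ceil(t_min / t1 - 1e-9))
    i_hi = math.floor(t_max / t1 + 1e-9)
    best_i, best_cost = 0, math.inf
    for i in range(i_lo, i_hi + 1):
        t = i * t1
        # Assumed evaluation function: amplitude-ratio-weighted phase mismatch.
        cost = sum(
            (a / amps[0]) * distance_to_multiple(t, 1.0 / f)
            for f, a in zip(freqs[1:], amps[1:])
        )
        if cost < best_cost:
            best_i, best_cost = i, cost
    if best_i == 0:
        raise ValueError("no integer multiple of T1 lies in the crossfade range")
    return best_i, best_i * t1


# Hypothetical peaks at 100, 125 and 175 Hz; crossfade range 20 ms to 50 ms.
best_i, t_cf = choose_t_cf([100.0, 125.0, 175.0], [1.0, 0.5, 0.4], 0.020, 0.050)
```

With these assumed peaks, 40 ms is simultaneously 4 periods of 100 Hz, 5 periods of 125 Hz, and 7 periods of 175 Hz, so i = 4 gives a cost near zero and is selected.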

As described above, according to the present embodiment, when time scaling is performed using the crossfade method, a prohibited area N (crossfade prohibited area) affected by the percussion instrument sound is detected, and time scaling appropriate to the type of music, as determined by the presence or absence of the prohibited area N, can be performed.

That is, when the prohibited area N is detected, the music can be determined to be rhythmic music, and in this case time scaling is performed while avoiding the prohibited area N. This makes it possible to eliminate problems such as “double hits”, “click noise”, and “rhythm disturbance”. In particular, in this embodiment, since time scaling is performed in the target area O sandwiched between two prohibited areas N, the relative positions of the prohibited areas N on the time axis keep the same ratio as before time scaling. Furthermore, in a pitch shift in which the pitch is changed by SRC before time scaling and the resulting expansion or contraction in time is then restored to the length of the original music by time scaling, the positions of the prohibited areas N on the time axis can be made the same in the original music data and in the pitch-shifted data.

On the other hand, when the prohibited area N is not detected, the music can be determined to be melodic music. In this case, a plurality of peak frequencies at which the amplitude is maximal are detected, and time scaling is performed with the time scaling amount corresponding to the time that, among the candidate times that are integer multiples of the period of the peak frequency with the largest amplitude, differs least from integer multiples of the periods of the peak frequencies with the second-largest and subsequent amplitudes. This makes it possible to reduce the “beat” of the sound caused by phase shifts at the frequencies with the second-largest and subsequent amplitudes. Furthermore, since the evaluation function for calculating the time scaling amount is weighted according to the amplitude ratio (intensity) of the plurality of peak frequencies, the “beat” of the sound can be reduced even more effectively.
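The branching summarized in the last two paragraphs amounts to a simple dispatch on whether any prohibited area was found. The sketch below assumes the detection result is available as a 0/1 mask; the function name and the returned labels are hypothetical.

```python
from typing import List, Tuple


def select_time_scaling(mask: List[int]) -> Tuple[str, str]:
    """Return (music type, time scaling means) from a prohibited-area mask."""
    if any(v == 1 for v in mask):
        # At least one prohibited area: crossfade only outside those areas.
        return ("rhythmic", "first time scaling means")
    # No prohibited area: choose the amount from the peak periods instead.
    return ("melodic", "second time scaling means")


with_drums = select_time_scaling([0, 1, 1, 0, 0, 1])
without_drums = select_time_scaling([0, 0, 0, 0])
```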

Further, since the band division used in detecting the prohibited area N is performed by DWT, the influence of the time delay associated with filter processing can be reduced compared with the case where the band division is performed using filters.
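For readers unfamiliar with band division by DWT, the sketch below shows a multi-level decomposition with the Haar wavelet. The choice of the Haar wavelet is an assumption made for brevity; the text does not state which wavelet the embodiment uses, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def haar_step(x: List[float]) -> Tuple[List[float], List[float]]:
    """One Haar DWT level: (approximation, detail); an odd last sample is dropped."""
    s = math.sqrt(2.0)
    pairs = len(x) // 2
    approx = [(x[2 * k] + x[2 * k + 1]) / s for k in range(pairs)]
    detail = [(x[2 * k] - x[2 * k + 1]) / s for k in range(pairs)]
    return approx, detail


def haar_bands(x: List[float], levels: int) -> List[List[float]]:
    """Split x into detail bands (highest band first) plus a final approximation."""
    bands = []
    approx = list(x)
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands


# A constant signal has no detail content at any level.
bands = haar_bands([1.0, 1.0, 1.0, 1.0], levels=2)
```

Each level halves the number of samples, so the bands are octave-spaced; under this assumed scheme, nine levels would produce ten bands.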

In the above embodiment, the audio signal processing unit 3 includes both the first time scaling means 12 and the second time scaling means 13, but it may include only one of them. Further, when the audio signal processing unit 3 is configured to include only the second time scaling means 13, the presence or absence of a percussion instrument sound may simply be determined instead of detecting the prohibited area N (percussion instrument sound determination means). In this case, when it is determined that no percussion instrument sound is present, time scaling is performed by the second time scaling means 13.

Further, in the above embodiment, the prohibited area detection means 11 detects the prohibited area N while analyzing the audio signal written to the buffer memory 4 as the reproduction unit 2 reproduces it, but the prohibited area N may instead be detected by reading data analyzed in advance. That is, time scaling may be performed in real time while the music is being reproduced, or the whole or a part of the music may be time-scaled using data analyzed in advance.

Further, it is possible to provide each component of the audio signal processing unit 3 shown above as a program. Further, the program can be provided by being stored in various recording media (CD-ROM, flash memory, etc.). That is, a program for causing a computer to function as each component of the audio signal processing unit 3 and a recording medium on which the program is recorded are also included in the scope of the right of the present invention.

In the above-described embodiment, the case where the audio signal processing unit 3 is applied to the playback device 1 is exemplified. However, the audio signal processing unit 3 may be applied to DJ equipment such as a mixer device, various electronic musical instruments, and a computer (PC application). Moreover, application to a speech processing device having a function of changing the pitch, such as karaoke, a voice changer, and a speech synthesizer is also useful. Furthermore, it is also possible to apply only time scaling, such as when changing only the audio time axis length without changing the pitch during double-speed playback of a video (DVD) recorder. Other modifications can be made as appropriate without departing from the scope of the present invention.

DESCRIPTION OF SYMBOLS: 1 ... playback device; 2 ... reproduction unit; 3 ... audio signal processing unit; 4 ... buffer memory; 5 ... audio signal output unit; 9 ... audio signal acquisition means; 10 ... time scaling means; 11 ... prohibited area detection means; 12 ... first time scaling means; 13 ... second time scaling means; 20 ... pitch shift means; 21 ... SRC means; CFP ... crossfade position; N ... prohibited area; O ... target area

Claims

1. A time scaling method for an audio signal processing device that performs time scaling on an acquired audio signal using a cross-fade method, wherein
the audio signal processing device executes:
a prohibited area detecting step of analyzing the acquired audio signal and detecting a prohibited area affected by a percussion instrument sound; and
a first time scaling step of performing, when the prohibited area is detected in the prohibited area detecting step, time scaling by cross-fading in an area other than the prohibited area.
2. The time scaling method for an audio signal processing device according to claim 1, wherein, in the first time scaling step, time scaling is performed by cross-fading in an area sandwiched between two of the prohibited areas.
3. The time scaling method for an audio signal processing device according to claim 1, wherein, in the prohibited area detecting step, the acquired audio signal is band-divided by wavelet transform to generate a plurality of converted signals Bi (where i = 1, ..., N), and the prohibited area is detected using one or more converted signals Bi, among the plurality of converted signals Bi, that are strongly affected by the percussion instrument sound.
4. The time scaling method for an audio signal processing device according to claim 1, wherein the audio signal processing device further executes
a second time scaling step of, when the prohibited area is not detected in the prohibited area detecting step, performing frequency analysis on the acquired audio signal to detect a plurality of frequencies at which the amplitude is maximal, and performing time scaling with a time scaling amount corresponding to the time that, among candidate times that are integer multiples of the period of the frequency with the largest amplitude among the plurality of frequencies, differs least from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes.
5. A time scaling method for an audio signal processing device that performs time scaling on an acquired audio signal using a cross-fade method, wherein
the audio signal processing device executes:
a percussion instrument sound determination step of analyzing the acquired audio signal and determining the presence or absence of a percussion instrument sound; and
a second time scaling step of, when it is determined in the percussion instrument sound determination step that no percussion instrument sound is present, performing frequency analysis on the acquired audio signal to detect a plurality of frequencies at which the amplitude is maximal, and performing time scaling by cross-fading with a time scaling amount corresponding to the time that, among candidate times that are integer multiples of the period of the frequency with the largest amplitude among the plurality of frequencies, differs least from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes.
6. The time scaling method for an audio signal processing device according to claim 5, wherein, in the second time scaling step, the time scaling amount is calculated by an evaluation function weighted by an amplitude ratio of the plurality of frequencies.
7. A pitch shift method for an audio signal processing device, wherein the audio signal processing device executes:
each step of the time scaling method for an audio signal processing device according to any one of claims 1 to 6; and
a sampling rate conversion step of performing sampling rate conversion before or after those steps,
wherein, in the sampling rate conversion step, the changes in the time length of the audio signal caused by the time scaling and by the sampling rate conversion cancel each other, so that only the pitch is changed.
8. An audio signal processing device comprising:
audio signal acquisition means for acquiring an audio signal; and
time scaling means for performing time scaling on the acquired audio signal using a cross-fade method,
wherein the time scaling means includes:
prohibited area detecting means for analyzing the acquired audio signal and detecting a prohibited area affected by a percussion instrument sound; and
first time scaling means for performing, when the prohibited area is detected by the prohibited area detecting means, time scaling by cross-fading in an area other than the prohibited area.
9. An audio signal processing device comprising:
audio signal acquisition means for acquiring an audio signal; and
time scaling means for performing time scaling on the acquired audio signal using a cross-fade method,
wherein the time scaling means includes:
percussion instrument sound determination means for analyzing the acquired audio signal and determining the presence or absence of a percussion instrument sound; and
second time scaling means for, when the percussion instrument sound determination means determines that no percussion instrument sound is present, performing frequency analysis on the acquired audio signal to detect a plurality of frequencies at which the amplitude is maximal, and performing time scaling by cross-fading with a time scaling amount corresponding to the time that, among candidate times that are integer multiples of the period of the frequency with the largest amplitude among the plurality of frequencies, differs least from integer multiples of the periods of the frequencies with the second-largest and subsequent amplitudes.
10. The audio signal processing device according to claim 8 or 9, further comprising sampling rate conversion means for performing sampling rate conversion,
wherein the sampling rate conversion means causes the changes in the time length of the audio signal due to the time scaling and due to the sampling rate conversion to cancel each other, so that only the pitch is changed.
11. A program for causing a computer to execute each step in the time scaling method of the audio signal processing device according to any one of claims 1 to 6.
12. A program for causing a computer to execute each step in the pitch shift method of the audio signal processing device according to claim 7.