WO2024034118A1

WO2024034118A1 - Audio signal processing device, audio signal processing method, and program

Info

Publication number: WO2024034118A1
Application number: PCT/JP2022/030733
Authority: WO
Inventors: 肇吉野
Original assignee: ＡｌｐｈａＴｈｅｔａ株式会社
Priority date: 2022-08-12
Filing date: 2022-08-12
Publication date: 2024-02-15

Abstract

The present invention provides an audio signal processing device including an audio analysis unit that generates a representative waveform of a second part in a musical piece including a first part and the second part that can be separated in terms of audio. The audio analysis unit generates the representative waveform by synchronously adding waveforms in a sound production section of the second part in an audio signal of the musical piece.

Description

Audio signal processing device, audio signal processing method and program

The present invention relates to an audio signal processing device, an audio signal processing method, and a program.

Techniques for extracting arbitrary musical sounds from a musical piece are known. For example, Patent Document 1 describes an audio signal processing device including: a sounding position information acquisition unit that acquires sounding position information indicating the sounding position of an arbitrary instrument included in a song; a search section specifying unit that specifies, based on the sounding position information, a search section for searching for the sounding section of the arbitrary instrument sound; an extraction unit that extracts an amplitude value at a predetermined position in the search section; and a processing unit that processes the audio data included in the search section based on the amplitude value extracted by the extraction unit.

Patent Document 1: Patent No. 6263383

In the technology described in Patent Document 1, a spectrogram of the input musical piece is obtained, instrument sounds are distinguished on the spectrogram data, and separation is performed on the frequency axis. However, in a spectrogram, time resolution and frequency resolution trade off against each other as a consequence of the DFT (Discrete Fourier Transform), so high time resolution and high frequency resolution cannot be achieved simultaneously. Furthermore, a spectrogram holds only power information for each time and frequency, so phase information cannot be used. Because of these fundamental constraints, instrument sounds that occupy the same frequencies cannot be distinguished from one another, and an attempt to remove a specific instrument sound from a musical piece may also remove sounds that should be kept, which may degrade the sound quality.
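The trade-off can be made concrete with a little arithmetic. The following sketch assumes a 44.1 kHz sample rate and a few DFT frame lengths; both are illustrative values that do not come from this document, and the helper name `dft_resolution` is hypothetical.

```python
def dft_resolution(sample_rate_hz, frame_length):
    """Return (time span of one frame in ms, spacing of frequency bins in Hz)."""
    time_span_ms = 1000.0 * frame_length / sample_rate_hz
    bin_spacing_hz = sample_rate_hz / frame_length
    return time_span_ms, bin_spacing_hz


SAMPLE_RATE_HZ = 44100  # assumed value, not taken from the document

for frame_length in (256, 1024, 4096):
    span_ms, spacing_hz = dft_resolution(SAMPLE_RATE_HZ, frame_length)
    print(f"N={frame_length}: frame spans {span_ms:.1f} ms, bins are {spacing_hz:.1f} Hz apart")
```

Under these assumptions the product of the two quantities is constant, so shortening the frame to localize an onset more tightly widens the frequency bins by the same factor.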

Note that Patent Document 1 also describes synchronous addition processing, but its purpose is to average the results of synchronous subtraction of spectrograms in order to reduce errors in the spectrogram shape; it does not synchronously add the waveforms of the audio data. Because the example of Patent Document 1 does not use the phase information of the audio data, the synchronous addition described there does not improve the S/N ratio (S: the instrument sound to be extracted, N: the other instrument sounds).

For example, functions and products for DJs manage latency strictly and are required to minimize deviations in the time direction. The time resolution, which affects the final sound generation position, is therefore set high. In that case, the frequency resolution cannot be increased because of the fundamental limitation of the spectrogram described above. Since the above technique relies almost entirely on frequency differences to distinguish between instruments, adverse effects such as degraded sound quality are likely when the frequency resolution cannot be increased.

Therefore, an object of the present invention is to provide an audio signal processing device, an audio signal processing method, and a program that can achieve high sound quality and high separation performance by separating instrument sounds on the time axis.

[1] An audio signal processing device comprising: an audio analysis unit that generates a representative waveform of a second part in a song including a first part and the second part that are separable in terms of audio; and a position estimation unit that performs positioning using a cross-correlation function, wherein the audio analysis unit generates the representative waveform by synchronously adding, at the positions estimated by the position estimation unit, waveforms of sound generation sections of the second part in an audio signal of the song.
[2] The audio signal processing device according to [1], wherein the position estimation unit calculates the cross-correlation function between a band representative waveform of the second part, extracted from a band audio signal obtained by extracting a predetermined frequency band from the audio signal of the song, and the waveform of the band audio signal in a section of a length corresponding to the band representative waveform, each expressed as a function of time, and detects the section in which the peak of the cross-correlation function appears as a sound generation position of the second part, and the audio analysis unit generates the representative waveform by synchronously adding waveforms of the audio signal of the song with reference to the sound generation position.
[3] The audio signal processing device according to [2], wherein the position estimation unit executes a first sound generation position detection process of detecting a first sound generation position using a first band audio signal obtained by extracting a first frequency band from the audio signal of the song, and a second sound generation position detection process of detecting, with reference to the first sound generation position, a second sound generation position using a second band audio signal obtained by extracting a second frequency band from the audio signal of the song, and the audio analysis unit generates the representative waveform by synchronously adding waveforms of the audio signal of the song with reference to the second sound generation position.
[4] The audio signal processing device according to [3], wherein the second part consists of a kick sound, the first frequency band is the body rumble band of the kick sound, and the second frequency band is the attack band of the kick sound.
[5] The audio signal processing device according to [1], wherein the position estimation unit calculates the cross-correlation function between a temporary representative waveform extracted from the audio signal of the song according to a predetermined rule and the waveform of the audio signal of the song in a section of a length corresponding to the temporary representative waveform, each expressed as a function of time, and the audio analysis unit generates the representative waveform by synchronously adding the waveforms of the audio signal of the song in the sections in which the peak of the cross-correlation function appears.
[6] The audio signal processing device according to any one of [1] to [5], wherein the position estimation unit and the audio analysis unit perform the positioning using the cross-correlation function and the generation of the representative waveform for each classification based on the result of calculating a cross-correlation function between waveforms of the second part.
[7] The audio signal processing device according to any one of [1] to [6], wherein the audio analysis unit generates the representative waveform by weighting the waveforms of the sound generation sections of the second part in the audio signal of the song according to their positions in the song and synchronously adding them.
[8] The audio signal processing device according to [7], wherein the audio analysis unit weights the waveforms of the sound generation sections of the second part in the audio signal of the song according to the beat number and synchronously adds them.
[9] The audio signal processing device according to [7] or [8], wherein the audio analysis unit weights the waveforms of the sound generation sections of the second part in the audio signal of the song according to their classification as on-beat or off-beat and synchronously adds them.
[10] The audio signal processing device according to any one of [1] to [9], wherein the audio analysis unit generates the representative waveform by applying a different fade-out curve for each band to the synchronously added waveform.
[11] An audio signal processing method comprising: an audio analysis step of generating a representative waveform of a second part in a song including a first part and the second part that are separable in terms of audio; and a position estimation step of performing positioning using a cross-correlation function, wherein the audio analysis step includes generating the representative waveform by synchronously adding, at the positions estimated in the position estimation step, waveforms of sound generation sections of the second part in an audio signal of the song.
[12] A program causing a computer to function as an audio signal processing device comprising: an audio analysis unit that generates a representative waveform of a second part in a song including a first part and the second part that are separable in terms of audio; and a position estimation unit that performs positioning using a cross-correlation function, wherein the audio analysis unit generates the representative waveform by synchronously adding, at the positions estimated by the position estimation unit, waveforms of sound generation sections of the second part in an audio signal of the song.

FIG. 1 is a diagram showing the overall configuration of a system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a schematic functional configuration of the audio signal processing device in the example of FIG. 1.
FIG. 3 is a flowchart showing the overall flow of processing by the audio analysis unit shown in FIG. 2.
FIG. 4 is a diagram schematically showing the waveform configuration of a kick sound.
FIG. 5 is a flowchart showing the sound generation position detection process in millisecond units shown in FIG. 3.
FIG. 6 is a flowchart showing the sound generation position detection process in 10-microsecond units shown in FIG. 3.
FIG. 7 is a flowchart showing the band representative waveform generation process shown in FIGS. 5 and 6.
FIG. 8 is a diagram for conceptually explaining the cross-correlation function calculation process shown in FIG. 7.
FIG. 9 is a diagram for conceptually explaining the weighted synchronous addition process shown in FIG. 7.
FIG. 10 is a flowchart showing the sound generation position detection process shown in FIGS. 5 and 6.
FIG. 11 is a diagram for conceptually explaining the sound generation position detection process shown in FIG. 10.
FIG. 12 is a flowchart showing the Kick representative waveform generation process shown in FIG. 3.
FIG. 13 is a diagram for conceptually explaining an example in which different fade-out curves are applied for each band.
FIG. 14 is a flowchart showing the kick sound removed audio generation process shown in FIG. 3.
FIG. 15 is a diagram for conceptually explaining the process of adding the opposite-phase signals shown in FIG. 14.

FIG. 1 is a diagram showing the overall configuration of a system according to an embodiment of the present invention. The system 10 according to this embodiment includes a PC (Personal Computer) 100, a DJ controller 200, and speakers 300. The PC 100 is a device that stores, processes, and reproduces audio data; it is not limited to a PC and may be a terminal device such as a tablet or a smartphone. The PC 100 includes a display 101 that displays information to the user and an input device, such as a touch panel or a mouse, that obtains operation input from the user. The DJ controller 200 is connected to the PC 100 via a communication means such as USB (Universal Serial Bus) and obtains user operation input regarding music playback through a channel fader, a cross fader, performance pads, a jog dial, various knobs and buttons, and the like. The audio data is reproduced using, for example, the speakers 300.

In this embodiment, the PC 100 functions as the audio signal processing device in the system 10 described above. For example, the PC 100 executes processing corresponding to the user's operation input on the stored audio data when the audio data is reproduced. Alternatively, the PC 100 may process the audio data before playback and save the processed audio data. In this case, the DJ controller 200 and the speakers 300 need not be connected to the PC 100 when the processing is executed. Although the PC 100 functions as the audio signal processing device in this embodiment, in other embodiments DJ equipment such as a mixer or an all-in-one DJ system (a digital audio player with communication and mixing functions) may function as the audio signal processing device. Further, a server connected to the PC or the DJ equipment via a network may function as the audio signal processing device.

FIG. 2 is a block diagram showing a schematic functional configuration of the audio signal processing device in the example of FIG. 1. The PC 100 functioning as the audio signal processing device includes an audio analysis unit 120, a display unit 140, a mix processing unit 150, and an operation unit 160. These functions are implemented by a processor such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor) operating according to a program. The program is read from the storage of the PC 100 or from a removable recording medium, or is downloaded from a server via a network, and is loaded into the memory of the PC 100.

Music audio data 110, which includes a first part and a second part that are separable in terms of audio, is input to the audio analysis unit 120. In this embodiment, the first part is the vocal and/or instrument sound part other than the kick sound, and the second part is the kick sound part. Here, the kick sound is a bass drum sound or a synthesized sound that imitates a bass drum sound. The audio analysis unit 120 extracts kick sound removed audio data 131, Kick unit sound data 132, and Kick pronunciation data 133 from the music audio data 110 using, for example, a music separation engine. The kick sound removed audio data 131 is audio data obtained by removing the kick sound from the music audio data 110, that is, the audio data of the first part. The Kick unit sound data 132 is data on the unit sound of the kick sound included in the music audio data 110, that is, the unit sound of the second part (hereinafter also referred to as the Kick unit sound). The Kick pronunciation data 133 is data indicating the sound generation position of the kick sound in the music audio data 110. The sound generation position is the temporal position at which the kick sound is sounded in the music audio data 110 and is recorded, for example, as a time code within the song or as a count in units of bars and beats.

A unit sound is a sound extracted with one sounding of the second part as the unit. In the following description, the waveform of the unit sound is also referred to as the representative waveform of the second part. For example, the audio analysis unit 120 separates the kick sound part from the music audio data 110, divides the kick sound part into individual soundings, and extracts unit sounds by classifying the soundings based on the characteristics of their audio waveforms. A plurality of unit sounds with different audio waveform characteristics may be extracted. The Kick unit sound data 132 may be, for example, audio data sampled from the kick sound part, information on the temporal position at which the unit sound is sounded in the kick sound part, audio data of a sample sound similar to the extracted sound, or an identifier of such a sample sound.

The display unit 140 displays information based on the Kick unit sound data 132 or the Kick pronunciation data 133 on, for example, the display 101 of the PC 100. The operation unit 160, in turn, obtains the user's operation input to an input device of the PC 100 such as a touch panel or a mouse. Specifically, for example, the display unit 140 displays the audio waveform of the song (which may be based on the music audio data 110 or on the kick sound removed audio data 131) together with the kick sounds associated with the waveform, and the operation unit 160 obtains a user operation that changes the sound generation position of a kick sound to an arbitrary position within the song. Alternatively, the display unit 140 may display arrangements of kick sounds according to preset rhythm patterns, and the operation unit 160 may obtain a user operation that selects a rhythm pattern. Note that when the arrangement of the kick sounds is changed according to a preset rhythm pattern, for example, the positions of the kick sounds may be determined automatically without any user operation. In this case, the display unit 140 and the operation unit 160 described above need not be included in the functions of the audio signal processing device.

The mix processing unit 150 generates mixed audio data 170 based on the kick sound removed audio data 131 and the Kick unit sound data 132. The mixed audio data 170 is audio data in which the rearranged Kick unit sound is mixed into the kick sound removed audio data 131. The sound generation positions of the Kick unit sound in the mixed audio data 170 are determined according to the user operation obtained by the operation unit 160 as described above, or according to an automatically determined rhythm pattern. Here, the sound generation positions of the Kick unit sound in the mixed audio data 170 may include positions different from the sound generation positions of the kick sound in the original music audio data 110.
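As an illustration of the mixing described above, the following minimal sketch adds a unit sound into kick-removed audio at new positions. It assumes both signals are plain lists of samples at the same sample rate and that positions are given as sample indices; the function name `mix_kick` and the `gain` parameter are hypothetical and do not appear in the document.

```python
def mix_kick(kick_removed, kick_unit, positions, gain=1.0):
    """Return a copy of kick_removed with kick_unit added at each start index."""
    mixed = list(kick_removed)
    for start in positions:
        for offset, sample in enumerate(kick_unit):
            index = start + offset
            if 0 <= index < len(mixed):  # drop samples that fall outside the song
                mixed[index] += gain * sample
    return mixed


kick_removed = [0.0] * 10        # stand-in for the first part
kick_unit = [1.0, 0.5]           # stand-in for the Kick unit sound
mixed = mix_kick(kick_removed, kick_unit, positions=[0, 4, 9])
print(mixed)
```

The positions passed in need not match the original kick positions, which is what allows the rhythm pattern to be rearranged.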

FIG. 3 is a flowchart showing the overall flow of processing by the audio analysis unit shown in FIG. 2. As shown in the figure, the audio analysis unit 120 first detects kick sounds in units of sixteenth notes (step S110) and classifies the detected kick sounds (step S120). The detection process in step S110 is executed using, for example, the technique described in International Publication No. 2017/168644. The classification process in step S120 is executed, for example, by calculating a correlation function between the waveforms of the detected kick sounds and clustering them. Steps S110 and S120 identify the rough position (presence or absence) of each kick sound in units of sixteenth notes and its classification, in preparation for identifying the sound generation position and the representative waveform of the kick sound with higher accuracy.
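The classification in step S120 is described only as correlation followed by clustering. One possible reading is sketched below, under several assumptions: the detected waveforms have equal length, similarity is the zero-lag normalized correlation, and clustering is a simple greedy pass with a threshold of 0.8. The function names, the threshold, and the clustering strategy are all illustrative choices, not details taken from the document.

```python
import math


def normalized_correlation(a, b):
    """Zero-lag normalized correlation of two equal-length waveforms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def classify_kicks(waveforms, threshold=0.8):
    """Greedy clustering: join the first class whose exemplar is similar enough."""
    exemplars, labels = [], []
    for waveform in waveforms:
        for index, exemplar in enumerate(exemplars):
            if normalized_correlation(waveform, exemplar) >= threshold:
                labels.append(index)
                break
        else:  # no existing class matched, so start a new one
            exemplars.append(waveform)
            labels.append(len(exemplars) - 1)
    return labels


kick_a = [1.0, 0.6, 0.2, -0.1]
kick_b = [0.2, -0.9, 0.7, -0.3]
detected = [kick_a, kick_b, [0.9 * x for x in kick_a], [1.1 * x for x in kick_b]]
print(classify_kicks(detected))
```

Each resulting label would then drive one pass of the per-classification loop that follows.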

Next, loop processing is executed for each kick sound classification identified in step S120 (step S130). Specifically, the sound generation position detection process in millisecond units (step S140), the sound generation position detection process in 10-microsecond units (step S150), the Kick representative waveform generation process (step S160), and the kick sound removed audio generation process (step S170) are executed for each identified kick sound classification.

Before each process is explained, the waveform configuration of the kick sound handled in this embodiment is described. FIG. 4 is a diagram schematically showing the waveform configuration of the kick sound. As shown in the figure, the waveform of the kick sound includes an attack portion (ATTACK) and a body rumble portion (SUSTAIN). Since the frequency band and the duration differ between the attack portion and the body rumble portion, distinguishing between them makes it possible to detect the sound generation position of the kick sound, and to generate the Kick representative waveform and the kick sound removed audio, with higher accuracy. Note that such a waveform configuration is found not only in kick sounds but also in other instrument sounds, for example percussion sounds such as drum sounds including hi-hats and snares.

FIG. 5 is a flowchart showing the sound generation position detection process in millisecond units shown in FIG. 3. In this process, the audio signal of the song is first processed with a 200 Hz low-pass filter to extract the audio signal in the body rumble band (first frequency band) of the kick sound (step S141). By executing the band representative waveform generation process (step S142) and the sound generation position detection process (step S143) on the body rumble band audio signal (first band audio signal), the sound generation position of the kick sound can be detected in millisecond units.

FIG. 6 is a flowchart showing the sound generation position detection process in 10-microsecond units shown in FIG. 3. In this process, the audio signal of the song is first processed with a 3 kHz high-pass filter to extract the audio signal in the attack band (second frequency band) of the kick sound (step S151). By executing the band representative waveform generation process (step S152) and the sound generation position detection process (step S153) on the attack band audio signal (second band audio signal), the sound generation position of the kick sound can be detected in 10-microsecond units.
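The two band extractions (steps S141 and S151) can be sketched as follows. The document gives only the cutoff frequencies (200 Hz and 3 kHz); the first-order filter design, the 44.1 kHz sample rate, the test tones, and all function names are assumptions made for illustration.

```python
import math


def one_pole_lowpass(signal, cutoff_hz, sample_rate_hz):
    """First-order IIR low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    output, state = [], 0.0
    for sample in signal:
        state += a * (sample - state)
        output.append(state)
    return output


def one_pole_highpass(signal, cutoff_hz, sample_rate_hz):
    """Complementary high-pass: the input minus its low-passed version."""
    low = one_pole_lowpass(signal, cutoff_hz, sample_rate_hz)
    return [x - y for x, y in zip(signal, low)]


def rms(signal):
    return math.sqrt(sum(x * x for x in signal) / len(signal))


SAMPLE_RATE_HZ = 44100  # assumed value
low_tone = [math.sin(2 * math.pi * 60 * n / SAMPLE_RATE_HZ) for n in range(4410)]
high_tone = [math.sin(2 * math.pi * 6000 * n / SAMPLE_RATE_HZ) for n in range(4410)]

# Body rumble band (200 Hz low-pass) keeps the 60 Hz tone and suppresses 6 kHz.
print(rms(one_pole_lowpass(low_tone, 200, SAMPLE_RATE_HZ)),
      rms(one_pole_lowpass(high_tone, 200, SAMPLE_RATE_HZ)))
# Attack band (3 kHz high-pass) does the opposite.
print(rms(one_pole_highpass(low_tone, 3000, SAMPLE_RATE_HZ)),
      rms(one_pole_highpass(high_tone, 3000, SAMPLE_RATE_HZ)))
```

A practical implementation would likely use a steeper filter; the first-order design is chosen here only to keep the sketch short.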

FIG. 7 is a flowchart showing the band representative waveform generation process (steps S142 and S152) shown in FIGS. 5 and 6. In the band representative waveform generation process, a temporary band representative waveform is first extracted, according to a predetermined rule, from the band audio signal handled in the respective process (the audio signal in the body rumble band or the attack band of the kick sound) (step S210). For example, if one bar of the song consists of four beats, the kick sounds on the second and fourth beats are excluded because they are likely to sound simultaneously with the snare, and the kick sound on the first beat is excluded because it is likely to sound simultaneously with a cymbal. In that case, the waveform of the kick sound on the third beat is extracted as the temporary band representative waveform. If there are multiple kick sounds on the third beat, a histogram of the kick sound levels of the individual bars may be created, and a third-beat kick sound belonging to a class with a high frequency of occurrence may be selected. If there is no kick sound on the third beat, the temporary band representative waveform may be extracted from a kick sound on the first beat, since the cymbal sound overlaps relatively little with the kick sound in frequency.
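The selection rule in step S210 could be sketched as below. The representation of each kick as a `(bar, beat, level)` tuple, the histogram bin width, and the choice of the first kick in the most populated bin are assumptions made for illustration; the document does not specify how the histogram classes are defined.

```python
from collections import Counter


def pick_temporary_kick(kicks, bin_width=0.1):
    """Pick the kick used as the temporary representative.

    kicks: list of (bar, beat, level) tuples. Beat 3 is preferred and beat 1 is
    the fallback; beats 2 and 4 are never used. Among the candidates, the first
    kick whose level falls in the most populated histogram bin is returned.
    """
    for preferred_beat in (3, 1):
        candidates = [k for k in kicks if k[1] == preferred_beat]
        if not candidates:
            continue
        bins = Counter(int(k[2] / bin_width) for k in candidates)
        top_bin = bins.most_common(1)[0][0]
        for kick in candidates:
            if int(kick[2] / bin_width) == top_bin:
                return kick
    return None


kicks = [(1, 1, 0.92), (1, 3, 0.82), (2, 3, 0.35), (3, 3, 0.85), (4, 3, 0.87), (4, 2, 0.81)]
print(pick_temporary_kick(kicks))
```

Choosing from the most populated level class avoids an unusually quiet or loud third-beat kick, such as the 0.35 outlier in this example.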

Note that the band representative waveform generation process is executed both in the sound generation position detection process in millisecond units (as step S142) and in the sound generation position detection process in 10-microsecond units (as step S152). Because each process operates on an audio signal in a different band, the temporary band representative waveforms differ between the two processes even if the sound generation position of the temporary band representative waveform is determined in step S210 according to a common rule.

Next, using the temporary band representative waveform determined in step S210, loop processing is performed for each kick sound to be processed (step S220). Specifically, for each kick sound, a cross-correlation function with the temporary band representative waveform is calculated within a predetermined search range (step S230), and the waveforms of the band audio signal in the sections where the peak of the cross-correlation function appears are weighted and synchronously added (step S240), whereby the band representative waveform of the kick sound is generated.

FIG. 8 is a diagram for conceptually explaining the cross-correlation function calculation process shown in FIG. 7. In step S230 shown in FIG. 7, a section $S_1$ of a length corresponding to the temporary band representative waveform $S_{temp}$ is extracted from the band audio signal $S_0$ of the song, the respective waveforms are expressed as functions $f_{temp}(t)$ and $f_1(t)$ of time $t$, and the cross-correlation function $\varphi_1(\tau) = f_{temp}(t) * f_1(t+\tau)$ is calculated for the provisionally set temporal relationship. Here, $\tau$ is the amount of deviation caused by the provisional placement. The amount of deviation can be estimated from the position of the peak of the obtained cross-correlation function $\varphi_1(\tau)$. In other words, the sound generation section of the kick sound in the band audio signal can be specified.
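A discrete version of this shift estimate can be sketched as follows. The sketch assumes sampled waveforms held in plain lists and evaluates the cross-correlation as a sum of products over a range of integer lags; the function names and the example values are illustrative only.

```python
def cross_correlation(template, segment, max_lag):
    """Return {tau: sum over t of template[t] * segment[t + tau]} for |tau| <= max_lag."""
    phi = {}
    for tau in range(-max_lag, max_lag + 1):
        total = 0.0
        for t, value in enumerate(template):
            index = t + tau
            if 0 <= index < len(segment):  # samples outside the segment count as zero
                total += value * segment[index]
        phi[tau] = total
    return phi


def estimate_shift(template, segment, max_lag):
    """Return the lag at which the cross-correlation peaks."""
    phi = cross_correlation(template, segment, max_lag)
    return max(phi, key=phi.get)


template = [1.0, 0.6, -0.4, 0.2]                 # stand-in for the temporary waveform
segment = [0.0] * 3 + template + [0.0] * 3       # same shape, placed 3 samples late
print(estimate_shift(template, segment, max_lag=5))
```

Here the peak appears at a lag of 3 samples, which is the deviation between the provisional placement and the actual position of the waveform.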

FIG. 9 is a diagram for conceptually explaining the weighted synchronous addition process shown in FIG. 7. The process of specifying the sound generation section of a kick sound in the band audio signal, explained above with reference to FIG. 8, is executed for all the kick sounds to be processed that are included in the band audio signal $S_0$. In step S240 shown in FIG. 7, a band representative waveform $S_p$ is generated by synchronously adding the waveforms of the band audio signal $S_0$ in the sound generation sections of the individual kick sounds. In this embodiment, the waveforms of the sound generation sections of the kick sounds are weighted with weighting coefficients $W_1, W_2, W_3, \ldots$ set according to a predetermined rule and are then synchronously added.

Synchronous addition is a method applied when waveforms with the same characteristics appear repeatedly in a signal: the waveforms are time-aligned, added, and averaged, so that components uncorrelated with the repeated waveform are reduced through phase cancellation and a waveform close to the repeated waveform itself is obtained. However, the desired effect is difficult to obtain unless the waveforms are precisely aligned in time. In the synchronous addition of step S240 shown in FIG. 7, the sound generation section of each kick sound has been specified in the preceding step S230 as the section in which the peak of the cross-correlation function with the temporary band representative waveform appears, so a band representative waveform $S_p$ with reduced noise can be obtained.
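The effect of synchronous addition can be demonstrated with synthetic data. In the following sketch a short hand-made waveform stands in for the kick sound, and a sinusoid whose phase differs at every onset stands in for the other parts; the signal values, the onset spacing, and the function names are all assumptions made for illustration.

```python
import math


def synchronous_average(signal, positions, length):
    """Average the sections of `signal` that start at each of `positions`."""
    accumulator = [0.0] * length
    for start in positions:
        for i in range(length):
            accumulator[i] += signal[start + i]
    return [value / len(positions) for value in accumulator]


def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


kick = [1.0, 0.7, 0.3, -0.2, -0.4, -0.1, 0.0, 0.0]
positions = [50 * k for k in range(1, 17)]       # 16 precisely known onsets
# Interference: a sinusoid that is not phase-locked to the onsets.
signal = [0.5 * math.sin(2.0 * math.pi * n / 7.3) for n in range(1000)]
for start in positions:
    for i, value in enumerate(kick):
        signal[start + i] += value

single = signal[positions[0]:positions[0] + len(kick)]
averaged = synchronous_average(signal, positions, len(kick))
print(rms_error(single, kick), rms_error(averaged, kick))
```

The averaged section is much closer to the underlying waveform than any single section, because the interference arrives with a different phase at each onset and largely cancels. If the onsets were misaligned, the kick itself would start to cancel as well, which is why accurate positions matter.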

The weighting coefficients $W_1, W_2, W_3, \ldots$ in the synchronous addition, on the other hand, are set depending on, for example, the position of the kick sound generation section in the song. For example, as in the extraction of the temporary band representative waveform above, the weighting coefficients may be set so that kick sounds likely to sound simultaneously with other drum sounds receive a lower weight and kick sounds unlikely to do so receive a higher weight. For example, if one bar of the song consists of four beats, kick sounds on the second and fourth beats are likely to sound simultaneously with the snare and are given the lowest weight, and kick sounds on the first beat are likely to sound simultaneously with a cymbal and are given the next lowest weight. In this case, the waveforms of the sound generation sections of the kick sounds are weighted according to the beat number and synchronously added. In addition, even on the first, second, and fourth beats, a kick sound on the on-beat is likely to sound simultaneously with other drum sounds, whereas a kick sound on the off-beat (backbeat) is unlikely to, so a kick sound on the off-beat may be given a larger weight regardless of the beat number. The weight ratio set from this viewpoint may be, for example, 0.8/0.5/1.0/0.5 for the first/second/third/fourth beat on the on-beat, and 1.0 on the off-beat regardless of the beat number. In this case, the waveforms of the sound generation sections of the kick sounds are weighted according to the on-beat/off-beat classification and synchronously added.
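The example weight ratio above can be written down as a small lookup combined with a weighted average. The weight values come from the text; dividing by the sum of the weights, the boolean `on_beat` flag, and the function names are assumptions of this sketch.

```python
ON_BEAT_WEIGHTS = {1: 0.8, 2: 0.5, 3: 1.0, 4: 0.5}  # example ratio given in the text
OFF_BEAT_WEIGHT = 1.0                                # same for every beat number


def kick_weight(beat, on_beat):
    """Weight for a kick on the given beat (1 to 4), on the beat or off it."""
    return ON_BEAT_WEIGHTS[beat] if on_beat else OFF_BEAT_WEIGHT


def weighted_synchronous_average(sections, weights):
    """Weighted average of time-aligned, equal-length sections.

    Normalizing by the sum of the weights is an assumption; the document only
    states that the sections are weighted and synchronously added.
    """
    total = sum(weights)
    length = len(sections[0])
    return [sum(w * s[i] for s, w in zip(sections, weights)) / total
            for i in range(length)]


sections = [[1.0, 0.0], [0.0, 1.0]]                       # toy aligned sections
weights = [kick_weight(3, True), kick_weight(2, True)]    # 1.0 and 0.5
print(weighted_synchronous_average(sections, weights))
```

With these weights, a section taken from the third beat contributes twice as much to the result as one taken from the second beat.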

FIG. 10 is a flowchart showing the sound generation position detection process (steps S143 and S153) shown in FIGS. 5 and 6. In the sound generation position detection process, which follows the band representative waveform generation process described above with reference to FIGS. 7 to 9, loop processing is executed for each kick sound to be processed (step S310). Specifically, a cross-correlation function between the band representative waveform and the waveform of the band audio signal is calculated within a predetermined search range (step S320), and the section in which the peak of the cross-correlation function appears is detected as the sound generation position of each kick sound (step S330).

FIG. 11 is a diagram for conceptually explaining the sound generation position detection process shown in FIG. 10. In step S320, a section $S_2$ of a length corresponding to the band representative waveform $S_p$ is extracted from the band audio signal $S_0$ of the song, the respective waveforms are expressed as functions $f_p(t)$ and $f_2(t)$ of time $t$, and the cross-correlation function $\varphi_2(\tau) = f_p(t) * f_2(t+\tau)$ is calculated for the provisionally set temporal relationship. Here, $\tau$ is the amount of deviation caused by the provisional placement. The amount of deviation can be estimated from the position of the peak of the obtained cross-correlation function $\varphi_2(\tau)$. That is, the sound generation positions $P_1, P_2, P_3, \ldots$ of the kick sounds can be specified.

In this embodiment, the sound generation sections of the kick sounds are specified from the cross-correlation function $\varphi_1(\tau)$ with the temporary band representative waveform $S_{temp}$ in the band representative waveform generation process shown in FIGS. 7 and 8, and the sound generation positions of the kick sounds are then detected from the cross-correlation function $\varphi_2(\tau)$ with the band representative waveform $S_p$ in the sound generation position detection process shown in FIGS. 10 and 11. These processes may seem redundant at first glance, but they are not. Because the temporary band representative waveform $S_{temp}$ is simply extracted from the band audio signal $S_0$ of the song on a rule basis, it contains noise such as instrument sounds other than the kick sound. The band representative waveform $S_p$, in contrast, is a waveform in which such noise has been reduced by synchronously adding the waveforms of the sound generation sections of the kick sounds, and it represents the characteristics of the original kick sound waveform more accurately. In other words, in this embodiment, the cross-correlation function $\varphi_1(\tau)$ with the rule-based temporary band representative waveform $S_{temp}$ is first used to specify reasonably accurate sound generation sections of the kick sounds, and the cross-correlation function $\varphi_2(\tau)$ with the band representative waveform $S_p$, obtained by synchronously adding the waveforms of those sections, is then used to specify more accurate sound generation positions.

When the band representative waveform generation process and the sound generation position detection process described above are executed within the sound generation position detection process in millisecond units (steps S142 and S143 shown in FIG. 5), the sound generation position of the kick sound has already been detected in units of sixteenth notes by the preceding process (step S110 shown in FIG. 3). The cross-correlation function is therefore calculated using, for example, the span of a sixteenth note (approximately 100 milliseconds) as the search range (first search range). As explained with reference to FIG. 4, the waveform of the body rumble band of the kick sound has no steep peak but has a certain duration, so although the accuracy is lower than with the attack band described next, the calculation load can be kept low even when the search range is wide.

On the other hand, when the band representative waveform generation process and the sound generation position detection process are executed in the 10-microsecond-unit sound generation position detection process (steps S152 and S153 shown in FIG. 6), the sound generation position of the kick sound in millisecond units has already been detected in the previous process (step S140 shown in FIG. 3). The cross-correlation function is therefore calculated with, for example, several milliseconds before and after the millisecond-unit sound generation position of the kick sound as the search range (second search range). As explained with reference to FIG. 4, the waveform of the attack band of the kick sound shows a steep peak but has a short duration, so while the accuracy is high, a wide search range increases the calculation load. If the sound generation position were detected from the beginning using a waveform that includes the attack band, the calculation load could therefore become large. In this embodiment, the millisecond-unit sound generation position detection process is performed first to narrow the search range down to a few milliseconds, which makes it possible to detect the sound generation position with high accuracy while suppressing the calculation load.
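
The coarse-to-fine narrowing of the search range can be sketched as follows. This is only an illustration under stated assumptions: the toy waveforms, the sample counts, the `step` parameter used to emulate a coarser time unit in the first stage, and the function names are hypothetical and do not appear in the embodiment.

```python
def correlate_at(signal, template, lag):
    """Cross-correlation value for a single lag."""
    return sum(signal[lag + n] * template[n] for n in range(len(template)))


def search_peak(signal, template, center, half_range, step=1):
    """Return (best lag, number of lags evaluated) within center +/- half_range."""
    lo = max(0, center - half_range)
    hi = min(len(signal) - len(template), center + half_range)
    lags = range(lo, hi + 1, step)
    best = max(lags, key=lambda lag: correlate_at(signal, template, lag))
    return best, len(lags)


# Toy data: a smooth body-rumble waveform and a sharp attack waveform that
# both start at sample 103 of their respective band signals.
body = [1.0, 2.0, 3.0, 2.0, 1.0]
attack = [1.0, -1.0, 0.5]
true_position = 103

low_band = [0.0] * 200
high_band = [0.0] * 200
for n, v in enumerate(body):
    low_band[true_position + n] = v
for n, v in enumerate(attack):
    high_band[true_position + n] = v

# First search range: wide, coarse steps, body rumble band.
coarse, n_coarse = search_peak(low_band, body, center=96, half_range=48, step=4)

# Second search range: a few samples around the coarse result, attack band.
fine, n_fine = search_peak(high_band, attack, center=coarse, half_range=5)

# Number of lags a full-resolution search over the whole signal would need.
n_full = len(high_band) - len(attack) + 1

print(coarse, fine, n_coarse + n_fine, n_full)
```

Here the coarse pass lands one sample away from the true position and the fine pass corrects it, while the two passes together evaluate far fewer lags than a full-resolution search over the whole signal.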

FIG. 12 is a flowchart showing the Kick representative waveform generation process shown in FIG. 3. In the Kick representative waveform generation process, weighted synchronous addition is performed on the waveforms of the sound generation sections of the Kick sound in the audio signal of the song, based on the sound generation positions in 10-microsecond units detected in the 10-microsecond-unit sound generation position detection process (step S150 shown in FIG. 3) (step S161). The weighted synchronous addition is, for example, a synchronous addition process similar to that in step S240 of the band representative waveform generation process shown in FIGS. 7 and 8. The difference from step S240 is that, in step S161 of the Kick representative waveform generation process shown in FIG. 12, the signal waveforms of the Kick sound including both the body rumble band and the attack band are synchronously added. By performing synchronous addition based on accurate sound generation positions in units of 10 microseconds, it is possible to obtain a Kick representative waveform in which noise such as instrument sounds other than the Kick sound is reduced as far as possible.

In this embodiment, weighted synchronous addition is executed in the generation of the band representative waveform in the millisecond-unit sound generation position detection process (step S142 shown in FIG. 5), in the generation of the band representative waveform in the 10-microsecond-unit sound generation position detection process (step S152 shown in FIG. 6), and in the Kick representative waveform generation process (step S161 shown in FIG. 12). Although the targets differ (the body rumble band of the kick sound in step S142, the attack band of the kick sound in step S152, and the kick sound including both the body rumble band and the attack band in step S161), the weighting coefficients used in the respective weighted synchronous addition processes may be the same or different.
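
A minimal sketch of weighted synchronous addition is given here for illustration. The weight values are hypothetical, and the text above does not state how the coefficients are chosen for each step; the sketch only shows how a lower weight limits the influence of a section in which another part overlaps the kick sound.

```python
def weighted_synchronous_addition(signal, positions, weights, length):
    """Weighted average of the sections that start at each position."""
    total = sum(weights)
    out = [0.0] * length
    for pos, w in zip(positions, weights):
        for n in range(length):
            out[n] += w * signal[pos + n]
    return [v / total for v in out]


# Toy signal: the same 2-sample kick at three positions; at the third position
# another part (for example a vocal) overlaps the first sample.
kick = [2.0, -1.0]
positions = [0, 4, 8]
signal = [0.0] * 12
for pos in positions:
    for n, v in enumerate(kick):
        signal[pos + n] += v
signal[8] += 3.0  # overlapping sound that is not part of the kick

# Equal weights: the overlap leaks into the result.
plain = weighted_synchronous_addition(signal, positions, [1.0, 1.0, 1.0], len(kick))

# Hypothetical weights that trust the third section less.
weighted = weighted_synchronous_addition(signal, positions, [1.0, 1.0, 0.25], len(kick))

print(plain, weighted)
```

With the reduced weight the first sample of the result moves closer to the clean kick value, while the uncontaminated second sample is unchanged.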

Referring again to FIG. 12, next, a different fade-out curve is applied for each band to the waveform obtained by the weighted synchronous addition in step S161 (step S162). Even after synchronous addition, the waveform does not consist solely of the Kick sound, and other parts such as the snare and vocals remain. To further remove the remaining snare and vocals, a filter that extracts the kick sound waveform as shown in FIG. 4, for example, is applied. As can be seen from FIG. 4, the kick sound is composed of high frequencies in the attack part (ATTACK) at the beginning of the sound and low frequencies in the body rumble part (SUSTAIN), which occupies most of the subsequent period. Since snares and vocals lie in the mid-range, most of the low range of the kick sound can be extracted with an LPF whose cutoff frequency is at the boundary (about 200 Hz). However, if the entire kick sound waveform is filtered in this way, the attack part, which is in the high frequency range as mentioned above, is lost. To avoid this, a filter that passes high frequencies (such as a high-pass filter of 640 Hz or higher) is used only for the attack part, and a filter that passes only low frequencies (such as a low-pass filter of 200 Hz or lower) is used for the body rumble part. For the region between them, a band-pass filter of 200 Hz to 640 Hz is used. Finally, the outputs of the filters are combined to extract the desired kick waveform.

More specifically, for example, as shown in FIG. 13, the kick sound waveform obtained by weighted synchronous addition is divided by three band filters, and a different fade-out curve is applied to each band. Specifically, for example, a short (for example, 20 milliseconds long) fade-out curve for the attack part is applied to the signal that has passed a high-pass filter of 640 Hz or higher, a medium-length (for example, 60 milliseconds long) fade-out curve for the beginning of the body rumble is applied to the signal that has passed a band-pass filter of 200 Hz to 640 Hz, and a long (for example, 100 milliseconds to 500 milliseconds long) fade-out curve for the entire body rumble part is applied to the signal that has passed a low-pass filter of 200 Hz or lower. Note that the length of the fade-out curve for the entire body rumble part may be adjusted according to the amplitude envelope of the waveform after synchronous addition.
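
The band-wise fade-out can be sketched as follows. The filter design is an assumption made for brevity: simple one-pole low-pass filters and their complements stand in for the three band filters, the fade-out curves are linear ramps, and the 300-millisecond length for the low band is merely one illustrative value. None of these choices is taken from the embodiment.

```python
import math


def one_pole_lowpass(x, cutoff_hz, fs):
    """Very simple first-order low-pass filter (illustrative, gentle slope)."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y, state = [], 0.0
    for v in x:
        state = (1.0 - a) * v + a * state
        y.append(state)
    return y


def fade_out(x, fade_ms, fs):
    """Linear fade from gain 1 to gain 0 over fade_ms, silence afterwards."""
    n_fade = int(fs * fade_ms / 1000.0)
    return [v * max(0.0, 1.0 - i / n_fade) for i, v in enumerate(x)]


def shape_kick(waveform, fs):
    """Split into three bands, fade each band differently, and recombine."""
    lp200 = one_pole_lowpass(waveform, 200.0, fs)
    lp640 = one_pole_lowpass(waveform, 640.0, fs)
    low = lp200                                        # below about 200 Hz
    mid = [a - b for a, b in zip(lp640, lp200)]        # about 200 Hz to 640 Hz
    high = [a - b for a, b in zip(waveform, lp640)]    # above about 640 Hz
    low = fade_out(low, 300.0, fs)    # long fade for the whole body rumble
    mid = fade_out(mid, 60.0, fs)     # medium fade for the start of the rumble
    high = fade_out(high, 20.0, fs)   # short fade for the attack
    return [l + m + h for l, m, h in zip(low, mid, high)]


# Toy input: 0.4 s of a decaying 60 Hz tone sampled at 8 kHz.
fs = 8000
kick = [math.exp(-n / 800.0) * math.cos(2.0 * math.pi * 60.0 * n / fs)
        for n in range(3200)]
shaped = shape_kick(kick, fs)

print(shaped[0], max(abs(v) for v in shaped[2400:]))
```

Because the three bands in this sketch are complementary, they sum back to the input wherever all gains are 1, so only the tails of the mid and high bands, where residual snare or vocal energy would sit, are attenuated early.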

Through the Kick representative waveform generation process described above, it is possible to generate a Kick representative waveform in which noise such as other instrument sounds is minimized and the waveform features common to the Kick sounds are extracted. In this embodiment, since the Kick sound and other sounds are separated on the time axis using synchronous addition of waveforms, problems such as deterioration of sound quality due to separation on the frequency axis do not occur, and a high-quality Kick representative waveform can be generated.

FIG. 14 is a flowchart showing the kick sound removed audio generation process shown in FIG. 3. In the kick sound removed audio generation process, the Kick representative waveform generated in the process shown in FIGS. 12 and 13 is rearranged based on the sound generation positions in 10-microsecond units detected in the 10-microsecond-unit sound generation position detection process (step S150 shown in FIG. 3) (step S171). As a result, an audio signal is obtained in which only the kick sound included in the song is extracted. By adding the reverse phase signal of this audio signal (Kick sound rearranged audio signal) to the original audio signal of the song (step S172), a kick sound removed audio signal, in which only the kick sound has been removed from the audio signal of the song, can be generated.

FIG. 15 is a diagram for conceptually explaining the process of adding the reverse phase signal shown in FIG. 14. In the process of step S172 shown in FIG. 14, the reverse phase signal S _{Kick_Rev} of the rearranged kick sound S _Kick is added to the audio signal S of the song. If the rearranged Kick sound S _Kick corresponds to the Kick sound component included in the audio signal S of the song, the Kick sound is canceled by adding the reverse phase signal S _{Kick_Rev} , and an audio signal consisting only of the sounds other than the Kick sound included in the song can be obtained. At this time, in order to accurately remove only the kick sound component, the sound generation position and the waveform of the kick sound must be accurately specified. In this embodiment, as described above, a highly accurate sound generation position is detected by the 10-microsecond-unit sound generation position detection process (step S150 shown in FIG. 3), and an accurate Kick representative waveform is specified by the synchronous addition process using this highly accurate sound generation position (step S160 shown in FIG. 3). Therefore, only the Kick sound component is accurately removed, and high-quality Kick sound removed audio can be generated without deterioration in the sound quality of the sounds other than the Kick sound.
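
The rearrangement and reverse phase addition of steps S171 and S172 can be sketched as follows. The sketch assumes integer sample positions and a toy signal, and the function names are hypothetical. It also shows why an inaccurate position matters: shifting the rearranged kick by one sample leaves a clear residual.

```python
def rearrange(template, positions, total_length):
    """Place a copy of the representative waveform at every sound generation position."""
    out = [0.0] * total_length
    for pos in positions:
        for n, v in enumerate(template):
            if pos + n < total_length:
                out[pos + n] += v
    return out


def add_reverse_phase(song, kick_track):
    """Add the sign-inverted kick track to the song."""
    reverse_phase = [-v for v in kick_track]
    return [s + r for s, r in zip(song, reverse_phase)]


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


# Toy song: some other part plus a kick at two positions.
others = [0.1 * (i % 5) for i in range(20)]
kick = [1.0, -0.5, 0.25]
positions = [2, 10]
song = [o + k for o, k in zip(others, rearrange(kick, positions, len(others)))]

# Accurate positions: the kick cancels and only the other part remains.
removed = add_reverse_phase(song, rearrange(kick, positions, len(song)))
residual_aligned = energy([r - o for r, o in zip(removed, others)])

# Positions off by one sample: the cancellation fails.
shifted = [p + 1 for p in positions]
removed_bad = add_reverse_phase(song, rearrange(kick, shifted, len(song)))
residual_shifted = energy([r - o for r, o in zip(removed_bad, others)])

print(residual_aligned, residual_shifted)
```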

In the embodiment of the present invention described above, the sound generation position of the kick sound included in the audio signal of the song is specified as the peak position of the cross-correlation function with the band representative waveform. By accurately specifying the sound generation position of the kick sound in this way, it becomes possible to separate the kick sound from other sounds on the time axis, making it possible to achieve high sound quality and high resolution performance.

Note that the embodiment of the present invention described above is merely an example, and various changes are possible. For example, in the above embodiment, the audio analysis unit 120 executes the process of extracting the band representative waveform from the band audio signal, but this process is an example of the process of extracting the representative waveform of the second part from the audio signal of the song. In other embodiments, the representative waveform of the second part may be extracted from the audio signal of the song without extracting a specific frequency band, and the sound generation position of the second part may be detected based on the cross-correlation function between the representative waveform and sections of the audio signal of the song. Similarly, the process in which the audio analysis unit 120 generates a band representative waveform by synchronously adding waveforms of band audio signals is an example of the process in which the audio analysis unit 120 generates a representative waveform of the second part by synchronously adding waveforms of the audio signal of the song.

Further, for example, in the above embodiment, the first part of the song is the part other than the kick sound and the second part is the kick sound part, but there are no limitations on how the vocals and/or instrument sounds are divided between the first part and the second part. The second part may be any part from which a representative waveform can be extracted; for example, it may be a hi-hat or snare part, or a percussion part such as a drum sound in which a hi-hat or snare is added to a kick sound. As mentioned above, it is possible to extract multiple unit sounds with different audio waveform characteristics, so the second part may be a drum sound part, and the kick unit sound, the hi-hat unit sound, and the snare unit sound may each be rearranged.

Further, for example, in the above embodiment, the detection result of the kick sound production position by the audio analysis unit 120 and the extraction result of the kick unit sound are used to change the kick sound production position included in the original music audio data 110. However, in other embodiments, the mix audio data 170 does not necessarily have to be generated. For example, the kick unit sound data 132 may be extracted alone and used as a sample sound source for performance. Alternatively, only the Kick sound removed audio data 131 may be output without the Kick unit sound data, or the Kick sound of a song may be replaced with the kick sound of a different song or of a sample sound source based on the Kick sound removed audio data 131 and the Kick pronunciation data 133. The same applies to the case where the second part mentioned above is a sound other than the kick sound.

10...System, 100...PC, 101...Display, 110...Music audio data, 120...Audio analysis unit, 131...Kick sound removed audio data, 132...Kick unit sound data, 133...Kick pronunciation data, 140...Display unit, 150...Mix processing unit, 160...Operation unit, 170...Mix audio data, 200...DJ controller, 300...Speaker.

Claims

An audio signal processing device comprising:
an audio analysis unit that generates a representative waveform of a second part in a song that includes a first part and the second part that are separable in terms of audio; and
a position estimation unit that performs positioning using a cross-correlation function,
wherein the audio analysis unit generates the representative waveform by synchronously adding waveforms of the sound generation section of the second part in the audio signal of the song at the position estimated by the position estimation unit.
The audio signal processing device according to claim 1, wherein the position estimation unit calculates the cross-correlation function, as a function of time, between a band representative waveform of the second part extracted from a band audio signal obtained by extracting a predetermined frequency band from the audio signal of the song and the waveform of the band audio signal in a section having a length corresponding to the band representative waveform, and detects the section where a peak of the cross-correlation function appears as the sound generation position of the second part, and
the audio analysis unit generates the representative waveform by synchronously adding waveforms of the audio signal of the song with reference to the sound generation position.
The audio signal processing device according to claim 2, wherein the position estimation unit executes:
a first sound generation position detection process of detecting a first sound generation position using a first band audio signal obtained by extracting a first frequency band from the audio signal of the song; and
a second sound generation position detection process of detecting, based on the first sound generation position, a second sound generation position using a second band audio signal obtained by extracting a second frequency band from the audio signal of the song, and
the audio analysis unit generates the representative waveform by synchronously adding waveforms of the audio signal of the song with the second sound generation position as a reference.
The audio signal processing device according to claim 3, wherein the second part is composed of a kick sound, the first frequency band is a body rumble band of the kick sound, and the second frequency band is an attack band of the kick sound.
The audio signal processing device according to claim 1, wherein the position estimation unit calculates the cross-correlation function, as a function of time, between a temporary representative waveform extracted from the audio signal of the song according to a predetermined rule and the waveform of the audio signal of the song in a section having a length corresponding to the temporary representative waveform, and
the audio analysis unit generates the representative waveform by synchronously adding waveforms of the audio signal of the song for the sections where a peak of the cross-correlation function appears.
The audio signal processing device according to any one of claims 1 to 5, wherein the position estimation unit and the audio analysis unit execute the positioning using the cross-correlation function and the generation of the representative waveform for each classification based on the calculation result of a cross-correlation function between waveforms of the second part.
The audio signal processing device according to any one of claims 1 to 6, wherein the audio analysis unit generates the representative waveform by weighting and synchronously adding the waveforms of the sound generation section of the second part in the audio signal of the song according to the position within the song.
The audio signal processing device according to claim 7, wherein the audio analysis unit weights and synchronously adds the waveform of the sounding section of the second part in the audio signal of the song according to the number of beats.
9. The audio signal processing device in which the audio analysis unit weights and synchronously adds the waveforms of the sound generation section of the second part in the audio signal of the song according to a classification of upbeat or backbeat.
The audio signal processing device according to any one of claims 1 to 9, wherein the audio analysis section generates the representative waveform by applying a different fade-out curve for each band to the synchronously added waveform.
An audio signal processing method comprising:
an audio analysis step of generating a representative waveform of a second part in a song that includes a first part and the second part that are separable in terms of audio; and
a position estimation step of performing positioning using a cross-correlation function,
wherein the audio analysis step includes a step of generating the representative waveform by synchronously adding waveforms of the sound generation section of the second part in the audio signal of the song at the position estimated in the position estimation step.
A program for causing a computer to function as an audio signal processing device comprising:
an audio analysis unit that generates a representative waveform of a second part in a song that includes a first part and the second part that are separable in terms of audio; and
a position estimation unit that performs positioning using a cross-correlation function,
wherein the audio analysis unit generates the representative waveform by synchronously adding waveforms of the sound generation section of the second part in the audio signal of the song at the position estimated by the position estimation unit.