WO2024034115A1 - Audio signal processing device, audio signal processing method, and program - Google Patents


Info

Publication number
WO2024034115A1
WO 2024/034115 A1 (PCT/JP2022/030730)
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
sound
kick
waveform
band
Prior art date
Application number
PCT/JP2022/030730
Other languages
English (en)
Japanese (ja)
Inventor
Hajime Yoshino (吉野 肇)
Original Assignee
AlphaTheta Corporation (AlphaTheta株式会社)
Priority date
Filing date
Publication date
Application filed by AlphaTheta Corporation
Priority to PCT/JP2022/030730
Publication of WO2024034115A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10G: REPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G1/00: Means for the representation of music
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Definitions

  • the present invention relates to an audio signal processing device, an audio signal processing method, and a program.
  • Patent Document 1 describes an audio signal processing device including a sounding position information acquisition unit that acquires sounding position information indicating the sounding position of an arbitrary instrument included in a song, a search section specifying unit that specifies a search section for searching the sounding section of the arbitrary instrument sound based on the sounding position information, an extraction unit that extracts an amplitude value at a predetermined position in the search section, and a processing unit that processes audio data included in the search section based on the amplitude value extracted by the extraction unit.
  • Patent Document 1 also describes synchronous addition processing, but that processing averages synchronously subtracted spectrograms to reduce errors in the spectrogram shape; it does not synchronously add the waveforms of the audio data.
  • Because phase information of the audio data is not used there, that synchronous addition does not improve the S/N ratio (S: the instrument sound to be extracted, N: the other instrument sounds).
  • Accordingly, an object of the present invention is to provide an audio signal processing device, an audio signal processing method, and a program that can achieve high sound quality and high separation performance by separating musical instrument sounds on the time axis.
  • An audio signal processing device including an audio analysis unit that detects the sound generation position of a second part in a song including a first part and a second part that are acoustically separable, wherein the audio analysis unit calculates, as a function of time, a cross-correlation function between a representative waveform of the second part extracted from the audio signal of the song and the waveform of the audio signal of the song in a section corresponding to the representative waveform, and detects the section where a peak of the cross-correlation function appears as the sound generation position.
  • The audio signal processing device according to [1], wherein the audio analysis unit calculates the cross-correlation function within a predetermined search range, and executes a first sound generation position detection process of detecting a first sound generation position in a first search range using a first representative waveform, and a second sound generation position detection process of detecting, using a second representative waveform, a second sound generation position in a second search range that is set based on the first sound generation position and is smaller than the first search range.
  • In the above configuration, in the first sound generation position detection process, the audio analysis unit extracts a first frequency band from the audio signal of the song and detects the first sound generation position based on a cross-correlation function between a first band representative waveform extracted from the first band audio signal and a section of the first band audio signal; in the second sound generation position detection process, it detects the second sound generation position based on a cross-correlation function between a second band representative waveform extracted from a second band audio signal, obtained by extracting a second frequency band from the audio signal of the song, and a section of the second band audio signal.
  • In one aspect, the second part is composed of kick sounds, the first frequency band is the body rumble band of the kick sound, and the second frequency band is the attack band of the kick sound.
  • In one aspect, the audio analysis unit calculates, as a function of time, a tentative cross-correlation function between a temporary representative waveform extracted from the audio signal of the song according to a predetermined rule and the waveform of the audio signal of the song in a section whose length corresponds to the temporary representative waveform, and generates the representative waveform by synchronously adding the waveforms of the audio signal of the song in the sections where peaks of the tentative cross-correlation function appear.
  • An audio signal processing method for detecting the sound generation position of a second part in a song including a first part and a second part that are acoustically separable, the method comprising: calculating, as a function of time, a cross-correlation function between a representative waveform of the second part extracted from the audio signal of the song and the waveform of the audio signal of the song in a section corresponding to the representative waveform; and detecting the section where a peak of the cross-correlation function appears as the sound generation position.
  • A program for causing a computer to function as an audio signal processing device including an audio analysis unit that detects the sound generation position of a second part in a song including a first part and a second part that are acoustically separable, wherein the audio analysis unit calculates, as a function of time, a cross-correlation function between a representative waveform of the second part extracted from the audio signal of the song and the waveform of the audio signal of the song in a section corresponding to the representative waveform, and detects the section where a peak of the cross-correlation function appears as the sound generation position.
  • FIG. 1 is a diagram showing the overall configuration of a system according to an embodiment of the present invention.
  • FIG. 2 is a block diagram showing a schematic functional configuration of the audio signal processing device in the example of FIG. 1.
  • FIG. 3 is a flowchart showing the overall flow of processing by the audio analysis unit shown in FIG. 2.
  • FIG. 4 is a diagram schematically showing the waveform configuration of a kick sound.
  • FIG. 5 is a flowchart showing the sound generation position detection process in millisecond units shown in FIG. 3.
  • FIG. 6 is a flowchart showing the sound generation position detection process in units of 10 microseconds shown in FIG. 3.
  • FIG. 7 is a flowchart showing a process for generating the band representative waveform shown in FIGS. 5 and 6.
  • FIG. 8 is a diagram for conceptually explaining the cross-correlation function calculation process shown in FIG. 7.
  • FIG. 9 is a diagram for conceptually explaining the weighted synchronous addition process shown in FIG. 7.
  • FIG. 10 is a flowchart showing the sound generation position detection process shown in FIGS. 5 and 6.
  • FIG. 11 is a diagram for conceptually explaining the sound generation position detection process shown in FIG. 10.
  • FIG. 12 is a flowchart showing the kick representative waveform generation process shown in FIG. 3.
  • FIG. 13 is a diagram for conceptually explaining an example in which different fade-out curves are applied to each band.
  • FIG. 14 is a flowchart showing the kick sound removed sound generation process shown in FIG. 3.
  • FIG. 15 is a diagram for conceptually explaining the process of adding the anti-phase signal shown in FIG. 14.
  • FIG. 1 is a diagram showing the overall configuration of a system according to an embodiment of the present invention.
  • the system 10 includes a PC (Personal Computer) 100, a DJ controller 200, and speakers 300.
  • the PC 100 is a device that stores, processes, and reproduces audio data, and is not limited to a PC, but may be a terminal device such as a tablet or a smartphone.
  • the PC 100 includes a display 101 that displays information to the user, and an input device such as a touch panel or a mouse that obtains operation input from the user.
  • The DJ controller 200 is connected to the PC 100 via a communication means such as USB (Universal Serial Bus), and obtains user operation input regarding music playback via a channel fader, a crossfader, performance pads, a jog dial, and various knobs and buttons.
  • the audio data is reproduced using the speaker 300, for example.
  • the PC 100 functions as an audio signal processing device in the system 10 as described above.
  • the PC 100 executes processing corresponding to a user's operational input on the stored audio data when the audio data is reproduced.
  • the PC 100 may perform processing on the audio data before playback and save the processed audio data.
  • the DJ controller 200 and speakers 300 may not be connected to the PC 100 at the time the process is executed.
  • In this embodiment, the PC 100 functions as the audio signal processing device, but in other embodiments, DJ equipment such as a mixer or an all-in-one DJ system (a digital audio player with communication and mixing functions) may function as the audio signal processing device.
  • a server connected to a PC or DJ equipment via a network may function as the audio signal processing device.
  • FIG. 2 is a block diagram showing a schematic functional configuration of the audio signal processing device in the example of FIG. 1.
  • As shown, the PC 100 functioning as an audio signal processing device includes an audio analysis unit 120, a display unit 140, a mix processing unit 150, and an operation unit 160. These functions are implemented by a processor such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor) operating according to a program.
  • the program is read from the storage of the PC 100 or a removable recording medium, or downloaded from a server via a network, and expanded into the memory of the PC 100.
  • Song audio data 110 including a first part and a second part that are acoustically separable is input to the audio analysis unit 120.
  • the first part is a vocal and/or instrumental sound part other than the kick sound
  • the second part is a kick sound part.
  • the kick sound is a bass drum sound or a synthesized sound that imitates a bass drum sound.
  • the audio analysis unit 120 extracts kick sound removed audio data 131, kick unit sound data 132, and kick pronunciation data 133 from the music audio data 110 using, for example, a music separation engine.
  • the kick sound removed audio data 131 is audio data obtained by removing the kick sound from the song audio data 110, that is, the audio data of the first part.
  • the Kick unit sound data 132 is data of the Kick sound included in the music audio data 110, that is, the unit sound of the second part (hereinafter also referred to as Kick unit sound).
  • The kick sound generation data 133 is data indicating the sound generation positions of the kick sound in the music audio data 110.
  • The sound generation position is the temporal position at which the kick sound sounds in the music audio data 110, and is recorded, for example, as a time code within the song or as a count in units of bars and beats.
  • A unit sound is a sound extracted using one sounding of the second part as a unit.
  • the waveform of the unit tone is also referred to as the representative waveform of the second part.
  • Specifically, the audio analysis unit 120 separates the kick sound part from the music audio data 110, further divides the kick sound part into individual soundings, and extracts unit sounds by classifying the soundings based on the characteristics of their audio waveforms.
  • a plurality of unit sounds having different audio waveform characteristics may be extracted.
  • The kick unit sound data 132 may be, for example, audio data sampled from the kick sound part, temporal position information indicating where the unit sound is played in the kick sound part, audio data of a sample sound similar to the extracted sound, or an identifier of such a sample sound.
  • The display unit 140 displays information based on the kick unit sound data 132 or the kick sound generation data 133, for example on the display 101 of the PC 100.
  • the operation unit 160 obtains a user's operation input to an input device such as a touch panel or a mouse of the PC 100.
  • For example, the display unit 140 displays the audio waveform of the song (either a waveform based on the song audio data 110 or a waveform based on the kick sound removed audio data 131) together with the kick sounds associated with the waveform.
  • the operation unit 160 obtains an operation by the user to change the sound generation position of the kick sound to an arbitrary position within the song.
  • the display unit 140 may display the arrangement of kick sounds according to a preset rhythm pattern, and the operation unit 160 may obtain an operation by the user to select a rhythm pattern.
  • the position of the kick sound may be determined automatically without depending on the user's operation.
  • the display section 140 and the operation section 160 described above may not be included in the functions of the audio signal processing device.
  • the mix processing unit 150 generates mixed audio data 170 based on the kick sound removed audio data 131 and the kick unit sound data 132.
  • the mixed audio data 170 is audio data in which the kick sound removed audio data 131 is mixed with the rearranged kick unit sound.
  • the sound generation position of the Kick unit sound in the mixed audio data 170 is determined according to the user operation acquired by the operation unit 160 as described above, or according to the automatically determined rhythm pattern.
  • The sound generation positions of the kick unit sound in the mixed audio data 170 may include positions different from the sound generation positions of the kick sound in the original music audio data 110.
  • FIG. 3 is a flowchart showing the overall flow of processing by the audio analysis unit shown in FIG. 2.
  • the audio analysis unit 120 first detects kick sounds in units of sixteenth notes (step S110), and classifies the detected kick sounds (step S120).
  • the detection process in step S110 is executed using, for example, a technique described in International Publication No. 2017/168644.
  • the classification process in step S120 is executed, for example, by calculating a correlation function between the detected kick sound waveforms and clustering them.
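The correlation-then-cluster step of S120 can be sketched as follows (a minimal NumPy sketch, not code from the patent; the greedy single-pass clustering and the 0.7 threshold are illustrative assumptions):

```python
import numpy as np

def normalized_peak_correlation(a, b):
    """Peak of the normalized cross-correlation between two waveforms."""
    a = a - np.mean(a)
    b = b - np.mean(b)
    a = a / (np.linalg.norm(a) + 1e-12)
    b = b / (np.linalg.norm(b) + 1e-12)
    return float(np.max(np.correlate(a, b, mode="full")))

def cluster_kicks(waveforms, threshold=0.7):
    """Greedy clustering: a waveform joins the first cluster whose
    prototype correlates above `threshold`, else starts a new cluster."""
    clusters = []  # list of (prototype waveform, member indices)
    for i, w in enumerate(waveforms):
        for proto, members in clusters:
            if normalized_peak_correlation(proto, w) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((w, [i]))
    return [members for _, members in clusters]
```

Similar kick waveforms thus end up in the same class, and each class is then processed separately by the loop of step S130.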
  • The processing in steps S110 and S120 identifies the rough position (presence or absence) of the kick sound in sixteenth-note units and its classification, in preparation for identifying the sound generation position and representative waveform of the kick sound with higher accuracy.
  • Next, loop processing is executed for each kick sound classification identified in step S120 (step S130). Specifically, the sound generation position detection process in millisecond units (step S140), the sound generation position detection process in 10 microsecond units (step S150), the kick representative waveform generation process (step S160), and the kick sound removed sound generation process (step S170) are executed for each identified kick sound classification.
  • FIG. 4 is a diagram schematically showing the waveform configuration of the kick sound.
  • As shown, the waveform of the kick sound includes an attack portion (ATTACK) and a body rumble portion (SUSTAIN). Since the frequency band and duration differ between the attack portion and the body rumble portion, distinguishing between them makes it possible to detect the kick sound generation position with higher accuracy and to generate the kick representative waveform and the kick sound removed sound. Note that such a waveform configuration is seen not only in kick sounds but also in other percussion sounds, such as drum sounds including hi-hats and snares.
  • FIG. 5 is a flowchart showing the sound generation position detection process in millisecond units shown in FIG. 3.
  • First, the audio signal in the body rumble band (first frequency band) of the kick sound is extracted from the audio signal of the song by processing it with a 200 Hz low-pass filter (step S141).
  • Next, the band representative waveform generation process (step S142) and the sound generation position detection process (step S143) are executed on the audio signal in the body rumble band (first band audio signal). In step S143, the sound generation position of the kick sound is detected in millisecond units.
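The band extraction of steps S141/S151 might look like this (a sketch assuming SciPy; the Butterworth order and zero-phase filtering are illustrative choices not specified in the text):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def extract_body_band(x, fs, cutoff=200.0, order=4):
    """Low-pass filter extracting the body rumble band (first frequency
    band) of the kick; zero-phase filtering (filtfilt) avoids a phase
    shift that would bias later cross-correlation peak positions."""
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, x)

def extract_attack_band(x, fs, cutoff=3000.0, order=4):
    """High-pass filter extracting the attack band (second frequency
    band) of the kick."""
    b, a = butter(order, cutoff / (fs / 2), btype="high")
    return filtfilt(b, a, x)
```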
  • FIG. 6 is a flowchart showing the sound generation position detection process in units of 10 microseconds shown in FIG. 3.
  • Similarly, in the sound generation position detection process in units of 10 microseconds, the audio signal of the song is processed with a 3 kHz high-pass filter to extract the audio signal in the attack band (second frequency band) of the kick sound. The band representative waveform generation process (step S152) and the sound generation position detection process (step S153) are then executed on the attack band audio signal (second band audio signal), and in step S153 the sound generation position of the kick sound is detected in units of 10 microseconds.
  • FIG. 7 is a flowchart showing the band representative waveform generation process (steps S142 and S152) shown in FIGS. 5 and 6.
  • First, a temporary band representative waveform is extracted from the band audio signal (the audio signal in the body rumble band or attack band of the kick sound) according to a predetermined rule (step S210). For example, if one measure of the song consists of four beats, kick sounds on the second and fourth beats are excluded because they are likely to sound at the same time as the snare, and kick sounds on the first beat are excluded because they are likely to sound at the same time as a cymbal.
  • In this case, the waveform of the kick sound on the third beat is extracted as the temporary band representative waveform. If there are multiple kick sounds on the third beat, a histogram of the kick sound levels per bar may be created and a third-beat kick sound belonging to the most frequent class may be selected. If there is no kick sound on the third beat, the temporary band representative waveform may be extracted from a kick sound on the first beat, since the cymbal sound has relatively little frequency band overlap with the kick sound.
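The beat-based selection rule could be sketched as follows (illustrative only; the `levels_by_beat` mapping, the bin count, and the tie-breaking are assumptions not taken from the text):

```python
import numpy as np

def pick_tentative_kick(levels_by_beat):
    """Choose which kick sounding to use as the tentative representative
    waveform: prefer beat 3 (least overlap with the snare on beats 2/4
    and the cymbal on beat 1); among beat-3 kicks, take one from the
    most populated class of a coarse level histogram; fall back to
    beat 1. `levels_by_beat` maps beat number -> [(bar_index, level)]."""
    candidates = levels_by_beat.get(3) or levels_by_beat.get(1) or []
    if not candidates:
        return None
    levels = np.array([lv for _, lv in candidates])
    counts, edges = np.histogram(levels, bins=5)
    best = int(np.argmax(counts))
    lo, hi = edges[best], edges[best + 1]
    for bar, lv in candidates:          # first kick in the modal class
        if lo <= lv <= hi:
            return bar
    return candidates[0][0]
```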
  • As described above, the band representative waveform generation process is executed in both the sound generation position detection process in millisecond units (step S142) and the one in 10 microsecond units (step S152); since each operates on the audio signal of a different band, the temporary band representative waveforms of the two processes differ even though their positions are determined in step S210 according to a common rule.
  • Next, loop processing is performed for each kick sound to be processed (step S220). Specifically, for each kick sound, a cross-correlation function with the temporary band representative waveform is calculated within a predetermined search range (step S230), and the waveforms of the band audio signal in the sections where peaks of the cross-correlation function appear are weighted and synchronously added (step S240) to generate the band representative waveform of the kick sound.
  • FIG. 8 is a diagram for conceptually explaining the cross-correlation function calculation process shown in FIG. 7.
  • In step S230, a section of a length corresponding to the temporary band representative waveform Stemp is extracted from the band audio signal at the provisionally placed position, and the cross-correlation function φ1(τ) between the two waveforms is calculated as a function of the lag τ, where τ is the amount of deviation caused by the provisional placement. The amount of deviation can be estimated from the position of the peak of the obtained cross-correlation function φ1(τ). In other words, the sound generation section of the kick sound in the band audio signal can be specified.
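In discrete-time terms, this step slides the temporary representative waveform over the band signal around the provisional position and takes the lag of the correlation maximum. A minimal NumPy sketch (function and parameter names are illustrative):

```python
import numpy as np

def estimate_onset(band_signal, template, approx_pos, search_radius):
    """Cross-correlate the (tentative) representative waveform against
    the band signal around a provisionally placed position; the lag of
    the correlation peak gives the corrected onset sample index."""
    start = max(approx_pos - search_radius, 0)
    stop = min(approx_pos + search_radius + len(template), len(band_signal))
    segment = band_signal[start:stop]
    # mode="valid": one correlation value per candidate template alignment
    corr = np.correlate(segment, template, mode="valid")
    return start + int(np.argmax(corr))
```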
  • FIG. 9 is a diagram for conceptually explaining the weighted synchronous addition process shown in FIG. 7.
  • the process of specifying the sound generation section of the kick sound in the band audio signal is executed for all the kick sounds to be processed included in the band audio signal S0 .
  • a band representative waveform Sp is generated by synchronously adding the waveforms of the band audio signal S0 for each kick sound generation section.
  • Specifically, the waveforms of the sounding sections of the kick sounds are weighted with weighting coefficients W1, W2, W3, … set according to predetermined rules, and synchronously added.
  • Synchronous addition is a method of adding and averaging waveforms aligned at the same time points so that components uncorrelated with the target signal cancel in phase, yielding a waveform close to that of the target signal. However, the desired effect is difficult to obtain unless the waveforms are precisely aligned in time.
  • In this embodiment, the sound generation section of each kick sound is specified in the preceding step S230 as the section where the peak of the cross-correlation function with the temporary band representative waveform appears, so a band representative waveform Sp with reduced noise can be obtained.
  • The weighting coefficients W1, W2, W3, … in the synchronous addition are set depending on, for example, the position of the sound generation section of each kick sound in the song. For example, as in the generation of the temporary band representative waveform above, the coefficients may be set so that kick sounds likely to sound at the same time as other drum sounds are given lower weights, and kick sounds unlikely to do so are given higher weights. Specifically, kick sounds on the second and fourth beats are given the lowest weight because they are likely to sound at the same time as the snare, and kick sounds on the first beat are given the next lowest weight because they are likely to sound at the same time as a cymbal. In this case, the waveforms of the sound generation sections of the kick sounds are weighted according to the beat position and synchronously added.
  • Alternatively, kick sounds located on the backbeat may be given a larger weight regardless of the beat number.
  • The weight ratios of the kick sounds set from this point of view may be, for example, 0.8/0.5/1.0/0.5 for the first/second/third/fourth beats on the downbeat, and 1.0 for the backbeat regardless of the beat number.
  • In this case, the waveforms of the sound generation sections of the kick sounds are weighted and synchronously added according to their classification as downbeat or backbeat.
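The weighted synchronous addition (step S240) can be sketched as follows (NumPy; the function shape is an assumption):

```python
import numpy as np

def weighted_synchronous_add(band_signal, onsets, length, weights):
    """Weighted synchronous addition: average the sections starting at
    each detected onset, down-weighting onsets (e.g. beats 2/4) where
    other drums likely overlap. Phase-locked kick content survives;
    uncorrelated residue is averaged down."""
    acc = np.zeros(length)
    total = 0.0
    for pos, w in zip(onsets, weights):
        acc += w * band_signal[pos:pos + length]
        total += w
    return acc / total
```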
  • FIG. 10 is a flowchart showing the sound generation position detection process (steps S143 and S153) shown in FIGS. 5 and 6.
  • First, a loop process is executed for each kick sound to be processed (step S310). Specifically, the cross-correlation function between the band representative waveform and the waveform of the band audio signal is calculated within a predetermined search range (step S320), and the section where a peak of the cross-correlation function appears is detected as the sound generation position of each kick sound (step S330).
  • FIG. 11 is a diagram for conceptually explaining the sound generation position detection process shown in FIG. 10.
  • In step S320, a section S2 of a length corresponding to the band representative waveform Sp is extracted from the band audio signal S0 of the song, the respective waveforms are expressed as functions fp(t) and f2(t) of time t, and their cross-correlation function φ2(τ) is calculated, where τ is the amount of deviation caused by the provisional placement. The amount of deviation can be estimated from the peak position of the obtained cross-correlation function φ2(τ). That is, the sound generation positions P1, P2, P3, … of the kick sounds can be specified.
  • In the band representative waveform generation process described above, the kick sound generation section is specified based on the cross-correlation function φ1(τ) with the temporary band representative waveform Stemp.
  • In contrast, in the sound generation position detection process, the sound generation position of the kick sound is detected based on the cross-correlation function φ2(τ) with the band representative waveform Sp.
  • The band representative waveform Sp is a waveform in which noise such as instrument sounds other than the kick is reduced by synchronously adding the waveforms of the sound generation sections of the kick sounds, so it represents the characteristics of the original kick sound waveform more accurately.
  • In other words, the cross-correlation function φ1(τ) with the rule-based temporary band representative waveform Stemp is used to specify a reasonably accurate sound generation section of the kick sound, and the cross-correlation function φ2(τ) with the band representative waveform Sp, obtained by synchronously adding the waveforms of these sound generation sections, is used to specify a more accurate kick sound generation position.
  • In the sound generation position detection process in millisecond units, since the sound generation position of the kick sound in sixteenth-note units has already been detected in the preceding process (step S110 shown in FIG. 3), the cross-correlation function is calculated with the search range (first search range) set within one sixteenth note (approximately 100 milliseconds).
  • As explained with reference to FIG. 4, the waveform of the body rumble band of the kick sound does not show a steep peak but has a certain duration; therefore, although the accuracy is lower than in the attack band described next, the calculation load remains small even when the search range is wide.
  • On the other hand, in the sound generation position detection process in 10 microsecond units, the search range (second search range) is set to several milliseconds before and after the sound generation position detected in millisecond units, and the cross-correlation function is calculated within it. As explained with reference to FIG. 4, the waveform of the attack band of the kick sound shows a steep peak but has a short duration, so the accuracy is high, but a wide search range would increase the calculation load.
  • In this embodiment, the sound generation position detection process in millisecond units is performed first and the search range for the attack band is thereby narrowed down to a few milliseconds, making it possible to detect the sound generation position with high precision while suppressing the calculation load.
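The coarse-to-fine strategy described above might be sketched like this (illustrative NumPy code; the helper names and parameters are assumptions):

```python
import numpy as np

def xcorr_onset(signal, template, center, radius):
    """Onset = lag of the cross-correlation peak within +/- radius."""
    start = max(center - radius, 0)
    seg = signal[start:start + 2 * radius + len(template)]
    return start + int(np.argmax(np.correlate(seg, template, mode="valid")))

def coarse_to_fine_onset(low_band, low_tpl, high_band, high_tpl,
                         rough_pos, coarse_radius, fine_radius):
    """First localise the kick on the body rumble band over a wide range
    (cheap, coarse), then refine on the sharp attack band over a narrow
    range around that coarse result (precise, low load)."""
    coarse = xcorr_onset(low_band, low_tpl, rough_pos, coarse_radius)
    return xcorr_onset(high_band, high_tpl, coarse, fine_radius)
```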
  • FIG. 12 is a flowchart showing the Kick representative waveform generation process shown in FIG. 3.
  • In the kick representative waveform generation process, weighted synchronous addition is performed on the waveforms of the kick sound sections of the audio signal of the song, based on the sound generation positions in 10 microsecond units detected in the 10 microsecond unit sound generation position detection process (step S150 shown in FIG. 3) (step S161).
  • The weighted synchronous addition is, for example, a weighted synchronous addition process similar to step S240 of the band representative waveform generation process shown in FIG. 7.
  • However, while the waveforms of the band audio signals are synchronously added when generating the band representative waveforms (step S142 in the sound generation position detection process in millisecond units and step S152 in the process in 10 microsecond units), in step S161 of the kick representative waveform generation process the waveforms of the full-band audio signal of the song are synchronously added.
  • Next, a different fade-out curve is applied for each band to the waveform obtained by the weighted synchronous addition in step S161 (step S162). Even after synchronous addition, the waveform does not consist entirely of the kick sound; components of other parts such as the snare and vocals remain, and the fade-out further removes them. As explained with reference to FIG. 4, the frequency of the kick sound is high in the attack portion (ATTACK) at the beginning of the sound and low in the body rumble portion (SUSTAIN) that occupies most of the period thereafter.
  • the kick sound waveform obtained by weighted synchronous addition is divided by three band filters, and a different fade-out curve is applied to each band.
  • Specifically, a short fade-out curve for the attack portion (for example, 20 milliseconds long) is applied to the signal that has passed a high-pass filter of 640 Hz and above; a medium-length fade-out curve for the beginning of the body rumble (for example, 60 milliseconds long) is applied to the signal that has passed a band-pass filter of 200 Hz to 640 Hz; and a long fade-out curve for the entire body rumble portion (for example, 100 milliseconds to 500 milliseconds long) is applied to the signal that has passed a low-pass filter below 200 Hz. Note that the length of the fade-out curve for the entire body rumble portion may be adjusted according to the amplitude envelope of the waveform after the synchronous addition.
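The three-band fade-out of step S162 could look like the following sketch (SciPy Butterworth filters and a linear fade are illustrative choices; the text does not specify the filter type or fade shape):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def band_split_fadeout(kick, fs,
                       fades=((640.0, None, 0.020),    # high-pass >= 640 Hz: short attack fade
                              (200.0, 640.0, 0.060),   # band-pass 200-640 Hz: medium fade
                              (None, 200.0, 0.300))):  # low-pass < 200 Hz: long rumble fade
    """Split the synchronously added kick waveform into three bands and
    apply a fade-out of a different length to each band, suppressing
    residual snare/vocal components that outlast the kick."""
    out = np.zeros_like(kick, dtype=float)
    for lo, hi, fade_s in fades:
        if lo is None:                       # low band
            b, a = butter(4, hi / (fs / 2), btype="low")
        elif hi is None:                     # high band
            b, a = butter(4, lo / (fs / 2), btype="high")
        else:                                # middle band
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, kick)
        n = min(int(fade_s * fs), len(band))
        env = np.zeros(len(band))
        env[:n] = np.linspace(1.0, 0.0, n)   # linear fade to silence, zero after
        out += band * env
    return out
```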
  • FIG. 14 is a flowchart showing the kick sound removed sound generation process shown in FIG. 3.
  • In the kick sound removed sound generation process, the kick representative waveform generated in the process shown in FIGS. 12 and 13 is rearranged based on the sound generation positions in 10 microsecond units detected in the 10 microsecond unit sound generation position detection process (step S150 shown in FIG. 3) (step S171). As a result, an audio signal in which only the kick sound included in the song is extracted is obtained.
  • By adding the anti-phase signal of this audio signal (the kick sound rearranged audio signal) to the audio signal of the song, a kick sound removed audio signal, in which only the kick sound has been removed from the audio signal of the song, can be generated.
  • FIG. 15 is a diagram for conceptually explaining the process of adding the anti-phase signals shown in FIG. 14.
  • As shown, the anti-phase signal SKick_Rev of the rearranged kick sound SKick is added to the audio signal S of the song. If the rearranged kick sound SKick matches the kick sound component included in the audio signal S of the song, the kick sound is canceled by the addition of the anti-phase signal SKick_Rev, and an audio signal containing only the sounds of the song other than the kick sound is obtained. To accurately remove only the kick sound component, the sound generation position and waveform of the kick sound must be specified accurately.
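The cancellation by anti-phase addition can be sketched as follows (NumPy; the function name and signature are assumptions):

```python
import numpy as np

def remove_kick(song, kick_waveform, onsets):
    """Rebuild the kick track by placing the representative waveform at
    every detected onset, then add its phase-inverted (anti-phase)
    signal to the song; where placement and waveform match the song's
    kick component, the kick cancels and the other parts remain."""
    kick_track = np.zeros_like(song)
    for pos in onsets:
        end = min(pos + len(kick_waveform), len(song))
        kick_track[pos:end] += kick_waveform[:end - pos]
    return song - kick_track  # equivalent to adding the anti-phase kick track
```

The quality of the result depends entirely on how precisely the onsets and the representative waveform were estimated in the preceding steps.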
  • In this embodiment, a highly accurate sounding position is detected by the sounding position detection process in 10-microsecond units (step S150 shown in FIG. 3), and an accurate kick representative waveform is obtained through the synchronous addition process using this position (step S160 shown in FIG. 3). As a result, only the kick sound component is removed accurately, and high-quality kick-sound-removed audio can be generated with no sound quality deterioration in the sounds other than the kick sound.
  • As described above, the sounding position of the kick sound included in the audio signal of the song is specified as the peak position of the cross-correlation function with the band representative waveform. By accurately specifying the sounding position of the kick sound in this way, the kick sound can be separated from the other sounds on the time axis, achieving both high sound quality and high separation performance.
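A minimal sketch of this peak-picking step, under illustrative assumptions: a toy sample rate, a synthetic template, and `np.correlate` standing in for whatever correlation routine the device actually uses.

```python
import numpy as np

def detect_onset(song, template):
    """Slide the representative waveform over the song and return the sample
    offset where the cross-correlation peaks -- taken as the sounding position."""
    corr = np.correlate(song, template, mode="valid")
    return int(np.argmax(corr))

sr = 1000                 # toy sample rate (assumption)
n = 64
template = np.hanning(n) * np.sin(2 * np.pi * 40 * np.arange(n) / sr)

song = 0.01 * np.random.default_rng(1).standard_normal(2000)  # faint background
song[300:300 + n] += template                                 # kick placed at sample 300
pos = detect_onset(song, template)                            # recovered sounding position
```

Because the template correlates strongly only where the kick actually occurs, the peak of `corr` lands on the true onset even in the presence of background content.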
  • In the embodiment described above, the audio analysis unit 120 extracts the band representative waveform from the band audio signal, but this is an example of the process of extracting the representative waveform of the second part from the audio signal of the song.
  • Alternatively, the representative waveform of the second part may be extracted from the audio signal of the song without first extracting a specific frequency band, and the sounding position of the second part may be detected based on the cross-correlation function between that representative waveform and sections of the audio signal of the song.
  • Likewise, the process in which the audio analysis unit 120 generates the band representative waveform by synchronously adding waveforms of the band audio signal is an example of the process of generating the representative waveform of the second part by synchronously adding the audio signal of the song.
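The synchronous-addition step can be sketched as follows. This is a toy example: the segment length, onset list, and sign-flipping interference are contrived so that the non-kick content cancels exactly in the average, which is the effect synchronous addition relies on.

```python
import numpy as np

def representative_waveform(signal, onsets, length):
    """Synchronously add (here: average) fixed-length segments that start at
    each detected sounding position; components aligned at every onset
    reinforce, while everything else averages out."""
    segments = np.stack([signal[p:p + length] for p in onsets])
    return segments.mean(axis=0)

length = 32
kick = np.hanning(length)                 # toy "kick" unit waveform
interference = np.sin(np.arange(length))  # other sounds; sign flips per hit

signal = np.zeros(4 * 128)
onsets = [0, 128, 256, 384]
for i, p in enumerate(onsets):
    signal[p:p + length] += kick + ((-1) ** i) * interference

rep = representative_waveform(signal, onsets, length)
# the interference cancels in the synchronous sum, leaving the kick waveform
```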
  • In the above example, the first part of the song is the part other than the kick sound, and the second part is the kick sound part; however, there is no limitation on how the vocals and/or instrumental sounds are separated and assigned to the first part and the second part.
  • the second part may be any part from which a representative waveform can be extracted; for example, it may be a hi-hat or snare part, or a percussion instrument sound part such as a drum sound with a hi-hat or snare added to a kick sound.
  • When the second part is a drum sound part, the kick unit sound and the hi-hat and snare unit sounds may each be rearranged separately.
  • In the above example, the detection result of the kick sounding position by the audio analysis unit 120 and the extracted kick unit sound are used to change the kick sounding positions included in the original music audio data 110; however, the mixed audio data 170 does not necessarily have to be generated.
  • the Kick unit sound data 132 may be extracted alone and used as a sample sound source for performance.
  • Alternatively, only the kick-sound-removed audio data 131 may be output without the kick unit sound data, or, based on the kick-sound-removed audio data 131 and the kick pronunciation data 133, the kick sound of a song may be replaced with the kick sound of a different song or of a sample sound source.
  • The same applies when the second part mentioned above is a sound other than the kick sound.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The present invention relates to an audio signal processing device comprising an audio analysis unit that, for a musical piece including a first part and a second part that are acoustically separable, detects a sounding position of the second part. The audio analysis unit computes a cross-correlation function in which the representative waveform of the second part, extracted from the audio signal of the musical piece, and the waveform of the audio signal of the musical piece over an interval of the same duration as that representative waveform are each treated as functions of time, and detects the interval in which the peak of the cross-correlation function appears as the sounding position.
PCT/JP2022/030730 2022-08-12 2022-08-12 Audio signal processing device, audio signal processing method, and program WO2024034115A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/030730 WO2024034115A1 (fr) 2022-08-12 2022-08-12 Audio signal processing device, audio signal processing method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/030730 WO2024034115A1 (fr) 2022-08-12 2022-08-12 Audio signal processing device, audio signal processing method, and program

Publications (1)

Publication Number Publication Date
WO2024034115A1 true WO2024034115A1 (fr) 2024-02-15

Family

ID=89851258

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/030730 WO2024034115A1 (fr) Audio signal processing device, audio signal processing method, and program

Country Status (1)

Country Link
WO (1) WO2024034115A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003022100A (ja) * 2001-07-09 2003-01-24 Yamaha Corp Noise removal method, noise removal device, and program
JP2013076887A (ja) * 2011-09-30 2013-04-25 Brother Ind Ltd Information processing system and program
WO2019053766A1 (fr) * 2017-09-12 2019-03-21 Pioneer DJ株式会社 Song analysis device and program

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003022100A (ja) * 2001-07-09 2003-01-24 Yamaha Corp Noise removal method, noise removal device, and program
JP2013076887A (ja) * 2011-09-30 2013-04-25 Brother Ind Ltd Information processing system and program
WO2019053766A1 (fr) * 2017-09-12 2019-03-21 Pioneer DJ株式会社 Song analysis device and program

Similar Documents

Publication Publication Date Title
JP4823804B2 (ja) Chord name detection device and chord name detection program
JP4767691B2 (ja) Tempo detection device, chord name detection device, and program
US8889976B2 Musical score position estimating device, musical score position estimating method, and musical score position estimating robot
US6798886B1 Method of signal shredding
US7563975B2 Music production system
JP4645241B2 (ja) Audio processing device and program
WO2007010637A1 (fr) Beat detector, chord name detector, and program
JP2008040284A (ja) Tempo detection device and tempo detection computer program
JP2008275975A (ja) Rhythm detection device and rhythm detection computer program
JP5229998B2 (ja) Chord name detection device and chord name detection program
JP3996565B2 (ja) Karaoke device
JP4204941B2 (ja) Karaoke device
JP6657713B2 (ja) Acoustic processing device and acoustic processing method
WO2024034115A1 (fr) Audio signal processing device, audio signal processing method, and program
WO2024034118A1 (fr) Audio signal processing device and method, and associated program
JP6263382B2 (ja) Audio signal processing device, control method for audio signal processing device, and program
JP6263383B2 (ja) Audio signal processing device, control method for audio signal processing device, and program
JP5005445B2 (ja) Chord name detection device and chord name detection program
JP4932614B2 (ja) Chord name detection device and chord name detection program
JP4483561B2 (ja) Acoustic signal analysis device, acoustic signal analysis method, and acoustic signal analysis program
JP2005107332A (ja) Karaoke device
Stöter et al. Unison Source Separation.
WO2024034117A1 (fr) Audio data processing device, audio data processing method, and program
WO2024034116A1 (fr) Audio data processing device, audio data processing method, and program
JP4159961B2 (ja) Karaoke device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22955030

Country of ref document: EP

Kind code of ref document: A1