JP6409417B2

JP6409417B2 - Sound processor

Info

Publication number: JP6409417B2
Application number: JP2014175157A
Authority: JP
Inventors: ジェイナージョルディ; ゴルロウスタニスロウ; 慶太有元
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2014-08-29
Filing date: 2014-08-29
Publication date: 2018-10-24
Anticipated expiration: 2034-08-29
Also published as: JP2016050995A

Description

The present invention relates to a technique for processing an acoustic signal.

Various techniques for changing the pitch of sounds such as voices and musical tones have been proposed. For example, Patent Document 1 discloses a technique for changing the pitch of a voice uttered by a user. Patent Document 2 discloses a configuration that performs voice quality conversion by decomposing a singing voice into a harmonic component and a non-harmonic component.

JP 2005-025234 A; JP 2000-010600 A

One conceivable configuration for generating a sound that has a desired pitch (hereinafter, “target pitch”) and a timbre equivalent to a specific sound (hereinafter, “target sound”) in a recorded acoustic signal is to change the pitch of the target sound extracted from the acoustic signal to the target pitch and then bring the timbre of the result closer to that of the target sound (pitch shift → morphing). However, when the acoustic signal contains a plurality of acoustic components including the target sound, it is difficult to extract only the target sound with high accuracy, and acoustic components other than the target sound inevitably accompany it. In that situation, the accompanying components become conspicuous as a result of the pitch change, and more conspicuous still as a result of the timbre conversion. In view of these circumstances, an object of the present invention is to suppress the degradation of sound quality that occurs when the pitch of a specific sound in an acoustic signal is changed.

To solve the above problem, the sound processing apparatus of the present invention comprises: reference sound acquisition means for acquiring a first reference signal representing a first reference sound whose timbre differs from that of the target sound and whose pitch is equivalent to that of the target sound, and a second reference signal representing a second reference sound whose pitch is a target pitch different from the pitch of the target sound and whose timbre is equivalent to that of the first reference sound; analysis processing means for generating, from a target signal representing the target sound and the first reference signal, a conversion filter for bringing the first reference sound closer to the timbre of the target sound; and sound processing means for applying the conversion filter to the second reference signal to generate a converted signal representing a sound at the target pitch with a timbre approximating that of the target sound. In this aspect, the conversion filter for bringing the first reference sound, whose pitch is equivalent to that of the target sound, closer to the timbre of the target sound is generated from the target signal and the first reference signal, and the converted signal is generated by applying that filter to the second reference signal, which represents the second reference sound at the target pitch. In other words, the pitch of the target sound does not, in principle, need to be changed. This has the advantage of preventing the degradation of sound quality caused by changing the pitch of the target sound.
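
The data flow of this aspect can be illustrated with a minimal sketch. The form of the conversion filter is not specified in this part of the description, so the sketch assumes, purely for illustration, a per-frequency gain obtained as the ratio of crudely smoothed magnitude spectra. The function names `envelope`, `make_conversion_filter`, and `apply_conversion_filter` and the smoothing width are hypothetical.

```python
import numpy as np

def envelope(mag, width=9):
    """Crude spectral envelope: moving average of a magnitude spectrum (assumption)."""
    kernel = np.ones(width) / width
    return np.convolve(mag, kernel, mode="same")

def make_conversion_filter(target_mag, ref1_mag, eps=1e-9):
    """Gain per frequency bin that moves the first reference sound toward the target timbre.

    target_mag: magnitude spectrum of the target signal (same pitch as ref1_mag).
    ref1_mag:   magnitude spectrum of the first reference signal.
    """
    return envelope(target_mag) / (envelope(ref1_mag) + eps)

def apply_conversion_filter(ref2_mag, gain):
    """Apply the gain to the second reference signal, which is already at the target pitch."""
    return ref2_mag * gain
```

Note that the target signal itself is never pitch-shifted here: only the filter derived from it is carried over to the second reference signal.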

In a preferred aspect of the present invention, the reference sound acquisition means adjusts the pitch of one of the target signal and the first reference signal to the pitch of the other. In this aspect, the conversion filter is generated after the target signal and the first reference signal have been adjusted to an equivalent pitch. Compared with generating the conversion filter while the two signals differ in pitch, this makes it possible to generate a conversion filter that converts the reference sound to the timbre of the target sound with high accuracy.

For example, in a configuration that includes component extraction means for generating the target signal by suppressing sounds other than the target sound in the acoustic signal, residual components other than the target sound may accompany the target signal. In a configuration that changes the target signal to the pitch of the first reference signal, those residual components may become conspicuous as a result of the pitch change. A configuration in which the reference sound acquisition means adjusts the first reference signal to a pitch equivalent to that of the target signal is therefore preferable.

The configuration of the present invention is well suited to changing the pitch of a specific sound in an acoustic signal. Specifically, the forms described above can be used for the timbre conversion means of a sound processing apparatus comprising: pitch analysis means for analyzing a time series of pitches in an acoustic signal; instruction receiving means for receiving from a user an instruction specifying, within the time series of pitches analyzed by the pitch analysis means, a target sound whose pitch is to be changed and the target pitch after the change; reference sound acquisition means for acquiring a reference signal representing a reference sound generated by an external sound source; timbre conversion means for generating a converted signal at the target pitch in which the reference sound of the reference signal acquired by the reference sound acquisition means is brought closer to the timbre of the target sound; and mixing processing means for mixing a separated signal generated by component extraction means with the converted signal generated by the timbre conversion means.

The sound processing apparatus according to each of the above aspects can be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to processing acoustic signals, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program of the present invention can be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium. An optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, may be used. The program of the present invention can also be provided by distribution over a communication network and installed on a computer.

FIG. 1 is a configuration diagram of a sound processing apparatus according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of the process that generates the pitch sequence (non-negative matrix factorization).
FIG. 3 is a schematic diagram of a pitch transition image.
FIG. 4 is a flowchart of the sound editing process.
FIG. 5 is a configuration diagram of the acoustic processing unit.
FIG. 6 is a flowchart of the timbre conversion process.
FIG. 7 is an explanatory diagram of the timbre conversion process.
FIG. 8 is an explanatory diagram of the sounding range in a second embodiment.
FIG. 9 is a flowchart of the timbre conversion process in a third embodiment.

<First Embodiment>
FIG. 1 is a configuration diagram of a sound processing apparatus 100 according to the first embodiment of the present invention. As illustrated in FIG. 1, the sound processing apparatus 100 is realized by a computer system that includes an arithmetic processing device 10, a storage device 12, a display device 14, an input device 16, a signal supply device 22, a sound source device 24, and a sound emitting device 26. For example, a portable information processing device such as a mobile phone or a smartphone, or a portable or stationary information processing device such as a personal computer, can be used as the sound processing apparatus 100.

The signal supply device 22 outputs an acoustic signal X representing the time waveform of a sound. The acoustic signal X of the first embodiment is a signal recorded in an acoustic space with its own acoustic characteristics, such as a live music club or a concert hall, and represents the waveform of a mixture of the singing sound of a piece of music and the performance sound of a musical instrument (hereinafter, “target instrument”). An acoustic signal X that also contains performance sounds of instruments other than the target instrument can be processed as well. A playback device that reads the acoustic signal X from a portable or built-in recording medium and outputs it, or a communication device that receives the acoustic signal X from a communication network and outputs it, can be used as the signal supply device 22. The sound processing apparatus 100 of the first embodiment is a signal processing apparatus that generates an acoustic signal Z by changing a specific portion of the performance sound of the target instrument in the acoustic signal X output by the signal supply device 22 (for example, a portion where the performer made a mistake).

The display device 14 (for example, a liquid crystal display panel) displays images as instructed by the arithmetic processing device 10. The input device 16 is an operating device that the user operates to give various instructions to the sound processing apparatus 100, and includes, for example, a plurality of controls operated by the user. A touch panel integrated with the display device 14 can also be used as the input device 16. The sound emitting device 26 (for example, a speaker or headphones) emits sound corresponding to the acoustic signal Z generated by the arithmetic processing device 10.

The sound source device 24 is an external sound source that generates an acoustic signal R (hereinafter, “reference signal”) representing the performance sound of the target instrument. The sound source device 24 of the first embodiment can generate a reference signal R at any pitch. Any known sound source, such as a PCM (Pulse Code Modulation) sound source, can be used as the sound source device 24. The function of the sound source device 24 can also be realized by the arithmetic processing device 10 executing a program stored in the storage device 12.

The storage device 12 stores the program executed by the arithmetic processing device 10 and various data used by the arithmetic processing device 10. Any known recording medium, such as a semiconductor recording medium or a magnetic recording medium, or a combination of several types of recording media, can be used as the storage device 12. By executing the program stored in the storage device 12, the arithmetic processing device 10 realizes a plurality of functions for generating the acoustic signal Z from the acoustic signal X (a sound source separation unit 32, a pitch analysis unit 34, a display control unit 36, an instruction receiving unit 38, a component extraction unit 40, an acoustic processing unit 42, and a mixing processing unit 44). A configuration in which the functions of the arithmetic processing device 10 are distributed over a plurality of devices, or in which a dedicated electronic circuit realizes some of those functions, may also be adopted.

The sound source separation unit 32 generates an acoustic signal XA and an acoustic signal XB from the acoustic signal X output by the signal supply device 22. The acoustic signal XA is a signal in which the singing sound in the acoustic signal X is emphasized (ideally, a signal from which the performance sound of the target instrument has been removed), and the acoustic signal XB is a signal in which the performance sound of the target instrument in the acoustic signal X is emphasized (ideally, a signal from which the singing sound has been removed). Any known technique can be used to generate the acoustic signals XA and XB. For example, a sound source separation process that separates the singing sound from the performance sound by exploiting the difference in the positions at which their sound images are localized is well suited to this purpose.

The pitch analysis unit 34 analyzes a time series S of pitches (hereinafter, “pitch sequence”) in the acoustic signal XB produced by the sound source separation unit 32. The pitch sequence S can also be described as the temporal transition of the pitch of the performance sound of the target instrument. The pitch analysis unit 34 of the first embodiment generates the pitch sequence S by non-negative matrix factorization (NMF) of the acoustic signal XB.

FIG. 2 is an explanatory diagram of the non-negative matrix factorization in the first embodiment. As illustrated in FIG. 2, the pitch analysis unit 34 decomposes an observation matrix W representing the acoustic signal XB into a basis matrix B and a coefficient matrix G. The observation matrix W is a non-negative matrix of M rows and N columns in which the intensity spectra of N frames, obtained by dividing the acoustic signal XB on the time axis, are arranged in time order. The intensity spectrum of any one frame is a series of intensities (amplitude or power) at each of M frequencies on the frequency axis. As this description shows, the observation matrix W represents a spectrogram of the acoustic signal XB.
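
As a rough illustration of how such an observation matrix W might be built, the sketch below frames a signal, applies a window, and stacks the amplitude spectra as columns. The frame length, hop size, Hann window, and the function name `observation_matrix` are assumptions made for this example, not values given in the description.

```python
import numpy as np

def observation_matrix(x, frame_len=1024, hop=512):
    """Return an M x N non-negative matrix whose columns are per-frame amplitude spectra.

    Assumes len(x) >= frame_len. M = frame_len // 2 + 1 frequency bins, N frames.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Each column is one windowed frame of the time-domain signal.
    frames = np.stack(
        [x[n * hop : n * hop + frame_len] * window for n in range(n_frames)],
        axis=1,
    )
    # Amplitude spectrum of every frame, taken along the frequency axis.
    return np.abs(np.fft.rfft(frames, axis=0))
```

Squaring the result would give power instead of amplitude; the description allows either as the intensity.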

The basis matrix B represents the acoustic characteristics of the performance sound of the target instrument. As illustrated in FIG. 2, the basis matrix B of the first embodiment is a non-negative matrix of M rows and K columns in which K basis vectors b[1] to b[K], corresponding to performance sounds of the target instrument at different pitches, are arranged horizontally. Any one basis vector b[k] (k = 1 to K) corresponds to the intensity spectrum of the performance sound at the k-th of the K pitches that the target instrument can produce (for example, the 88 notes of a piano), and is a series of intensities at each of the M frequencies on the frequency axis. The basis matrix B is generated by analyzing performance sounds of the target instrument and is stored in the storage device 12 in advance. The pitch analysis unit 34 of the first embodiment generates the coefficient matrix G by supervised non-negative matrix factorization (supervised NMF) of the acoustic signal XB, using the basis matrix B stored in the storage device 12 as teacher information (prior information).

As illustrated in FIG. 2, the coefficient matrix G is a non-negative matrix of K rows and N columns in which K coefficient vectors g[1] to g[K], corresponding to the different basis vectors b[k] of the basis matrix B, are arranged vertically. The coefficient vector g[k] in the k-th row of the coefficient matrix G consists of N coefficients a[k,1] to a[k,N] corresponding to different frames on the time axis. Any one coefficient a[k,n] (n = 1 to N) of the coefficient vector g[k] is a weight applied to the basis vector b[k] of the basis matrix B. Specifically, the N coefficients a[k,1] to a[k,N] of the coefficient vector g[k] correspond to the time series of the intensity (activation) of the acoustic component at the k-th pitch, the one corresponding to the basis vector b[k], among the K pitches of the target instrument. That is, in an n-th frame for which the coefficient a[k,n] is large, the acoustic component of the target instrument at the k-th pitch is dominant. In view of this tendency, the pitch analysis unit 34 of the first embodiment calculates the coefficient matrix G as the pitch sequence S. Specifically, the pitch analysis unit 34 updates the coefficient matrix G iteratively, repeating an operation that updates G so that the matrix product of the basis matrix B and the coefficient matrix G approaches the observation matrix W, and fixes as the pitch sequence S the coefficient matrix G at the point when a predetermined convergence condition is satisfied (for example, when the number of update operations reaches a predetermined value). Each coefficient a[k,n] of the coefficient matrix G used in the first iteration (the initial value) is set to, for example, a random number.
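
The iterative update can be sketched as follows. The description does not state the cost function or the update rule, so this example assumes the widely used multiplicative update for the Euclidean distance, with the basis matrix B held fixed and only G updated. The function name `supervised_nmf` and its parameters are hypothetical.

```python
import numpy as np

def supervised_nmf(W, B, n_iter=200, eps=1e-12, seed=0):
    """Estimate a K x N coefficient matrix G such that B @ G approximates W, with B fixed.

    W: M x N observation matrix (non-negative).
    B: M x K basis matrix prepared in advance (non-negative).
    Assumption: multiplicative updates minimizing the squared Euclidean error.
    """
    rng = np.random.default_rng(seed)
    # Random non-negative initial values, as the description suggests.
    G = rng.random((B.shape[1], W.shape[1])) + eps
    for _ in range(n_iter):  # convergence condition: fixed number of updates
        G *= (B.T @ W) / (B.T @ B @ G + eps)
    return G
```

Because every factor in the update is non-negative, G stays non-negative throughout, and a large entry G[k, n] indicates that the k-th pitch is active in frame n.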

The display control unit 36 in FIG. 1 causes the display device 14 to display the pitch transition image 142 of FIG. 3, which represents the pitch sequence S analyzed by the pitch analysis unit 34. As illustrated in FIG. 3, the pitch transition image 142 is a piano-roll image in which the pitch sequence S is drawn on a coordinate plane with a time axis (horizontal) and a pitch axis (vertical). Each point on the time axis corresponds to one of the N frames, and each point on the pitch axis corresponds to one of the K pitches. The point corresponding to the n-th frame on the time axis and the k-th pitch on the pitch axis is displayed in a manner (for example, gradation or color) that depends on the magnitude of the coefficient a[k,n] of the pitch sequence S (coefficient matrix G). The pitch transition image 142 thus represents the pitch and sounding period of each sound (a single tone per note) contained in the acoustic signal XB. By viewing the pitch transition image 142, the user can therefore grasp intuitively the time series of the performance sound of the target instrument (the sounding period and sounding intensity of each pitch).
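
One simple way to turn the coefficient matrix into such a gradation image is sketched below. The 8-bit grayscale mapping, the normalization by the largest coefficient, and the choice to place the lowest pitch at the bottom are all assumptions of this example. `piano_roll_image` is a hypothetical name.

```python
import numpy as np

def piano_roll_image(G):
    """Map a K x N coefficient matrix to an 8-bit grayscale piano-roll image.

    Brightness grows with a[k, n]. Rows are flipped so that the first pitch
    (k = 1) ends up on the bottom row of the image (assumption).
    """
    peak = G.max()
    scaled = G / peak if peak > 0 else np.zeros_like(G, dtype=float)
    return np.flipud((scaled * 255).astype(np.uint8))
```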

The instruction receiving unit 38 in FIG. 1 receives instructions that the user gives through the input device 16. The instruction receiving unit 38 of the first embodiment receives from the user an instruction specifying any performance sound T whose pitch is to be changed (hereinafter, “target sound”) in the pitch sequence S analyzed by the pitch analysis unit 34 (that is, in the pitch transition image 142 that the display control unit 36 has displayed on the display device 14). As illustrated in FIG. 3, by operating the input device 16 while viewing the pitch transition image 142, the user can select, from the performance sounds shown in the pitch transition image 142, the target sound T whose pitch is to be changed, and can specify the pitch P of that target sound T after the change (hereinafter, “target pitch”). The instruction receiving unit 38 receives from the user the instruction specifying the target sound T and the instruction specifying the target pitch P on the pitch transition image 142. The instruction receiving unit 38 can also receive instructions specifying a plurality of different target sounds T and a target pitch P for each of them.

The component extraction unit 40 in FIG. 1 generates a separated signal YA and a target signal YB from the acoustic signal XB, in which the performance sound of the target instrument is emphasized. The separated signal YA is an acoustic signal obtained by suppressing (ideally, removing) the target sound T specified by the user in the acoustic signal XB, and the target signal YB is an acoustic signal in which the target sound T in the acoustic signal XB is emphasized (ideally, an acoustic signal from which performance sounds other than the target sound T have been removed). Any known technique can be used to generate the separated signal YA and the target signal YB. A sound source separation process in the frequency domain (separation of the target sound T) using, for example, a Wiener filter is well suited to this purpose.
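
The separation step can be sketched with a Wiener-style ratio mask built from the NMF model. The description only says that a Wiener filter or similar may be used, so the mask below, the assumption that the target sound T is identified by one pitch index k and a range of frames, and the name `split_target` are all illustrative choices.

```python
import numpy as np

def split_target(W, B, G, k, frames, eps=1e-12):
    """Split a spectrogram W into (Y_A, Y_B) with a soft mask for one target note.

    B, G:   basis and coefficient matrices from the NMF analysis (float arrays).
    k:      row index of the target pitch in G (0-based here).
    frames: slice of frame indices over which the target sound is sounding.
    """
    target_G = np.zeros_like(G)
    target_G[k, frames] = G[k, frames]
    # Share of the modelled intensity explained by the target note, in [0, 1].
    mask = (B @ target_G) / (B @ G + eps)
    Y_B = mask * W          # target signal: target sound emphasized
    Y_A = (1.0 - mask) * W  # separated signal: target sound suppressed
    return Y_A, Y_B
```

In practice the same mask would typically be applied to the complex spectrogram so that both signals can be converted back to the time domain.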

The acoustic processing unit 42 generates an acoustic signal YC (hereinafter, “converted signal”) representing a performance sound of the target instrument at the target pitch P. Specifically, the acoustic processing unit 42 generates the converted signal YC at the target pitch P by processing the reference signal R generated by the sound source device 24. As illustrated in FIG. 1, the acoustic processing unit 42 of the first embodiment includes a reference sound acquisition unit 52 and a timbre conversion unit 54. The reference sound acquisition unit 52 acquires the reference signal R generated by the sound source device 24.

If the target sound T in the acoustic signal XB were simply replaced with a reference sound at the target pitch P generated by the sound source device 24, an acoustic signal Z in which the target sound T is changed to the target pitch P could, formally, be generated. However, because the acoustic signal XB carries acoustic characteristics specific to the recording environment (for example, an acoustic space such as a live music club), simple replacement leaves a marked difference in acoustic characteristics between the existing performance sounds in the acoustic signal XB and the replacement performance sound (the reference sound). A listener may therefore perceive the reproduced sound as unnatural. In view of this, the timbre conversion unit 54 of the first embodiment generates a converted signal YC at the target pitch P in which the timbre of the reference signal R acquired by the reference sound acquisition unit 52 is brought closer to the timbre of the target sound T in the acoustic signal XB. The specific content of the process that converts the timbre of the reference signal R to the timbre of the target sound T (hereinafter, “timbre conversion process”) is described later.

The mixing processing unit 44 in FIG. 1 generates the acoustic signal Z by mixing (for example, as a weighted sum) the acoustic signal XA of the singing sound generated by the sound source separation unit 32, the separated signal YA, which contains the sounds other than the target sound T and is generated by the component extraction unit 40, and the converted signal YC generated by the acoustic processing unit 42 (timbre conversion unit 54). That is, an acoustic signal Z is generated in which the pitch of the target sound T of the target instrument in the acoustic signal X has been changed to the target pitch P.
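
The weighted-sum mixing can be written in a few lines. The default weights and the truncation to the shortest input are assumptions of this sketch, and `mix` is a hypothetical name.

```python
import numpy as np

def mix(x_a, y_a, y_c, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the singing signal XA, the separated signal YA,
    and the converted signal YC, truncated to the shortest input (assumption)."""
    n = min(len(x_a), len(y_a), len(y_c))
    return weights[0] * x_a[:n] + weights[1] * y_a[:n] + weights[2] * y_c[:n]
```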

The mixing processing unit 44 of the first embodiment performs various kinds of sound processing before and after mixing the acoustic signal XA, the separated signal YA, and the converted signal YC. For example, it performs an adjustment process (equalization) that adjusts the frequency characteristics of each signal. The degree of reverberation may differ between the acoustic signal XA and the separated signal YA on the one hand and the converted signal YC on the other. An acoustic signal Z with a uniform sense of reverberation can therefore be generated by performing, in order, a reverberation suppression process that suppresses the reverberation component of each signal before mixing and a reverberation adding process that adds a suitable reverberation component to the mixed acoustic signal Z. The sound emitting device 26 emits the reproduced sound of the acoustic signal Z generated by the mixing processing unit 44. As this description shows, the sound emitting device 26 emits a reproduced sound in which the pitch of the target sound T specified by the user, among the sounds represented by the acoustic signal X, has been changed to the target pitch P.

FIG. 4 is a flowchart of the operation by which the arithmetic processing device 10 generates the acoustic signal Z from the acoustic signal X (hereinafter, “sound editing process”). The sound editing process starts when the user gives an instruction through the input device 16 (an instruction to start sound processing).

When the sound editing process starts, the sound source separation unit 32 generates, from the acoustic signal X output by the signal supply device 22, the acoustic signal XA of the singing sound and the acoustic signal XB of the performance sound of the target instrument (SA1). The pitch analysis unit 34 generates the pitch sequence S (coefficient matrix G) by applying, to the observation matrix W of the acoustic signal XB, non-negative matrix factorization that uses the basis matrix B stored in the storage device 12 as supervisory information (SA2). The display control unit 36 then causes the display device 14 to display the pitch transition image 142 representing the pitch sequence S (SA3).

When the instruction receiving unit 38 receives from the user an instruction specifying the target sound T and the target pitch P on the pitch transition image 142 (SA4: YES), the component extraction unit 40 generates, from the acoustic signal XB generated by the sound source separation unit 32, the separated signal YA containing the sounds other than the target sound T and the target signal YB of the target sound T (SA5). The acoustic processing unit 42 generates the converted signal YC by timbre conversion processing (morphing) that brings the reference signal R generated by the sound source device 24 closer to the timbre of the target sound T (SA6). The mixing processing unit 44 generates the acoustic signal Z by mixing the acoustic signal XA, the separated signal YA, and the converted signal YC (SA7).

<Acoustic processing unit 42>
FIG. 5 is a detailed configuration diagram of the acoustic processing unit 42. As illustrated in FIG. 5, the timbre conversion unit 54 of the acoustic processing unit 42 in the first embodiment includes an analysis processing unit 62 and an acoustic processing unit 64. FIG. 6 is a flowchart of the timbre conversion process SA6 executed by the acoustic processing unit 42 (reference sound acquisition unit 52 and timbre conversion unit 54) of the first embodiment, and FIG. 7 is an explanatory diagram of the timbre conversion process SA6.

When the timbre conversion process SA6 starts, the reference sound acquisition unit 52 identifies the pitch of the target sound T in the target signal YB (SB1) and acquires from the sound source device 24 a reference signal R1 representing a reference sound Q1 whose pitch is equivalent to that of the target sound T (SB2). As described above, the timbre of the reference sound Q1 differs from that of the target sound T in the acoustic signal XB. As illustrated in FIGS. 5 and 7, the analysis processing unit 62 generates the conversion filter H using the target signal YB generated by the component extraction unit 40 and the reference signal R1 acquired by the reference sound acquisition unit 52 in step SB2 (SB3). The conversion filter H is a filter for bringing the timbre of the reference sound Q1 generated by the sound source device 24 closer to the timbre of the target sound T.

Specifically, the analysis processing unit 62 generates a conversion filter H for each pair of mutually corresponding frames of the target signal YB and the reference signal R1 (for example, frames whose acoustic features are similar to each other). Any known technique, such as dynamic programming, may be used to determine the frame correspondence between the target signal YB and the reference signal R1. The conversion filter H of the first embodiment is a sequence of adjustment values (gains) h, one for each of a plurality of bands set on the frequency axis (hereinafter referred to as "analysis bands"). In the simplest case the analysis bands all have the same bandwidth, but their bandwidths can also be set in a logarithmic relationship so as to reflect the tendencies of human hearing. The adjustment value h of any one analysis band of the conversion filter H is calculated, for example, as the ratio of the intensity VY of the target signal YB to the intensity VR of the reference signal R1 (h = VY/VR). The intensity VR of the reference signal R1 is the sum of the intensities over the frequencies within the analysis band in the intensity spectrum of the reference signal R1, and the intensity VY of the target signal YB is the sum of the intensities over the frequencies within the analysis band in the intensity spectrum of the target signal YB. A configuration in which the adjustment values h are adjusted so that the mean of the plurality of adjustment values h constituting the conversion filter H becomes zero (zero mean) may also be adopted.
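
The per-band ratio h = VY/VR described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names, the representation of a frame as a plain list of intensity values, the `band_edges` list of bin indices, and the small `eps` guard against division by zero are all hypothetical choices made for the example.

```python
def band_energies(spectrum, band_edges):
    """Sum the intensity-spectrum bins that fall inside each analysis band.

    band_edges is a hypothetical list of bin indices; band i covers
    bins band_edges[i] .. band_edges[i + 1] - 1.
    """
    return [sum(spectrum[lo:hi])
            for lo, hi in zip(band_edges[:-1], band_edges[1:])]


def conversion_filter(target_spectrum, reference_spectrum, band_edges, eps=1e-12):
    """Per-band gains h = VY / VR for one pair of corresponding frames."""
    vy = band_energies(target_spectrum, band_edges)      # target signal YB
    vr = band_energies(reference_spectrum, band_edges)   # reference signal R1
    # eps is an assumption of this sketch: it avoids dividing by an empty band.
    return [y / max(r, eps) for y, r in zip(vy, vr)]


# Toy intensity spectra (8 bins) for one frame pair, split into 3 bands.
target_frame = [4.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.25, 0.25]
reference_frame = [2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0]
band_edges = [0, 2, 4, 8]

h = conversion_filter(target_frame, reference_frame, band_edges)
print(h)  # [1.5, 0.5, 0.375]
```

The wider third band in `band_edges` mimics the logarithmic band layout mentioned in the text; equal-width bands would work the same way.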

Once the analysis processing unit 62 has generated the conversion filter H by the procedure exemplified above, the reference sound acquisition unit 52 acquires from the sound source device 24 a reference signal R2 representing a reference sound Q2 of the target pitch P (a pitch different from that of the target sound T) (SB4). The timbre of the reference sound Q2 is equivalent to that of the reference sound Q1. As illustrated in FIGS. 5 and 7, the acoustic processing unit 64 generates the converted signal YC by applying the conversion filter H generated by the analysis processing unit 62 in step SB3 to the reference signal R2 (SB5). Specifically, the acoustic processing unit 64 multiplies each analysis band, obtained by dividing the intensity spectrum of each frame of the reference signal R2 along the frequency axis, by the corresponding adjustment value h of the conversion filter H. As described above, the conversion filter H acts to bring the timbre of the reference sound Q1 closer to that of the target sound T, so applying it to the reference signal R2 yields a converted signal YC representing sound of the target pitch P with a timbre approximating the target sound T. This completes the description of the timbre conversion process SA6.
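
Step SB5 amounts to scaling every bin of a frame of the reference signal R2 by the gain of the analysis band that contains it. The sketch below illustrates this in Python. The function name, the list-based spectrum, the `band_edges` bin indices, and the example gain values are assumptions for illustration only; converting the scaled intensity spectrum back to a waveform is outside the scope of the sketch.

```python
def apply_conversion_filter(spectrum, gains, band_edges):
    """Multiply each analysis band of one intensity-spectrum frame by its gain h.

    band_edges is a hypothetical list of bin indices; band i covers
    bins band_edges[i] .. band_edges[i + 1] - 1.
    """
    out = list(spectrum)
    for gain, lo, hi in zip(gains, band_edges[:-1], band_edges[1:]):
        for i in range(lo, hi):
            out[i] = spectrum[i] * gain
    return out


# One frame of the reference signal R2 (target pitch P) and example gains.
r2_frame = [1.0, 3.0, 2.0, 2.0, 4.0, 0.0, 1.0, 1.0]
gains = [1.5, 0.5, 0.375]          # example adjustment values h, one per band
band_edges = [0, 2, 4, 8]

yc_frame = apply_conversion_filter(r2_frame, gains, band_edges)
print(yc_frame)  # [1.5, 4.5, 1.0, 1.0, 1.5, 0.0, 0.375, 0.375]
```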

As understood from the above description, in the first embodiment the converted signal YC of the target pitch P, generated by processing the reference signal R acquired from the sound source device 24, is mixed with the separated signal YA in which the target sound T has been suppressed. Compared with a configuration that converts the target signal YB of the target sound T to the target pitch P, this suppresses degradation of the sound quality of the acoustic signal Z. The target signal YB generated by the component extraction unit 40 ideally consists only of the target sound, but in practice it also contains sounds other than the target sound (hereinafter referred to as "residual components"). In a configuration that converts the pitch of the target signal YB to the target pitch P, the residual components become particularly noticeable as a result of the pitch change. In the first embodiment, by contrast, the converted signal YC of the target pitch P generated from the reference signal R is mixed with the separated signal YA, so the pitch of the target signal YB does not need to be changed. This has the advantage that a high-quality acoustic signal Z can be generated even when the processing accuracy of the component extraction unit 40 is low (that is, when the target signal YB contains residual components). On the other hand, in a configuration that simply mixes a reference signal R generated independently of the acoustic signal XB with the separated signal YA, the timbre difference between the two causes an audible sense of incongruity. In the first embodiment, the reference sound of the reference signal R is converted to the timbre of the target sound T, so the sense of incongruity caused by the difference between the timbre of the acoustic signal XB and that of the reference sound can be eliminated.

Incidentally, one conceivable configuration for generating sound of the target pitch P with a timbre equivalent to the target sound T is to change the pitch of the target sound T to the target pitch P and then bring the timbre of the result closer to the target sound T (pitch shift followed by morphing). As described above, however, this has the problem that the residual components made noticeable by the pitch change of the target sound T become even more noticeable through the timbre conversion. In the first embodiment, by contrast, the conversion filter H for bringing the reference sound Q1, whose pitch is equivalent to that of the target sound T, closer to the timbre of the target sound T is generated from the target signal YB and the reference signal R1, and the converted signal YC is generated by applying the conversion filter H to the reference signal R2 of the reference sound Q2 of the target pitch P. In other words, no pitch conversion of the target sound T is required in principle. The first embodiment therefore has the advantage of preventing the degradation in sound quality that a pitch change of the target sound T would cause.

<Second Embodiment>
A second embodiment of the present invention is described below. In each configuration exemplified below, elements whose operation and function are the same as in the first embodiment are denoted by the reference signs used in the description of the first embodiment, and their detailed description is omitted as appropriate.

In the coefficient matrix G (pitch sequence S) generated by the pitch analysis unit 34, ideally only the coefficients a[k,n] corresponding to the sounds actually played by the target instrument take significant values. In practice, however, a coefficient a[k,n] of a pitch that stands in a particular relationship to a played sound of the target instrument (for example, an interval of a fifth) may take a significant value even though that pitch was not actually played. That is, significant coefficients a[k,n] can exist even outside the pitch range over which the actual pitches of the performance sound of the target instrument in the acoustic signal XB are distributed. By operating the input device 16 as appropriate, the user can specify, as illustrated in FIG. 8, a range A on the time axis and the pitch axis (hereinafter referred to as "sounding range") within the pitch transition image 142 displayed on the display device 14 in which the sound of the acoustic signal XB (the performance sound of the target instrument) is presumed to exist. For example, if the target instrument is a keyboard instrument (for example, a piano), the higher pitch range played with the performer's right hand and the lower pitch range played with the left hand are specified as the sounding range A. The instruction receiving unit 38 of the second embodiment receives the specification of the sounding range A described above from the user.

The pitch analysis unit 34 of the second embodiment reanalyzes the pitch sequence S taking into account the sounding range A received by the instruction receiving unit 38. Specifically, as illustrated in FIG. 8, the pitch analysis unit 34 calculates the pitch sequence S by non-negative matrix factorization that uses, as the initial value (initial matrix) of the coefficient matrix G, a matrix in which each coefficient a[k,n] outside the sounding range A specified by the user is set to zero and each coefficient a[k,n] inside the sounding range A is set to a significant nonzero value λ. The value λ is set, for example, to a random number. The display control unit 36 causes the display device 14 to display the pitch transition image 142 representing the pitch sequence S reanalyzed by the pitch analysis unit 34. The process of generating the acoustic signal Z in response to user instructions on the pitch transition image 142 is the same as in the first embodiment.
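
The effect of the zero-initialized coefficients can be illustrated with a small sketch. The text does not state which update rule the factorization uses; the sketch assumes the common multiplicative update for the Euclidean cost with the basis matrix B held fixed and the model W ≈ BG. Under that assumption, a coefficient that starts at zero stays at zero, so the initial matrix acts as a constraint that keeps the result inside the sounding range A. All function names, the toy matrices, and the `eps` guard are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def initial_coefficients(num_pitches, num_frames, in_range, rng):
    """Zero outside sounding range A, a random positive value inside it."""
    return [[rng.uniform(0.1, 1.0) if in_range(k, n) else 0.0
             for n in range(num_frames)] for k in range(num_pitches)]


def supervised_nmf(W, B, G, iterations=100, eps=1e-12):
    """Update only G (B stays fixed) by multiplicative updates for W ~ B G."""
    Bt = transpose(B)
    BtW = matmul(Bt, W)
    BtB = matmul(Bt, B)
    for _ in range(iterations):
        BtBG = matmul(BtB, G)
        # A zero entry of G is multiplied by a finite factor, so it stays zero.
        G = [[G[k][n] * BtW[k][n] / (BtBG[k][n] + eps)
              for n in range(len(G[0]))] for k in range(len(G))]
    return G


# Toy data: 4 frequency bins, 3 pitches (columns of B), 2 frames.
# Pitch 2 overlaps spectrally with pitches 0 and 1 but is never played.
B = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 0.25],
     [0.0, 0.5, 0.25]]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.0],
     [0.0, 0.5]]

rng = random.Random(0)
# Hypothetical sounding range A: pitches 0 and 1 over all frames.
G0 = initial_coefficients(3, 2, lambda k, n: k < 2, rng)
G = supervised_nmf(W, B, G0)
```

In this toy case the row of G for pitch 2 remains exactly zero, while the coefficients for pitch 0 in frame 0 and pitch 1 in frame 1 converge to approximately 1.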

The second embodiment achieves the same effects as the first embodiment. In addition, in the second embodiment the pitch sequence S is generated by non-negative matrix factorization that uses, as the initial value of the coefficient matrix G, a matrix in which each coefficient a[k,n] outside the sounding range A is set to zero. That is, the pitch sequence S is updated so as to reflect the sounding range A specified by the user. This has the advantage that the pitch sequence S can be generated more accurately than in a configuration in which the specification of the sounding range A is not reflected in the pitch sequence S.

<Third Embodiment>
FIG. 9 is a flowchart of the timbre conversion process SA6 executed by the acoustic processing unit 42 (reference sound acquisition unit 52 and timbre conversion unit 54) of the third embodiment. The first embodiment illustrated the generation of the conversion filter H from the target signal YB and the reference signal R1 on the assumption that the target sound T and the reference sound Q1 have equivalent pitches. In practice, however, the pitches of the target sound T and the reference sound Q1 may differ, for example because of the tuning of the target instrument in the acoustic signal XB. In view of this, as illustrated in FIG. 9, the reference sound acquisition unit 52 of the third embodiment executes a process that adjusts the target sound T and the reference sound Q1 to equivalent pitches (SB10) between the acquisition of the reference signal R1 (SB2) and the generation of the conversion filter H (SB3). Specifically, the reference sound acquisition unit 52 of the third embodiment processes the reference signal R1 of the reference sound Q1 so as to adjust the reference sound Q1 to the pitch of the target sound T. Any known technique (pitch shifting) may be used to change the pitch of the reference signal R1. The analysis processing unit 62 generates the conversion filter H from the adjusted reference signal R1 and the target signal YB of the target sound T by the same method as in the first embodiment (SB3).

The third embodiment achieves the same effects as the first embodiment. In addition, in the third embodiment the conversion filter H is generated after the target sound T and the reference sound Q1 have been adjusted to equivalent pitches. Compared with generating the conversion filter H while the pitches of the target sound T and the reference sound Q1 differ, this has the advantage of yielding a conversion filter H that can convert the reference sound Q1 (and hence the reference sound Q2) to the timbre of the target sound T with high accuracy. The above description illustrated a configuration that adjusts the reference sound Q1 to the pitch of the target sound T, but the target sound T can instead be adjusted to a pitch equivalent to that of the reference sound Q1. As described above, however, the target sound T contains residual components other than the target sound, and these may become noticeable as a result of changing the pitch of the target sound T. In view of this, the configuration that adjusts the reference sound Q1 of the reference signal R1 to the pitch of the target sound T is particularly suitable.

<Modifications>
Each of the above embodiments can be modified in various ways. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate.

(1) In each of the above embodiments, the pitch sequence S is generated by non-negative matrix factorization of the acoustic signal XB, but the method of generating the pitch sequence S is not limited to this example. For example, known analysis techniques such as automatic transcription can also be used to generate the pitch sequence S. In the second embodiment, it is also possible to generate a provisional pitch sequence S by a method other than non-negative matrix factorization, and then obtain the final pitch sequence S by executing non-negative matrix factorization of the observation matrix W with, as the initial value, a coefficient matrix G in which each coefficient a[k,n] corresponding to the outside of the sounding range A in the pitch transition image 142 of the provisional pitch sequence S is set to zero. That is, the method of generating the provisional pitch sequence S before the sounding range A is specified may differ from the method of generating the final pitch sequence S that reflects the sounding range A. The specification of the sounding range A and the reanalysis of the pitch sequence S can also be repeated multiple times.

(2) In each of the above embodiments, the coefficient matrix G is calculated by non-negative matrix factorization of the observation matrix W using a basis matrix B whose K basis vectors correspond to performance sounds of different pitches of the target instrument, but the content of the non-negative matrix factorization applied to the observation matrix W may be changed as appropriate. For example, a configuration may be adopted in which a basis matrix B composed of K basis vectors whose elements are each initialized with random numbers (hereinafter referred to as "provisional basis vectors") is updated successively, together with the coefficient matrix G, by the iterative computation of the non-negative matrix factorization.

It is also possible to use, for the non-negative matrix factorization, a basis matrix B that mixes basis vectors prepared in advance for the performance sounds of the target instrument with arbitrary provisional basis vectors. In this configuration, if the acoustic signal XB contains the performance sounds of instruments other than the target instrument (hereinafter referred to as "other instruments") in addition to those of the target instrument, the basis matrix B is updated successively so that the performance sounds of the other instruments are reflected in the provisional basis vectors. This has the advantage that the pitch sequence S of the target instrument can be identified with high accuracy even when the acoustic signal XB contains the performance sounds of other instruments. When the second embodiment is applied to this configuration, it is preferable to set the coefficients a[k,n] outside the sounding range A to zero only for the coefficient vectors g[k] of the initial coefficient matrix G that correspond to the basis vectors of the target instrument (that is, the coefficients a[k,n] of the coefficient vectors g[k] corresponding to the provisional basis vectors are not set to zero). The constraint conditions exemplified in, for example, JP 2013-033196 A can also be applied to the non-negative matrix factorization of the observation matrix W.
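
The preferred initialization for the mixed basis matrix can be sketched as follows. The sketch assumes, purely for illustration, that the first `num_target` rows of G belong to the prepared basis vectors of the target instrument and the remaining rows to provisional basis vectors; the function name, the `in_range` predicate, and the range of the random values are likewise hypothetical.

```python
import random


def initial_mixed_coefficients(num_target, num_provisional, num_frames,
                               in_range, rng):
    """Build an initial coefficient matrix G for a mixed basis matrix.

    Rows 0 .. num_target - 1 (target-instrument basis vectors) are zeroed
    outside sounding range A; rows for provisional basis vectors are left
    fully nonzero so they can absorb the sounds of other instruments.
    """
    G = []
    for k in range(num_target + num_provisional):
        is_target = k < num_target
        row = []
        for n in range(num_frames):
            if is_target and not in_range(k, n):
                row.append(0.0)
            else:
                row.append(rng.uniform(0.1, 1.0))
        G.append(row)
    return G


rng = random.Random(0)
# Hypothetical sounding range A: only frame 1 is active for every pitch.
G_init = initial_mixed_coefficients(2, 1, 3, lambda k, n: n == 1, rng)
```

Here rows 0 and 1 of `G_init` are nonzero only in frame 1, while row 2 (the provisional basis vector) is nonzero in every frame.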

(3) In a configuration in which the sound source device 24 can generate reference signals R for the performance sounds of a plurality of types of instruments (instruments of the same kind but with different timbres may be treated as different types), the reference sound acquisition unit 52 can acquire the reference signal R of the performance sound of the instrument that the user selects from among them (an instrument that the user judges, from the reproduced sound of the acoustic signal X, to have similar acoustic characteristics).

(4) The second embodiment illustrated a configuration in which the user specifies the sounding range A, but the method of setting the sounding range A is not limited to this example. For example, the distribution range of the notes on the time axis and the pitch axis can be identified by referring to music data that specifies the performance content (time series of notes) of the piece in the acoustic signal X (for example, time-series data conforming to the MIDI standard), and the pitch analysis unit 34 can set that range as the sounding range A. Alternatively, on the premise that the coefficients a[k,n] at points where performance sound actually exists tend to take relatively large values, the range over which coefficients a[k,n] exceeding a threshold are distributed in the coefficient matrix G (pitch sequence S) can be set as the sounding range A. In the second embodiment the sounding range A is defined by both a range on the pitch axis and a range on the time axis, but a configuration that sets only a range on the pitch axis (the entire range on the time axis) as the sounding range A, or only a range on the time axis (the entire range on the pitch axis), may also be adopted.
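
The threshold-based variant can be illustrated with a short sketch. The text only states that the range over which above-threshold coefficients are distributed is used; representing that range as a bounding box of pitch and frame indices is an assumption of this example, as are the function name and the toy values.

```python
def sounding_range_from_threshold(G, threshold):
    """Return (min_pitch, max_pitch, min_frame, max_frame) of the cells of G
    whose coefficient exceeds the threshold, or None if there are none.

    The bounding-box representation is an assumption of this sketch.
    """
    cells = [(k, n) for k, row in enumerate(G)
             for n, a in enumerate(row) if a > threshold]
    if not cells:
        return None
    pitches = [k for k, _ in cells]
    frames = [n for _, n in cells]
    return (min(pitches), max(pitches), min(frames), max(frames))


# Toy coefficient matrix G: 3 pitches (rows) by 4 frames (columns).
G = [[0.0, 0.1, 0.0, 0.0],
     [0.0, 0.9, 0.8, 0.0],
     [0.1, 0.0, 0.7, 0.0]]

range_a = sounding_range_from_threshold(G, 0.5)
print(range_a)  # (1, 2, 1, 2)
```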

(5) Each of the above embodiments illustrated, for convenience, the case of changing the pitch of the target sound, but the sounding period (start point and end point) of the target sound can be changed together with the pitch. For example, a configuration may be adopted in which the timbre conversion unit 54 (conversion processing unit 64) stretches or compresses the reference signal R2 acquired by the reference sound acquisition unit 52 to the target duration and then applies the conversion filter H, or in which the timbre conversion unit 54 (conversion processing unit 64) stretches or compresses to the target duration the converted signal YC generated by applying the conversion filter H to the reference signal R2.

(6) When the target sound T and the target pitch P are provisionally specified on the pitch transition image 142, the converted signal YC can be generated and emitted from the sound emitting device 26. This configuration has the advantage that the user can listen in advance to the result of changing the target sound T.

(7) The third embodiment illustrated a configuration that adjusts one of the target signal YB and the reference signal R1 to the pitch of the other, but a configuration that changes (quantizes) the pitches of the target signal YB and the reference signal R1 to the closest of a plurality of preset pitches may also be adopted. If the target sound T of the target signal YB or the reference sound Q1 of the reference signal R1 contains minute pitch fluctuations, the conversion filter H can be generated after those fluctuations have been suppressed (ideally removed). For example, an acoustic signal XB of a singing sound generated by speech synthesis can be accompanied by minute fluctuations such as vibrato, so a configuration that suppresses minute pitch fluctuations in the target signal YB is particularly suitable. The conversion filter H can also be generated after residual components and noise components have been removed from the target signal YB.
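
Quantizing a pitch to the closest preset pitch can be sketched as below. The text does not say how closeness is measured; the sketch assumes distance on a logarithmic frequency scale (equivalent to distance in cents), and the function name and the example preset frequencies are hypothetical.

```python
import math


def quantize_pitch(freq_hz, preset_pitches_hz):
    """Return the preset pitch closest to freq_hz on a log-frequency scale."""
    return min(preset_pitches_hz,
               key=lambda p: abs(math.log2(freq_hz / p)))


# Example presets: G#4, A4 and A#4 in equal temperament (A4 = 440 Hz).
presets = [415.30, 440.00, 466.16]

print(quantize_pitch(445.0, presets))  # 440.0
print(quantize_pitch(455.0, presets))  # 466.16
```

In the third-embodiment setting, both the pitch of the target signal YB and the pitch of the reference signal R1 would be mapped through the same preset list before the conversion filter H is generated.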

(8) Each of the above embodiments illustrated a configuration in which the reference sound acquisition unit 52 acquires the reference signal R generated by the sound source device 24, but a configuration may also be adopted in which the reference signal R generated by the sound source device 24 is stored in the storage device 12 in advance and the reference sound acquisition unit 52 acquires the reference signal R from the storage device 12. The basis matrix B (each basis vector b[k]) can also be generated by converting the reference signal R of each pitch generated by the sound source device 24 into the frequency domain.

(9) In each of the above embodiments, the acoustic signal X is separated into the acoustic signal XA of the singing sound and the acoustic signal XB of the performance sound of the target instrument, but the configuration that separates out the acoustic signal XA of the singing sound may be omitted. For example, in a configuration that processes an acoustic signal X containing no singing sound, the sound source separation unit 32 is omitted and the mixing processing unit 44 generates the acoustic signal Z by mixing the separated signal YA and the converted signal YC.

(10) The acoustic processing device 100 may also be implemented as a server device that communicates with a terminal device such as a mobile phone. For example, the acoustic processing device 100 generates the acoustic signal Z from the acoustic signal X received from the terminal device and transmits the acoustic signal Z to the terminal device.

DESCRIPTION OF SYMBOLS: 100 ... acoustic processing device, 10 ... arithmetic processing device, 12 ... storage device, 14 ... display device, 16 ... input device, 22 ... signal supply device, 24 ... sound source device, 26 ... sound emitting device, 32 ... sound source separation unit, 34 ... pitch analysis unit, 36 ... display control unit, 38 ... instruction receiving unit, 40 ... component extraction unit, 42 ... acoustic modification unit, 44 ... mixing processing unit, 52 ... reference sound acquisition unit, 54 ... timbre conversion unit, 62 ... analysis processing unit, 64 ... acoustic processing unit.

Claims

An acoustic processing device comprising:
reference sound acquisition means for acquiring a first reference signal representing a first reference sound having a timbre different from that of a target sound and a pitch equivalent to that of the target sound, and a second reference signal representing a second reference sound having a timbre equivalent to that of the first reference sound and a target pitch different from the pitch of the target sound;
analysis processing means for generating, using a target signal representing the target sound and the first reference signal, a conversion filter for bringing the first reference sound closer to the timbre of the target sound; and
acoustic processing means for applying the conversion filter to the second reference signal to generate a converted signal representing a sound of the target pitch with a timbre approximating that of the target sound.

The acoustic processing device according to claim 1, wherein the reference sound acquisition means adjusts the pitch of one of the target signal and the first reference signal to the pitch of the other.

The acoustic processing device according to claim 2, further comprising component extraction means for generating the target signal by suppressing sounds other than the target sound in an acoustic signal,
wherein the reference sound acquisition means adjusts the first reference signal to a pitch equivalent to that of the target signal.