JP6299140B2

JP6299140B2 - Sound processing apparatus and sound processing method

Info

Publication number: JP6299140B2
Application number: JP2013216141A
Authority: JP
Inventors: 慶太有元; 多伸近藤; 祐高橋
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2013-10-17
Filing date: 2013-10-17
Publication date: 2018-03-28
Anticipated expiration: 2033-10-17
Also published as: JP2015079122A

Description

The present invention relates to a technique for processing an acoustic signal representing sound.

Sound source separation techniques that separate a specific acoustic component from a mixture of acoustic components produced by different sound sources have been proposed. For example, Non-Patent Documents 1 and 2 disclose techniques that separate the singing sound from an acoustic signal of a mixture of the singing sound and the accompaniment sound of a song, using a source-filter model that expresses the frequency characteristics of the singing sound by harmonic characteristics (source) and envelope characteristics (filter).

Non-Patent Document 1: Jean-Louis Durrieu, et al., "MAIN INSTRUMENT SEPARATION FROM STEREOPHONIC AUDIO SIGNALS USING A SOURCE/FILTER MODEL", in Proc. EUSIPCO, p.15-18, 2009
Non-Patent Document 2: Jean-Louis Durrieu, et al., "A musically motivated mid-level representation for pitch estimation and musical audio source separation", IEEE Journal of Selected Topics on Signal Processing 5(6), p.1180-1191, 2011

However, with the techniques of Non-Patent Documents 1 and 2, an acoustic component whose acoustic characteristics resemble those of the singing sound (for example, a percussion sound whose acoustic characteristics resemble those of a consonant) may be erroneously extracted as singing sound in a section where no singing sound actually exists. In view of the above circumstances, an object of the present invention is to separate the singing sound from an acoustic signal with high accuracy.

To solve the above problem, the sound processing apparatus of the present invention comprises: voice analysis means for identifying voiced sections and unvoiced sections in a reference acoustic signal representing the singing sound of a user singing a song; voiced separation means for separating the voiced component of the singing sound from the voiced sections, identified by the voice analysis means, of an acoustic signal of a mixture of the singing sound and the accompaniment sound of the song; unvoiced separation means for separating the unvoiced component of the singing sound from the unvoiced sections, identified by the voice analysis means, of the acoustic signal; and synthesis processing means for synthesizing the voiced component separated by the voiced separation means and the unvoiced component separated by the unvoiced separation means. In this configuration, the voiced sections and unvoiced sections are identified from the reference acoustic signal of the user singing the song, the voiced component is separated from the voiced sections of the acoustic signal, and the unvoiced component is separated from the unvoiced sections. Compared with a configuration that separates the voiced and unvoiced components from the acoustic signal without using a reference acoustic signal, this has the advantage that the singing sound can be separated with high accuracy.

A voiced section means a section in which a voiced component, whose harmonic structure is clearly observed, is dominant. An unvoiced section, on the other hand, is a section in which an unvoiced component, whose harmonic structure is not clearly observed, is dominant; it is distinguished from a silent section, in which no voice is present.

In a preferred aspect of the present invention, the voice analysis means includes section specifying means for identifying the voiced sections and unvoiced sections of the reference acoustic signal, and lyrics recognition means for identifying, by lyrics recognition on the reference acoustic signal, consonant sections that correspond to consonants of the singing sound within the unvoiced sections. The unvoiced separation means separates the unvoiced component of the singing sound from the consonant sections, identified by the lyrics recognition means, of the acoustic signal. In this aspect, the consonant sections corresponding to consonants of the singing sound within the unvoiced sections of the reference acoustic signal are identified by lyrics recognition, and the unvoiced component is separated from the consonant sections of the acoustic signal. Therefore, even when the reference acoustic signal contains unvoiced sounds other than consonants of the singing sound, the singing sound can be separated from the acoustic signal with high accuracy.

In a preferred aspect of the present invention, the lyrics recognition means identifies the consonants of the reference acoustic signal by lyrics recognition, and the unvoiced separation means separates the unvoiced component by supervised non-negative matrix factorization that uses, as teacher information, the basis matrix of the consonant identified by the lyrics recognition means, selected from a plurality of basis matrices representing the frequency characteristics of different consonants. In this aspect, the basis matrix corresponding to the consonant identified by lyrics recognition on the reference acoustic signal is applied to the supervised non-negative matrix factorization that separates the unvoiced component from the acoustic signal. This has the advantage that the consonants of the singing sound in the acoustic signal can be separated as the unvoiced component with high accuracy.

A sound processing apparatus according to a preferred aspect of the present invention comprises learning processing means for generating a basis matrix by a learning process that uses the unvoiced sections, identified by the section specifying means, of the reference acoustic signal. In this aspect, the basis matrix applied to the separation of the unvoiced component is generated by a learning process on the reference acoustic signal, so there is no need to prepare basis matrices in advance. Whether or not the lyrics recognition means is present is immaterial to a configuration that includes the learning processing means.

The sound processing apparatus according to each of the above aspects can be realized by hardware (an electronic circuit) such as a DSP (Digital Signal Processor) dedicated to processing acoustic signals, or by cooperation between a general-purpose arithmetic processing unit such as a CPU (Central Processing Unit) and a program. The program of the present invention can be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium. An optical recording medium (optical disc) such as a CD-ROM is a typical example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, may be used. The program of the present invention can also be provided by distribution over a communication network and installed on a computer. The present invention is also specified as a method of operating the sound processing apparatus according to each of the aspects described above (a sound processing method).

FIG. 1 is a configuration diagram of a sound processing apparatus according to a first embodiment of the present invention.
FIG. 2 is a detailed configuration diagram of the sound processing apparatus.
FIG. 3 is a flowchart of the operation of the sound processing apparatus.
FIG. 4 is a configuration diagram of a sound processing apparatus according to a second embodiment.

FIG. 1 is a configuration diagram of a sound processing apparatus 100 according to the first embodiment of the present invention. As illustrated in FIG. 1, a signal supply device 12, a sound emitting device 14, and a sound collection device 16 are connected to the sound processing apparatus 100. The signal supply device 12 supplies an acoustic signal SA to the sound processing apparatus 100. The acoustic signal SA is a time-domain signal representing the waveform of a mixture of acoustic components with different acoustic characteristics (for example, voice and musical tones). For example, a playback device that reads the acoustic signal SA from a portable or built-in recording medium (typically a music CD) and supplies it to the sound processing apparatus 100 can be used as the signal supply device 12. The signal supply device 12 may also be integrated with the sound processing apparatus 100.

In the first embodiment, an acoustic signal SA of a mixture of the singing sound and the accompaniment sound of a specific song (hereinafter the "target song") is supplied from the signal supply device 12 to the sound processing apparatus 100. The singing sound can contain a voiced component and an unvoiced component. The voiced component is an acoustic component in which a harmonic structure is observed, that is, a fundamental component and a plurality of overtone components arranged on the frequency axis at integer multiples of the fundamental frequency. The unvoiced component is an acoustic component in which no clear harmonic structure is observed. Typically, the vowels of the singing sound correspond to the voiced component, and consonants such as fricatives and plosives (unvoiced consonants) correspond to the unvoiced component. The accompaniment sound, on the other hand, comprises the musical tones of several different kinds of instruments.

The sound processing apparatus 100 of the first embodiment is a signal processing apparatus (sound source separation apparatus) that generates an acoustic signal SB by applying acoustic processing to the acoustic signal SA supplied from the signal supply device 12. The acoustic signal SB is a time-domain signal representing the waveform of the sound obtained by separating the singing sound contained in the acoustic signal SA (that is, sound in which the accompaniment of the song is suppressed). The sound emitting device 14 (for example, a speaker or headphones) emits sound waves corresponding to the acoustic signal SB generated by the sound processing apparatus 100. The D/A converter that converts the acoustic signal SB from digital to analog is omitted from the figure for convenience.

The sound collection device 16 picks up ambient sound and generates an acoustic signal representing its time waveform. The sound collection device 16 of the first embodiment supplies the sound processing apparatus 100 with an acoustic signal SREF (hereinafter the "reference acoustic signal") of the singing sound of the user singing the target song (the vocal part). The supply of the reference acoustic signal SREF from the sound collection device 16 to the sound processing apparatus 100 (the user singing the target song) runs in parallel, in real time, with the acoustic processing of the acoustic signal SA and the playback of the processed acoustic signal SB. The A/D converter that converts the reference acoustic signal SREF from analog to digital is omitted from the figure for convenience.

As illustrated in FIG. 1, the sound processing apparatus 100 is realized by a computer system comprising an arithmetic processing unit 22 and a storage device 24. The storage device 24 stores the program executed by the arithmetic processing unit 22 and the various data it uses. Any known recording medium, such as a semiconductor recording medium or a magnetic recording medium, or a combination of several kinds of recording media, can be used as the storage device 24. A configuration in which the acoustic signal SA is stored in the storage device 24 (so that the signal supply device 12 can be omitted) is also suitable.

The storage device 24 of the first embodiment stores a plurality of basis matrices M corresponding to different consonants. The basis matrix M for any one kind of consonant is an acoustic model (consonant model) expressing the frequency characteristics of that consonant. As illustrated in FIG. 1, the basis matrix M of the first embodiment is a non-negative matrix in which a plurality of basis vectors m, each representing a typical frequency characteristic (frequency spectrum) of the consonant, are arranged in the column direction. It is used as teacher information (prior information) in the non-negative matrix factorization (NMF) of the acoustic signal SA.

By executing the program stored in the storage device 24, the arithmetic processing unit 22 realizes a plurality of functions (a voice analysis unit 32 and a signal processing unit 34) for generating the acoustic signal SB from the acoustic signal SA. The voice analysis unit 32 analyzes the acoustic characteristics of the reference acoustic signal SREF supplied from the sound collection device 16. The signal processing unit 34 generates the acoustic signal SB from the acoustic signal SA using the result of the analysis of the reference acoustic signal SREF by the voice analysis unit 32. That is, in the first embodiment, the reference acoustic signal SREF of the user singing the target song is used as information that assists sound source separation of the acoustic signal SA. A configuration in which the functions of the arithmetic processing unit 22 are distributed over several integrated circuits, or in which a dedicated electronic circuit (for example, a DSP) realizes some of those functions, may also be adopted. In practice, there are also an element that converts the time-domain acoustic signal SA into the frequency domain (for example, by a discrete Fourier transform) and an element that converts the frequency-domain acoustic signal SB into the time domain (for example, by an inverse discrete Fourier transform), but their description and illustration are omitted below for convenience.

FIG. 2 is a detailed configuration diagram of the sound processing apparatus 100. The voice analysis unit 32 sequentially identifies voiced sections QV (V: Voiced) and unvoiced sections QU (U: Unvoiced) on the time axis from the reference acoustic signal SREF supplied from the sound collection device 16. A voiced section QV is a section, within the voice sections of the reference acoustic signal SREF (sections in which voice is present), in which the voiced component is dominant. An unvoiced section QU is a section, within the voice sections of the reference acoustic signal SREF, in which the unvoiced component is dominant. As illustrated in FIG. 2, the voice analysis unit 32 of the first embodiment comprises a section specifying unit 42 and a lyrics recognition unit 44.

The section specifying unit 42 sequentially identifies the voiced sections QV and the unvoiced sections QU0 (the sections from which the unvoiced sections QU are derived) of the reference acoustic signal SREF. Any known technique can be used to identify the voiced sections QV and the unvoiced sections QU0. For example, the section specifying unit 42 detects the voice sections of the reference acoustic signal SREF in which singing sound is present by known voice activity detection (VAD). Within the voice sections, it identifies sections in which a significant pitch is observed (that is, sections with a clear harmonic structure) as voiced sections QV, and sections in which no significant pitch is observed (that is, sections without a clear harmonic structure) as unvoiced sections QU0.
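The two-stage decision described above (voice activity detection, then a pitch test) can be sketched as follows. This is only an illustration under stated assumptions: the source says merely that known techniques are used, so the energy threshold standing in for VAD, the normalized autocorrelation peak standing in for "a significant pitch", and all function names and parameter values are hypothetical.

```python
import numpy as np

def classify_frame(frame, sr, energy_thresh=1e-4, periodicity_thresh=0.5,
                   fmin=80.0, fmax=1000.0):
    """Label one frame of SREF as 'silent', 'voiced' or 'unvoiced'.

    Assumption: frame energy stands in for VAD, and the peak of the
    normalized autocorrelation stands in for 'a significant pitch'.
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - np.mean(frame)
    if np.mean(frame ** 2) < energy_thresh:
        return "silent"  # outside the voice sections
    # One-sided autocorrelation, normalized so that lag 0 equals 1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]
    lo = int(sr / fmax)                      # shortest pitch period (samples)
    hi = min(int(sr / fmin), len(ac) - 1)    # longest pitch period (samples)
    peak = np.max(ac[lo:hi + 1])
    return "voiced" if peak >= periodicity_thresh else "unvoiced"

def find_sections(signal, sr, frame_len=1024):
    """Merge consecutive frames with equal labels into (label, start, end)."""
    sections = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        label = classify_frame(signal[start:start + frame_len], sr)
        if sections and sections[-1][0] == label:
            sections[-1] = (label, sections[-1][1], start + frame_len)
        else:
            sections.append((label, start, start + frame_len))
    return sections
```

In this sketch the sections labeled "voiced" play the role of QV and those labeled "unvoiced" play the role of QU0. A practical implementation would likely use overlapping frames and a more robust pitch estimator.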

The lyrics recognition unit 44 performs lyrics recognition on the reference acoustic signal SREF. By this lyrics recognition, the lyrics recognition unit 44 of the first embodiment sequentially identifies, as unvoiced sections QU, the sections (consonant sections) within the unvoiced sections QU0 identified by the section specifying unit 42 that correspond to consonants of the singing sound, and also sequentially identifies the consonants (pronounced content) C present in the unvoiced sections QU0 of the reference acoustic signal SREF. An unvoiced section QU is what remains of an initial unvoiced section QU0 identified by the section specifying unit 42 after excluding the sections in which unvoiced sounds other than the singing sound (for example, percussion sounds) are dominant. That is, part of the unvoiced section QU0 is identified as the unvoiced section QU (consonant section). Any known speech recognition technique can be used for the lyrics recognition (speech recognition) performed by the lyrics recognition unit 44.
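The narrowing of QU0 to QU can be illustrated as an interval intersection. The sketch below assumes, purely for illustration, that the recognizer returns time-stamped phoneme segments as `(start, end, phoneme)` tuples; that output format, the consonant set, and the function name are hypothetical and are not specified in the source.

```python
# Assumed set of unvoiced consonant labels; the real inventory depends on
# the recognizer and the language of the lyrics.
UNVOICED_CONSONANTS = {"s", "sh", "t", "k", "p", "h", "f", "ch", "ts"}

def consonant_sections(qu0_sections, phoneme_segments):
    """Return (start, end, consonant) for the parts of QU0 that the
    recognizer labeled as an unvoiced consonant (these play the role of QU
    and C)."""
    result = []
    for q_start, q_end in qu0_sections:
        for p_start, p_end, phoneme in phoneme_segments:
            if phoneme not in UNVOICED_CONSONANTS:
                continue
            # Keep only the overlap between the QU0 section and the segment.
            start, end = max(q_start, p_start), min(q_end, p_end)
            if start < end:
                result.append((start, end, phoneme))
    return result
```

A QU0 section that overlaps no recognized consonant (for example, one caused by a percussion sound picked up by the microphone) yields no entry and is therefore excluded from QU.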

The signal processing unit 34 generates the acoustic signal SB from the acoustic signal SA by signal processing (sound source separation) that applies the voiced sections QV, the unvoiced sections QU, and the consonants C identified by the voice analysis unit 32. The generation of the acoustic signal SB by the signal processing unit 34 is executed in real time, in parallel with the user's singing (that is, with the processing of the reference acoustic signal SREF by the voice analysis unit 32). As illustrated in FIG. 2, the signal processing unit 34 of the first embodiment comprises a voiced separation unit 52, an unvoiced separation unit 54, and a synthesis processing unit 56.

The voiced separation unit 52 separates (emphasizes or extracts) the voiced component V of the singing sound from each voiced section QV, identified by the voice analysis unit 32 (section specifying unit 42), of the acoustic signal SA supplied from the signal supply device 12. Any known sound source separation technique can be used to separate the voiced component V. Specifically, the sound source separation technique of Non-Patent Documents 1 and 2 (V-IMM: "Voiced"-Instantaneous Mixture Model), which uses a source-filter model that expresses the voiced component of the singing sound by harmonic characteristics and envelope characteristics, is well suited to the separation of the voiced component V by the voiced separation unit 52. That is, the voiced component V is expressed as the element-wise product (Hadamard product) of a non-negative matrix corresponding to the time series of harmonic characteristics (source), which derive from the vibration of a sound source such as the vocal cords, and a non-negative matrix corresponding to the time series of envelope characteristics (filter), which derive from modulation within a resonant tube such as the vocal tract. The voiced component V is then estimated by iterating a predetermined update formula so that the sum of the voiced component V and the acoustic components other than the voiced component V approximates the frequency characteristics of the acoustic signal SA (an observation matrix representing its spectrogram).
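The model described in this paragraph can be written compactly as below. The symbols X (observation matrix), S (source matrix), F (filter matrix), and A (acoustic components other than V) are notation assumed here for illustration; they do not appear in the source.

```latex
V = S \odot F, \qquad X \approx V + A, \qquad S,\ F,\ A \ \ge\ 0
```

Here $\odot$ denotes the Hadamard product, and all matrices have one row per frequency bin and one column per time frame. The concrete update formula that drives $V + A$ toward $X$ is the one given in Non-Patent Documents 1 and 2 and is not reproduced here.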

The unvoiced separation unit 54 separates (emphasizes or extracts) the unvoiced component U of the singing sound from each unvoiced section QU, identified by the voice analysis unit 32 (lyrics recognition unit 44), of the acoustic signal SA supplied from the signal supply device 12. Any known sound source separation technique can be used to separate the unvoiced component U, but the unvoiced separation unit 54 of the first embodiment estimates the unvoiced component U by non-negative matrix factorization using the basis matrices M stored in the storage device 24. Specifically, the unvoiced separation unit 54 searches the plurality of basis matrices M stored in the storage device 24 for the basis matrix M corresponding to the consonant C identified by the lyrics recognition unit 44, and separates the unvoiced component U by supervised non-negative matrix factorization (Supervised-NMF) that uses that basis matrix M as teacher information (prior information). The technique disclosed in Japanese Patent Application Laid-Open No. 2013-33196, for example, is well suited to this supervised non-negative matrix factorization.
Specifically, the unvoiced component U is expressed as the matrix product of the basis matrix M of the consonant C and a coefficient matrix representing the time series of the weights of the basis vectors m. The unvoiced component U is estimated by iterating a predetermined update formula so that the sum of the unvoiced component U and the acoustic components other than the unvoiced component U approximates the frequency characteristics (observation matrix) of the acoustic signal SA. As the above description shows, in the first embodiment the voiced component V and the unvoiced component U of the singing sound in the acoustic signal SA are separated by different methods.
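A minimal sketch of supervised NMF with a fixed basis is given below. It uses the standard multiplicative updates for the Euclidean distance, which is an assumption made here; the update formula of the cited publication may differ (for example, in its cost function or constraints). The names `supervised_nmf`, `separate_unvoiced`, `G`, `W`, `H`, and `bases` are hypothetical.

```python
import numpy as np

def supervised_nmf(Y, M, n_other=8, n_iter=200, eps=1e-12, seed=0):
    """Approximate Y (freq x time) by M @ G + W @ H with M held fixed.

    M     : basis matrix of the identified consonant (teacher information)
    G     : coefficient matrix (time series of weights of the basis vectors)
    W @ H : model of the acoustic components other than the unvoiced one
    Returns (U, R) with U = M @ G and R = W @ H.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = Y.shape
    G = rng.random((M.shape[1], n_time)) + eps
    W = rng.random((n_freq, n_other)) + eps
    H = rng.random((n_other, n_time)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        model = M @ G + W @ H
        G *= (M.T @ Y) / (M.T @ model + eps)
        model = M @ G + W @ H
        W *= (Y @ H.T) / (model @ H.T + eps)
        model = M @ G + W @ H
        H *= (W.T @ Y) / (W.T @ model + eps)
    return M @ G, W @ H

def separate_unvoiced(Y, consonant, bases, **kwargs):
    """Look up the basis matrix for the recognized consonant and separate U."""
    U, _ = supervised_nmf(Y, bases[consonant], **kwargs)
    return U
```

Here `Y` would be the magnitude spectrogram of SA restricted to one unvoiced section QU, and `bases` a mapping from consonant labels to stored basis matrices, for example `separate_unvoiced(Y, "s", bases)`. Only `G`, `W`, and `H` are updated; `M` is never modified, which is what makes the factorization supervised.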

The synthesis processing unit 56 generates the acoustic signal SB by synthesizing the voiced component V separated by the voiced separation unit 52 and the unvoiced component U separated by the unvoiced separation unit 54. Specifically, the synthesis processing unit 56 generates the time-domain acoustic signal SB by arranging on the time axis the voiced components V generated by the voiced separation unit 52 for each voiced section QV and the unvoiced components U generated by the unvoiced separation unit 54 for each unvoiced section QU. The result is an acoustic signal SB in which the singing sound has been selectively extracted from the acoustic signal SA of the mixture of the singing sound and the accompaniment sound of the target song. The acoustic signal SB generated by the synthesis processing unit 56 is supplied to the sound emitting device 14 and emitted as sound waves.
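The arrangement on the time axis can be sketched as follows, assuming that each separated component has already been converted back to a time-domain segment and is tagged with its start sample. The function name and the `(start, segment)` representation are hypothetical.

```python
import numpy as np

def synthesize(n_samples, voiced_parts, unvoiced_parts):
    """Place separated segments on a common time axis to form SB.

    Each part is (start_sample, samples). Regions covered by no section
    (silence or excluded sounds) stay at zero.
    """
    sb = np.zeros(n_samples)
    for start, segment in list(voiced_parts) + list(unvoiced_parts):
        segment = np.asarray(segment, dtype=float)
        end = min(start + len(segment), n_samples)  # clip at the buffer end
        sb[start:end] += segment[:end - start]
    return sb
```

Because QV and QU do not overlap in time, the addition simply concatenates the segments. A practical implementation might cross-fade at section boundaries to avoid clicks, which the source does not discuss.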

FIG. 3 is a flowchart of the operation executed by the arithmetic processing unit 22. The processing of FIG. 3 is executed repeatedly for each unit section obtained by dividing the reference acoustic signal SREF and the acoustic signal SA on the time axis. When the processing of FIG. 3 starts, the arithmetic processing unit 22 (voice analysis unit 32) executes voice analysis PA, which identifies the voiced sections QV and the unvoiced sections QU from the unit section of the reference acoustic signal SREF. Specifically, the arithmetic processing unit 22 (section specifying unit 42) identifies the voiced sections QV and the unvoiced sections QU0 from the unit section of the reference acoustic signal SREF (PA1). The arithmetic processing unit 22 (lyrics recognition unit 44) then identifies the unvoiced sections QU within the unvoiced sections QU0, together with the consonants C, by lyrics recognition on the unit section of the reference acoustic signal SREF (PA2).

After executing voice analysis PA, the arithmetic processing unit 22 (signal processing unit 34) executes signal processing PB, which generates the acoustic signal SB from the unit section of the acoustic signal SA using the voiced sections QV, the unvoiced sections QU, and the consonants C identified in voice analysis PA. Specifically, the arithmetic processing unit 22 executes a process that separates the voiced component V from the voiced sections QV in the unit section (PB1, voiced separation unit 52) and a process that separates the unvoiced component U from the unvoiced sections QU in the unit section (PB2, unvoiced separation unit 54). The consonant C is applied to the separation of the unvoiced component U. The arithmetic processing unit 22 (synthesis processing unit 56) then generates the acoustic signal SB by synthesizing the voiced component V and the unvoiced component U in the unit section (PB3).
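The order of steps PA1 to PB3 for one unit section can be summarized in a short skeleton. The individual stages are passed in as callables because the source leaves their concrete algorithms open; the function name, argument names, and return conventions are all hypothetical.

```python
def process_unit(ref_unit, mix_unit, identify_sections, recognize_lyrics,
                 separate_voiced, separate_unvoiced, synthesize):
    """Run voice analysis PA and signal processing PB on one unit section.

    ref_unit : unit section of the reference acoustic signal SREF
    mix_unit : unit section of the mixture acoustic signal SA
    """
    # PA1: voiced sections QV and initial unvoiced sections QU0 from SREF.
    qv, qu0 = identify_sections(ref_unit)
    # PA2: consonant sections QU within QU0, and the consonant C of each.
    qu, consonants = recognize_lyrics(ref_unit, qu0)
    # PB1: voiced component V from each voiced section of SA.
    voiced = [separate_voiced(mix_unit, section) for section in qv]
    # PB2: unvoiced component U from each unvoiced section of SA,
    #      using the consonant identified for that section.
    unvoiced = [separate_unvoiced(mix_unit, section, c)
                for section, c in zip(qu, consonants)]
    # PB3: combine V and U into the unit section of SB.
    return synthesize(voiced, unvoiced)
```

Note that the analysis stages only ever see the reference signal, while the separation stages only ever see the mixture; the sections QV and QU are the sole link between the two.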

As described above, in the first embodiment the voiced sections QV and the unvoiced sections QU are identified from the reference acoustic signal SREF of the user singing the target song, the voiced component V is separated from the voiced sections QV of the acoustic signal SA, and the unvoiced component U is separated from the unvoiced sections QU of the acoustic signal SA. That is, the reference acoustic signal SREF is applied as auxiliary information to the sound source separation of the voiced component V and the unvoiced component U. Compared with a configuration that separates the voiced component V and the unvoiced component U from the acoustic signal SA alone, without using the reference acoustic signal SREF, this has the advantage that the singing sound (the voiced component V and the unvoiced component U) can be separated from the acoustic signal SA of the target song with high accuracy.

In the first embodiment, the unvoiced section (consonant section) QU corresponding to a consonant of the singing sound is identified by lyrics recognition from within the unvoiced section QU0 of the reference acoustic signal SREF, and the unvoiced component U is separated from the unvoiced section QU of the acoustic signal SA. Therefore, even when the unvoiced section QU0 of the reference acoustic signal SREF contains unvoiced sound other than consonants of the singing sound (for example, the performance sound of a percussion instrument), only the consonants of the singing sound are selectively separated as the unvoiced component U. That is, the effect that the singing sound of the acoustic signal SA can be separated with high accuracy is particularly pronounced.

In the first embodiment, among the plurality of basis matrices M stored in the storage device 24, the basis matrix M corresponding to the consonant C identified by lyrics recognition on the unvoiced section QU0 of the reference acoustic signal SREF is applied to the separation of the unvoiced component U by the unvoiced separation unit 54 (supervised non-negative matrix factorization). Therefore, there is an advantage that the consonants of the singing sound in the acoustic signal SA can be separated as the unvoiced component U with high accuracy (and, consequently, that the singing sound of the acoustic signal SA can be separated with high accuracy).
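One way to picture supervised non-negative matrix factorization with a consonant-specific basis matrix M is the sketch below. It is an assumption-laden illustration, not the procedure of the embodiment: the function name `separate_unvoiced`, the dictionary `bases` keyed by consonant, the extra free basis for the accompaniment, the Euclidean multiplicative updates, and the final soft mask are all choices made for the example.

```python
import numpy as np

def separate_unvoiced(Y, bases, consonant, n_free=4, n_iter=200,
                      eps=1e-9, seed=0):
    """Hypothetical supervised NMF sketch.

    Y         : non-negative magnitude spectrogram (freq x frames) of the
                unvoiced section QU of the acoustic signal SA
    bases     : dict mapping a consonant label to its basis matrix M
                (freq x K); stands in for the matrices in storage device 24
    consonant : consonant C identified by lyrics recognition
    """
    rng = np.random.default_rng(seed)
    M = bases[consonant]            # fixed (teacher) basis for consonant C
    F, T = Y.shape
    K = M.shape[1]
    B = rng.random((F, n_free)) + eps      # free basis for other sounds
    H = rng.random((K + n_free, T)) + eps  # activations for [M, B]

    for _ in range(n_iter):
        W = np.hstack([M, B])
        # Update all activations; M itself is never updated.
        H *= (W.T @ Y) / (W.T @ W @ H + eps)
        A = H[K:]
        # Update only the free basis B.
        B *= (Y @ A.T) / (W @ H @ A.T + eps)

    W = np.hstack([M, B])
    G = H[:K]
    # Soft mask: share of the model explained by the consonant basis.
    U = Y * (M @ G) / (W @ H + eps)
    return U, G
```

Holding M fixed while fitting only the activations and the free basis is what makes the factorization supervised in this sketch: the stored frequency characteristics of the consonant steer which part of the mixture is attributed to the unvoiced component U.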

<Second Embodiment>
A second embodiment of the present invention will be described. For elements in each of the embodiments exemplified below whose operations and functions are the same as in the first embodiment, the reference signs used in the description of the first embodiment are reused, and detailed description of each is omitted as appropriate.

FIG. 4 is a configuration diagram of the sound processing apparatus 100 according to the second embodiment. As illustrated in FIG. 4, the arithmetic processing device 22 of the second embodiment functions as a learning processing unit 36 in addition to the same elements as in the first embodiment (voice analysis unit 32 and signal processing unit 34). The learning processing unit 36 sequentially generates basis matrices M by a learning process that uses the unvoiced section QU identified by the voice analysis unit 32 (section specifying unit 42) in the reference acoustic signal SREF supplied from the sound collecting device 16. Any known machine learning technique may be employed for the learning process. The unvoiced separation unit 54 of the signal processing unit 34 separates the unvoiced component U from the unvoiced section QU of the acoustic signal SA identified by the voice analysis unit 32 (lyrics recognition unit 44), by supervised non-negative matrix factorization that uses the basis matrices M sequentially generated by the learning processing unit 36 as teacher information.
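The embodiment leaves the learning technique open. As one possible illustration, a basis matrix M could be learned from the unvoiced frames of the reference acoustic signal SREF by ordinary (unsupervised) non-negative matrix factorization, as sketched below. The function name `learn_basis`, the number of bases, the multiplicative update rule, and the column normalization are assumptions made for this example.

```python
import numpy as np

def learn_basis(X, n_bases=3, n_iter=200, eps=1e-9, seed=0):
    """Hypothetical learning sketch for the learning processing unit 36.

    X : non-negative magnitude spectrogram (freq x frames) built from the
        unvoiced section of the reference acoustic signal SREF
    Returns a basis matrix M (freq x n_bases) whose columns sum to 1.
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    M = rng.random((F, n_bases)) + eps
    H = rng.random((n_bases, T)) + eps

    for _ in range(n_iter):
        # Alternate multiplicative updates of activations and bases.
        H *= (M.T @ X) / (M.T @ M @ H + eps)
        M *= (X @ H.T) / (M @ H @ H.T + eps)

    # Normalize each basis vector so that only its spectral shape remains.
    return M / (M.sum(axis=0, keepdims=True) + eps)
```

A matrix learned this way could then play the role of the teacher information in the supervised factorization, in place of a basis matrix prepared in advance.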

In the first embodiment, the singing by the user (generation of the reference acoustic signal SREF) and the generation of the acoustic signal SB are executed in parallel in real time. In the second embodiment, a configuration is preferable in which the acoustic signal SB is generated by applying each basis matrix M after the basis matrices M have been generated using the reference acoustic signal SREF (that is, a configuration in which the acoustic signal SB is generated from the acoustic signal SA after the user has finished singing the target music).

The second embodiment achieves the same effects as the first embodiment. In addition, because the basis matrix M applied to the separation of the unvoiced component U is generated by the learning process on the reference acoustic signal SREF, the second embodiment has the advantage that the basis matrix M need not be prepared in advance. In the above example, the unvoiced component U is separated from the unvoiced section QU of the acoustic signal SA identified by the lyrics recognition unit 44, but it is also possible to separate the unvoiced component U from the unvoiced section QU0 of the acoustic signal SA identified by the section specifying unit 42. Therefore, the lyrics recognition unit 44 may be omitted in the second embodiment.

<Modifications>
Each of the embodiments exemplified above can be modified in various ways. Specific modifications are exemplified below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate.

(1) In the first embodiment, the unvoiced component U is separated from the unvoiced section QU of the acoustic signal SA of the target music identified by the lyrics recognition unit 44, but the unvoiced separation unit 54 may instead separate the unvoiced component U from the unvoiced section QU0 of the acoustic signal SA identified by the section specifying unit 42. That is, the process in which the lyrics recognition unit 44 identifies the unvoiced section QU corresponding to a consonant of the singing sound may be omitted. As understood from the above description, the unvoiced section that the voice analysis unit 32 of each of the above embodiments identifies from the reference acoustic signal SREF encompasses both the unvoiced section QU0, which is the part of the voice section of the reference acoustic signal SREF other than the voiced section QV, and the section QU, which is a part of the unvoiced section QU0 (the consonant section within the unvoiced section QU0 that corresponds to a consonant of the singing sound).

(2) The information extracted from the reference acoustic signal SREF to assist the sound source separation by the signal processing unit 34 is not limited to the information exemplified in the above embodiments (voiced section QV, unvoiced section QU, consonant C). For example, the pitch extracted from the reference acoustic signal SREF may be used for the separation of the voiced component V by the voiced separation unit 52. For example, if acoustic components of the acoustic signal SA of the target music that lie within a predetermined range of the pitch extracted from the reference acoustic signal SREF are extracted as candidates for the voiced component V, the voiced component V can be separated with higher accuracy than in a configuration that does not use the pitch of the reference acoustic signal SREF. Any known pitch estimation technique may be employed to estimate the pitch of the reference acoustic signal SREF.
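The selection of candidate components near the reference pitch can be illustrated with a time-frequency mask such as the one below. This is a sketch under assumptions: the function name `pitch_candidate_mask`, the tolerance expressed in cents, the inclusion of harmonics of the pitch, and the convention that a pitch of 0 marks a frame without pitch are all introduced for the example and are not specified in the text.

```python
import numpy as np

def pitch_candidate_mask(freqs, f0, tol_cents=50.0, n_harmonics=10):
    """Hypothetical mask of candidate bins for the voiced component V.

    freqs : center frequency in Hz of each spectrogram bin
    f0    : pitch in Hz per frame, estimated from SREF (0 = no pitch)
    Returns a boolean array of shape (len(freqs), len(f0)).
    """
    freqs = np.asarray(freqs, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    mask = np.zeros((freqs.size, f0.size), dtype=bool)
    ratio = 2.0 ** (tol_cents / 1200.0)  # tolerance as a frequency ratio

    for t, pitch in enumerate(f0):
        if pitch <= 0:
            continue  # no reference pitch, so no candidates in this frame
        for h in range(1, n_harmonics + 1):
            lo, hi = h * pitch / ratio, h * pitch * ratio
            # Keep bins within the predetermined range of each harmonic.
            mask[:, t] |= (freqs >= lo) & (freqs <= hi)
    return mask
```

Multiplying the magnitude spectrogram of the acoustic signal SA by such a mask would restrict the subsequent voiced separation to components that are consistent with the pitch the user actually sang.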

(3) When the user sings poorly, the voiced section QV and the unvoiced section QU may not match between the reference acoustic signal SREF and the acoustic signal SA. Therefore, a configuration in which the voiced section QV and the unvoiced section QU are identified after the reference acoustic signal SREF has been adjusted is preferable. For example, the voice analysis unit 32 (section specifying unit 42) adjusts (aligns) the reference acoustic signal SREF on the time axis so that each time point on the time axis of the reference acoustic signal SREF coincides with the corresponding time point in the target music, and then identifies the voiced section QV and the unvoiced section QU (QU0). This configuration has the advantage that the singing sound of the target music can be separated with high accuracy even when the user sings poorly.
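The text does not name an alignment method. Dynamic time warping (DTW) is one common choice, and the sketch below uses it purely as an assumed example: `dtw_path` aligns per-frame features of the reference acoustic signal SREF with those of the target music, and `warp_mask` carries a section mask (for example, the voiced section QV) from the reference time axis onto the target time axis. Both function names and the choice of features are hypothetical.

```python
import numpy as np

def dtw_path(ref, target):
    """Return the DTW alignment path as (ref_frame, target_frame) pairs."""
    ref = np.asarray(ref, dtype=float).reshape(len(ref), -1)
    target = np.asarray(target, dtype=float).reshape(len(target), -1)
    n, m = len(ref), len(target)
    # Pairwise Euclidean distance between frames.
    cost = np.linalg.norm(ref[:, None, :] - target[None, :, :], axis=2)

    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])

    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(acc[i - 1, j - 1], i - 1, j - 1),
                 (acc[i - 1, j], i - 1, j),
                 (acc[i, j - 1], i, j - 1)]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    return path[::-1]

def warp_mask(ref_mask, path, n_target):
    """Map a per-frame section mask from the reference onto the target."""
    out = np.zeros(n_target, dtype=bool)
    for i, j in path:
        out[j] |= bool(ref_mask[i])
    return out
```

With the mask warped onto the time axis of the target music, sections identified from an early or late performance by the user would still land on the corresponding portions of the acoustic signal SA.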

(4) In each of the above embodiments, the separation of the voiced component V by the voiced separation unit 52 and the separation of the unvoiced component U by the unvoiced separation unit 54 are executed individually, but a configuration in which both the voiced component V and the unvoiced component U are separated from the acoustic signal SA in a single operation may also be adopted. That is, the voiced separation unit 52 and the unvoiced separation unit 54 can also be regarded as a single integrated element.

(5) In each of the above embodiments, the singing sound is extracted from the acoustic signal SA of the mixed sound of the singing sound and the accompaniment sound, but the accompaniment sound may also be extracted from the acoustic signal SA. For example, by subtracting the acoustic signal SB generated in each of the above embodiments from the acoustic signal SA, an acoustic signal in which the accompaniment sound of the target music is separated (emphasized or extracted) can be generated.
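The subtraction described in this modification amounts to a sample-wise difference, as in the short sketch below. The function name `extract_accompaniment` and the requirement that both signals be time-aligned arrays of equal length are assumptions made for the example.

```python
import numpy as np

def extract_accompaniment(sa, sb):
    """Hypothetical sketch: accompaniment = SA - SB, sample by sample."""
    sa = np.asarray(sa, dtype=float)
    sb = np.asarray(sb, dtype=float)
    if sa.shape != sb.shape:
        raise ValueError("SA and SB must be time-aligned and equally long")
    # What remains after removing the separated singing sound SB.
    return sa - sb
```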

(6) The sound processing apparatus 100 may also be realized by a server device that communicates with a terminal device such as a mobile phone. For example, the sound processing apparatus 100 generates the acoustic signal SB from the acoustic signal SA by using the reference acoustic signal SREF received from the terminal device via a communication network, and transmits the acoustic signal SB to the terminal device. The acoustic signal SA to be processed is either a signal supplied from the signal supply device 12 connected to the sound processing apparatus 100 or a signal that the sound processing apparatus 100 receives from the terminal device via the communication network.

DESCRIPTION OF SYMBOLS
100 ... sound processing apparatus, 12 ... signal supply device, 14 ... sound emitting device, 16 ... sound collecting device, 22 ... arithmetic processing device, 24 ... storage device, 32 ... voice analysis unit, 34 ... signal processing unit, 36 ... learning processing unit, 42 ... section specifying unit, 44 ... lyrics recognition unit, 52 ... voiced separation unit, 54 ... unvoiced separation unit, 56 ... synthesis processing unit.

Claims

Voice analysis means for identifying voiced and unvoiced sections for a reference acoustic signal representing a singing sound of a user singing a song;
Voiced separation means for separating the voiced component of the singing sound from the acoustic signal of the mixed sound of the singing sound of the music and the accompaniment sound for the voiced section specified by the voice analysis means;
Unvoiced separation means for separating the unvoiced component of the singing sound for the unvoiced section identified by the voice analysis means of the acoustic signal;
A sound processing apparatus comprising: synthesis processing means for synthesizing the voiced component separated by the voiced separation means and the unvoiced component separated by the unvoiced separation means.

The voice analysis means includes:
Section specifying means for specifying a voiced section and an unvoiced section of the reference acoustic signal; and
Lyrics recognition means for identifying, by lyrics recognition on the reference acoustic signal, a consonant section of the unvoiced section that corresponds to a consonant of the singing sound,
The sound processing apparatus according to claim 1, wherein the unvoiced separation means separates the unvoiced component of the singing sound for the consonant section of the acoustic signal identified by the lyrics recognition means.

The lyric recognition means identifies a consonant of the reference acoustic signal by the lyric recognition,
The unvoiced separation means separates unvoiced components by supervised non-negative matrix factorization using the basis matrix of the consonant specified by the lyrics recognition means as teacher information among a plurality of basis matrices representing frequency characteristics of different consonants. The sound processing apparatus according to claim 2.

The sound processing apparatus according to claim 3, further comprising learning processing means for generating the basis matrix by a learning process using the unvoiced section specified by the section specifying means in the reference acoustic signal.

Comprising learning processing means for generating the basis matrix by a learning process using the unvoiced section specified by the section specifying means in the reference acoustic signal;
The unvoiced separation means separates the unvoiced component by supervised non-negative matrix factorization using the basis matrix generated by the learning processing means as teacher information.
The sound processing apparatus according to claim 3.

  A sound processing method in which a computer:
  identifies a voiced section and an unvoiced section of a reference acoustic signal representing the singing sound of a user singing a piece of music;
  separates the voiced component of the singing sound for the voiced section of an acoustic signal of a mixed sound of the singing sound and the accompaniment sound of the music;
  separates the unvoiced component of the singing sound for the unvoiced section of the acoustic signal; and
  synthesizes the voiced component and the unvoiced component.