JP6299140B2 - Sound processing apparatus and sound processing method - Google Patents

Sound processing apparatus and sound processing method

Info

Publication number
JP6299140B2
JP6299140B2 (Application JP2013216141A)
Authority
JP
Japan
Prior art keywords
unvoiced
sound
section
acoustic signal
voiced
Prior art date
Legal status
Expired - Fee Related
Application number
JP2013216141A
Other languages
Japanese (ja)
Other versions
JP2015079122A (en)
Inventor
慶太 有元
多伸 近藤
祐 高橋
Current Assignee
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date
Filing date
Publication date
Application filed by Yamaha Corp
Priority to JP2013216141A
Publication of JP2015079122A
Application granted
Publication of JP6299140B2
Status: Expired - Fee Related
Anticipated expiration


Description

The present invention relates to techniques for processing an acoustic signal representing sound.

Sound source separation techniques that separate a specific acoustic component from a mixture of acoustic components produced by different sound sources have been proposed. For example, Non-Patent Documents 1 and 2 disclose techniques that separate the singing voice from the acoustic signal of a mixture of a song's singing voice and its accompaniment, using a source-filter model that expresses the frequency characteristics of the singing voice as the combination of a harmonic characteristic (source) and an envelope characteristic (filter).

Non-Patent Document 1: Jean-Louis Durrieu, et al., "Main Instrument Separation from Stereophonic Audio Signals Using a Source/Filter Model," in Proc. EUSIPCO, pp. 15-18, 2009.
Non-Patent Document 2: Jean-Louis Durrieu, et al., "A Musically Motivated Mid-Level Representation for Pitch Estimation and Musical Audio Source Separation," IEEE Journal of Selected Topics in Signal Processing, 5(6), pp. 1180-1191, 2011.

With the techniques of Non-Patent Documents 1 and 2, however, an acoustic component whose characteristics resemble those of a singing voice (for example, a percussion sound resembling a consonant) may be erroneously extracted as singing in sections where no singing voice is actually present. In view of these circumstances, an object of the present invention is to separate the singing voice from an acoustic signal with high accuracy.

To solve the above problems, a sound processing apparatus of the present invention comprises: voice analysis means for identifying voiced sections and unvoiced sections of a reference acoustic signal representing a singing sound of a user singing a song; voiced separation means for separating the voiced component of the singing sound, in the voiced sections identified by the voice analysis means, from an acoustic signal of a mixture of the song's singing sound and accompaniment; unvoiced separation means for separating the unvoiced component of the singing sound from the acoustic signal in the unvoiced sections identified by the voice analysis means; and synthesis processing means for combining the voiced component separated by the voiced separation means with the unvoiced component separated by the unvoiced separation means. With this configuration, voiced and unvoiced sections are identified from the reference acoustic signal of the user's singing, the voiced component is separated from the voiced sections of the acoustic signal, and the unvoiced component is separated from its unvoiced sections. The singing voice can therefore be separated with higher accuracy than in a configuration that separates the voiced and unvoiced components from the acoustic signal without using a reference acoustic signal.

A voiced section is a section dominated by voiced components, in which a harmonic structure is clearly observed. An unvoiced section, in contrast, is a section dominated by unvoiced components, in which no clear harmonic structure is observed; it is distinct from a silent section, in which no voice is present at all.

In a preferred aspect of the present invention, the voice analysis means includes section identification means for identifying the voiced and unvoiced sections of the reference acoustic signal, and lyrics recognition means for identifying, by lyrics recognition applied to the reference acoustic signal, the consonant sections within the unvoiced sections that correspond to consonants of the singing sound; the unvoiced separation means separates the unvoiced component of the singing sound from the acoustic signal in the consonant sections identified by the lyrics recognition means. In this aspect, the consonant sections corresponding to sung consonants are identified by lyrics recognition among the unvoiced sections of the reference acoustic signal, and the unvoiced component is separated from the consonant sections of the acoustic signal. Even when the reference acoustic signal contains unvoiced sounds other than sung consonants, the singing voice can therefore be separated from the acoustic signal with high accuracy.

In a preferred aspect of the present invention, the lyrics recognition means identifies the consonants of the reference acoustic signal by lyrics recognition, and the unvoiced separation means separates the unvoiced component by supervised non-negative matrix factorization, using as teacher information the basis matrix of the identified consonant, selected from a plurality of basis matrices representing the frequency characteristics of different consonants. In this aspect, the basis matrix corresponding to the consonant identified by lyrics recognition of the reference acoustic signal is applied to the supervised non-negative matrix factorization that separates the unvoiced component from the acoustic signal, so sung consonants can be separated from the acoustic signal as the unvoiced component with high accuracy.

A sound processing apparatus according to another preferred aspect comprises learning processing means for generating the basis matrix by a learning process that uses the unvoiced sections of the reference acoustic signal identified by the section identification means. In this aspect, the basis matrix applied to the separation of the unvoiced component is generated by learning from the reference acoustic signal, so there is no need to prepare the basis matrix in advance. Whether a configuration with the learning processing means also includes the lyrics recognition means is immaterial.

The sound processing apparatus of each aspect above may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to acoustic signal processing, or by the cooperation of a general-purpose processing unit such as a CPU (Central Processing Unit) with a program. The program of the present invention may be provided stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known form of recording medium, such as a semiconductor or magnetic recording medium, may be encompassed. The program may also be provided by distribution over a communication network and installed on a computer. The present invention is also specified as the method of operation of the sound processing apparatus of each aspect above (a sound processing method).

FIG. 1 is a configuration diagram of a sound processing apparatus according to the first embodiment of the present invention.
FIG. 2 is a detailed configuration diagram of the sound processing apparatus.
FIG. 3 is a flowchart of the operation of the sound processing apparatus.
FIG. 4 is a configuration diagram of a sound processing apparatus according to the second embodiment.

FIG. 1 is a configuration diagram of a sound processing apparatus 100 according to the first embodiment of the present invention. As illustrated in FIG. 1, a signal supply device 12, a sound emission device 14, and a sound collection device 16 are connected to the sound processing apparatus 100. The signal supply device 12 supplies an acoustic signal SA to the sound processing apparatus 100. The acoustic signal SA is a time-domain signal representing the waveform of a mixture of acoustic components with differing acoustic characteristics (for example, voice and instrumental sound). For example, a playback device that reads the acoustic signal SA from a portable or built-in recording medium (typically a music CD) and supplies it to the sound processing apparatus 100 may serve as the signal supply device 12. The signal supply device 12 may also be integrated with the sound processing apparatus 100.

In the first embodiment, the acoustic signal SA of a mixture of the singing sound and accompaniment of a specific song (hereinafter the "target song") is supplied from the signal supply device 12 to the sound processing apparatus 100. The singing sound can contain a voiced component and an unvoiced component. The voiced component is an acoustic component exhibiting a harmonic (overtone) structure, in which a fundamental and multiple harmonics are arranged on the frequency axis at integer multiples of the fundamental frequency. The unvoiced component is an acoustic component in which no clear harmonic structure is observed. Typically, sung vowels correspond to the voiced component, while consonants such as fricatives and plosives (unvoiced consonants) correspond to the unvoiced component. The accompaniment, on the other hand, comprises the tones of multiple different instruments.

The sound processing apparatus 100 of the first embodiment is a signal processing apparatus (sound source separation apparatus) that generates an acoustic signal SB by acoustic processing of the acoustic signal SA supplied from the signal supply device 12. The acoustic signal SB is a time-domain signal representing the waveform of the singing sound separated from the acoustic signal SA (that is, sound in which the song's accompaniment is suppressed). The sound emission device 14 (for example, a loudspeaker or headphones) emits sound waves corresponding to the acoustic signal SB generated by the sound processing apparatus 100. A D/A converter that converts the acoustic signal SB from digital to analog is omitted from the figure for convenience.

The sound collection device 16 picks up ambient sound and generates an acoustic signal representing its time waveform. In the first embodiment, the sound collection device 16 supplies the sound processing apparatus 100 with an acoustic signal SREF (hereinafter the "reference acoustic signal") of the user singing the target song (its vocal part). The supply of the reference acoustic signal SREF from the sound collection device 16 to the sound processing apparatus 100 (the user's singing of the target song) proceeds in real time in parallel with the acoustic processing of the acoustic signal SA and the playback of the resulting acoustic signal SB. An A/D converter that converts the reference acoustic signal SREF from analog to digital is omitted from the figure for convenience.

As illustrated in FIG. 1, the sound processing apparatus 100 is realized by a computer system comprising a processing unit 22 and a storage device 24. The storage device 24 stores the program executed by the processing unit 22 and the various data used by the processing unit 22. Any known recording medium, such as a semiconductor or magnetic recording medium, or a combination of multiple types of recording media, may serve as the storage device 24. A configuration in which the acoustic signal SA is stored in the storage device 24 (so that the signal supply device 12 can be omitted) is also suitable.

The storage device 24 of the first embodiment stores a plurality of basis matrices M corresponding to different consonants. The basis matrix M for any one consonant is an acoustic model (consonant model) expressing the frequency characteristics of that consonant. As illustrated in FIG. 1, the basis matrix M of the first embodiment is a non-negative matrix whose columns are basis vectors m representing typical spectral shapes (frequency spectra) of the consonant; it is used as teacher information (prior information) in non-negative matrix factorization (NMF) of the acoustic signal SA.
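As an illustration of what such a consonant model might look like in practice, the following is a minimal sketch that learns one basis matrix per consonant by multiplicative-update NMF from magnitude spectrograms of isolated consonant recordings. The training data, rank, and iteration count are assumptions for illustration; the patent does not specify how the stored matrices are produced.

```python
# Hypothetical sketch: one basis matrix M per consonant, learned with
# multiplicative-update NMF (Euclidean cost). Training spectrograms,
# rank, and iteration count are illustrative assumptions.
import numpy as np

def train_basis(S, rank, n_iter=200, eps=1e-9):
    """S: non-negative magnitude spectrogram (freq_bins x frames).
    Returns a basis matrix M (freq_bins x rank)."""
    rng = np.random.default_rng(0)
    F, T = S.shape
    M = rng.random((F, rank)) + eps   # basis vectors m in the columns
    G = rng.random((rank, T)) + eps   # activations
    for _ in range(n_iter):
        # standard Lee-Seung updates for ||S - M G||_F^2
        G *= (M.T @ S) / (M.T @ M @ G + eps)
        M *= (S @ G.T) / (M @ G @ G.T + eps)
    return M / (M.sum(axis=0, keepdims=True) + eps)  # column-normalize

# One basis matrix per consonant, keyed by phoneme label. The random
# arrays stand in for spectrograms of isolated consonant recordings
# (assumed to exist; not specified in the patent).
rng = np.random.default_rng(1)
consonant_spectrograms = {"s": rng.random((513, 400)),
                          "k": rng.random((513, 300))}
bases = {c: train_basis(S, rank=10) for c, S in consonant_spectrograms.items()}
```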

By executing the program stored in the storage device 24, the processing unit 22 realizes a plurality of functions (a voice analysis unit 32 and a signal processing unit 34) for generating the acoustic signal SB from the acoustic signal SA. The voice analysis unit 32 analyzes the acoustic characteristics of the reference acoustic signal SREF supplied from the sound collection device 16. The signal processing unit 34 generates the acoustic signal SB from the acoustic signal SA using the results of the voice analysis unit 32's analysis of the reference acoustic signal SREF. That is, in the first embodiment the reference acoustic signal SREF of the user singing the target song is used as information assisting the sound source separation of the acoustic signal SA. Configurations in which the functions of the processing unit 22 are distributed over multiple integrated circuits, or in which a dedicated electronic circuit (for example, a DSP) realizes some of its functions, may also be adopted. In practice there are also elements that convert the time-domain acoustic signal SA into the frequency domain, for example by discrete Fourier transform, and that convert the frequency-domain acoustic signal SB back into the time domain, for example by inverse discrete Fourier transform; their description and illustration are omitted below for convenience.
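The omitted analysis/synthesis stages might be realized as in the following sketch, which uses SciPy's STFT and inverse STFT; the window length and hop size are illustrative assumptions rather than values from the patent.

```python
# Hypothetical sketch of the omitted transform stages: a short-time
# Fourier transform into the frequency domain and back. Window length
# and hop are illustrative, not taken from the patent.
import numpy as np
from scipy.signal import stft, istft

fs = 44100
sa = np.random.randn(fs * 2)                 # stand-in for acoustic signal SA

f, t, SA = stft(sa, fs=fs, nperseg=2048, noverlap=1536)
magnitude = np.abs(SA)                       # observation matrix for NMF
phase = np.exp(1j * np.angle(SA))            # phase reused at resynthesis

SB = magnitude * phase                       # separated spectrogram goes here
_, sb = istft(SB, fs=fs, nperseg=2048, noverlap=1536)
```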

FIG. 2 is a detailed configuration diagram of the sound processing apparatus 100. The voice analysis unit 32 successively identifies voiced sections QV (V: voiced) and unvoiced sections QU (U: unvoiced) on the time axis of the reference acoustic signal SREF supplied from the sound collection device 16. A voiced section QV is a portion of the voice sections of the reference acoustic signal SREF (the sections where voice is present) dominated by the voiced component, and an unvoiced section QU is a portion of those voice sections dominated by the unvoiced component. As illustrated in FIG. 2, the voice analysis unit 32 of the first embodiment comprises a section identification unit 42 and a lyrics recognition unit 44.

The section identification unit 42 successively identifies the voiced sections QV and the unvoiced sections QU0 (the sections from which the unvoiced sections QU are derived) of the reference acoustic signal SREF. Any known technique may be used to identify the voiced sections QV and unvoiced sections QU0. For example, the section identification unit 42 detects the voice sections of the reference acoustic signal SREF where singing is present by known voice activity detection (VAD), then identifies the portions of those voice sections where a significant pitch is observed (that is, where a clear harmonic structure exists) as voiced sections QV, and the portions where no significant pitch is observed (where no clear harmonic structure exists) as unvoiced sections QU0.
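A minimal sketch of such a section identification step is shown below, combining a frame-energy VAD with an autocorrelation pitch-clarity test; the thresholds, frame size, and pitch range are assumptions rather than values from the patent.

```python
# Hypothetical sketch of section identification: frame-level energy VAD
# plus an autocorrelation pitch-clarity test. Thresholds, frame sizes,
# and lag range are illustrative assumptions.
import numpy as np

def classify_frames(x, fs, frame=1024, hop=512,
                    vad_thresh=0.01, voicing_thresh=0.4):
    """Returns one label per frame: 'QV', 'QU0', or 'silence'."""
    labels = []
    lo, hi = int(fs / 800), int(fs / 60)     # plausible pitch lags, 60-800 Hz
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame]
        if np.sqrt(np.mean(w ** 2)) < vad_thresh:
            labels.append('silence')         # no voice at all
            continue
        w = w - w.mean()
        ac = np.correlate(w, w, mode='full')[frame - 1:]
        peak = ac[lo:hi].max() / (ac[0] + 1e-9)  # normalized pitch clarity
        labels.append('QV' if peak > voicing_thresh else 'QU0')
    return labels
```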

The lyrics recognition unit 44 performs lyrics recognition on the reference acoustic signal SREF. In the first embodiment, through lyrics recognition of the reference acoustic signal SREF, the lyrics recognition unit 44 successively identifies, within the unvoiced sections QU0 identified by the section identification unit 42, the sections corresponding to sung consonants (consonant sections) as the unvoiced sections QU, and successively identifies the consonants (phonetic content) C present in the unvoiced sections QU0. The unvoiced sections QU are what remains of the initial unvoiced sections QU0 after excluding sections dominated by unvoiced sounds other than singing (for example, percussion sounds). That is, part of each unvoiced section QU0 is identified as an unvoiced section QU (consonant section). Any known speech recognition technique may be used for the lyrics recognition (speech recognition) by the lyrics recognition unit 44.

The signal processing unit 34 generates the acoustic signal SB from the acoustic signal SA by signal processing (sound source separation) that applies the voiced sections QV, unvoiced sections QU, and consonants C identified by the voice analysis unit 32. The generation of the acoustic signal SB by the signal processing unit 34 runs in real time, in parallel with the user's singing (the voice analysis unit 32's processing of the reference acoustic signal SREF of the singing sound). As illustrated in FIG. 2, the signal processing unit 34 of the first embodiment comprises a voiced separation unit 52, an unvoiced separation unit 54, and a synthesis processing unit 56.

The voiced separation unit 52 separates (emphasizes or extracts) the voiced component V of the singing sound from each voiced section QV, identified by the voice analysis unit 32 (section identification unit 42), of the acoustic signal SA supplied from the signal supply device 12. Any known sound source separation technique may be used. In particular, the source separation technique of Non-Patent Documents 1 and 2 (V-IMM: "Voiced" Instantaneous Mixture Model), which uses a source-filter model expressing the voiced component of the singing sound by harmonic and envelope characteristics, is well suited to the separation of the voiced component V by the voiced separation unit 52. That is, the voiced component V is expressed as the element-wise product (Hadamard product) of a non-negative matrix corresponding to the time series of harmonic characteristics (source), arising from the vibration of a sound source such as the vocal folds, and a non-negative matrix corresponding to the time series of envelope characteristics (filter), arising from modulation in a resonant tube such as the vocal tract; the voiced component V is estimated by iterating prescribed update equations so that the sum of the voiced component V and the acoustic components other than V approximates the frequency characteristics of the acoustic signal SA (the observation matrix expressing its spectrogram).
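The following is a heavily simplified sketch of this kind of source-filter estimation: the voiced part is modeled as the Hadamard product of a fixed harmonic-comb (source) dictionary and a smooth-envelope (filter) dictionary, with a free NMF term absorbing the accompaniment, under standard KL multiplicative updates. The dictionaries, pitch grid, and iteration budget are assumptions; the cited papers' full IMM includes refinements omitted here.

```python
# Highly simplified sketch of a source-filter ("V-IMM"-style) voiced-
# component estimator under a KL cost. Not the papers' exact algorithm.
import numpy as np

eps = 1e-9

def harmonic_comb_basis(freqs, f0_grid, width=20.0):
    """One spectral comb per candidate pitch (the 'source' dictionary)."""
    W = np.zeros((len(freqs), len(f0_grid)))
    for j, f0 in enumerate(f0_grid):
        for h in range(1, int(freqs[-1] // f0) + 1):
            W[:, j] += np.exp(-0.5 * ((freqs - h * f0) / width) ** 2)
    return W + eps

def envelope_basis(n_bins, n_bumps=16):
    """Overlapping smooth bumps (the 'filter' dictionary)."""
    centers = np.linspace(0, n_bins - 1, n_bumps)
    bins = np.arange(n_bins)[:, None]
    return np.exp(-0.5 * ((bins - centers) / (n_bins / n_bumps)) ** 2) + eps

def separate_voiced(X, freqs, n_iter=100):
    """X: magnitude spectrogram of a voiced section QV (freq x frames).
    Returns the estimated voiced component V (same shape)."""
    rng = np.random.default_rng(0)
    WF = harmonic_comb_basis(freqs, np.arange(80.0, 800.0, 10.0))
    WK = envelope_basis(X.shape[0])
    HF = rng.random((WF.shape[1], X.shape[1])) + eps   # pitch activations
    HK = rng.random((WK.shape[1], X.shape[1])) + eps   # envelope activations
    WR = rng.random((X.shape[0], 8)) + eps             # accompaniment residual
    HR = rng.random((8, X.shape[1])) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        S, P = WF @ HF, WK @ HK                        # source and filter parts
        R = X / (S * P + WR @ HR + eps)                # KL multiplicative updates
        HF *= (WF.T @ (R * P)) / (WF.T @ P + eps)
        S = WF @ HF; R = X / (S * P + WR @ HR + eps)
        HK *= (WK.T @ (R * S)) / (WK.T @ S + eps)
        P = WK @ HK; R = X / (S * P + WR @ HR + eps)
        WR *= (R @ HR.T) / (ones @ HR.T + eps)
        R = X / (S * P + WR @ HR + eps)
        HR *= (WR.T @ R) / (WR.T @ ones + eps)
    return (WF @ HF) * (WK @ HK)

# e.g. V = separate_voiced(np.abs(SA_section), np.linspace(0, 22050, 1025))
```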

The unvoiced separation unit 54 separates (emphasizes or extracts) the unvoiced component U of the singing sound from each unvoiced section QU, identified by the voice analysis unit 32 (lyrics recognition unit 44), of the acoustic signal SA supplied from the signal supply device 12. Although any known sound source separation technique may be used, the unvoiced separation unit 54 of the first embodiment estimates the unvoiced component U by non-negative matrix factorization using a basis matrix M stored in the storage device 24. Specifically, the unvoiced separation unit 54 retrieves, from the plurality of basis matrices M stored in the storage device 24, the basis matrix M corresponding to the consonant C identified by the lyrics recognition unit 44, and separates the unvoiced component U by supervised non-negative matrix factorization (supervised NMF) using that basis matrix M as teacher information (prior information). The technique disclosed in JP-A-2013-33196, for example, is well suited to the supervised NMF performed by the unvoiced separation unit 54. Specifically, the unvoiced component U, expressed as the matrix product of the basis matrix M of consonant C and a coefficient matrix representing the time series of weights for each basis vector m, is estimated by iterating prescribed update equations so that the sum of the unvoiced component U and the acoustic components other than U approximates the frequency characteristics (observation matrix) of the acoustic signal SA. As the above makes clear, in the first embodiment the voiced component V and the unvoiced component U of the singing sound in the acoustic signal SA are separated by different methods.
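A minimal sketch of the supervised factorization might look as follows: the consonant's basis matrix M is held fixed as teacher information while its activations and a free accompaniment model are updated. The rank and iteration count are assumptions, and this is not the exact procedure of JP-A-2013-33196.

```python
# Hypothetical sketch of supervised NMF for an unvoiced section QU: the
# consonant basis M stays fixed while a free basis absorbs the rest.
import numpy as np

eps = 1e-9

def separate_unvoiced(X, M, free_rank=20, n_iter=200):
    """X: magnitude spectrogram of the unvoiced section (freq x frames).
    M: fixed consonant basis (freq x k). Returns unvoiced component U."""
    rng = np.random.default_rng(0)
    G = rng.random((M.shape[1], X.shape[1])) + eps   # weights for M (learned)
    W = rng.random((X.shape[0], free_rank)) + eps    # accompaniment basis
    H = rng.random((free_rank, X.shape[1])) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        R = X / (M @ G + W @ H + eps)                # KL multiplicative updates
        G *= (M.T @ R) / (M.T @ ones + eps)          # M itself stays fixed
        R = X / (M @ G + W @ H + eps)
        W *= (R @ H.T) / (ones @ H.T + eps)
        R = X / (M @ G + W @ H + eps)
        H *= (W.T @ R) / (W.T @ ones + eps)
    return M @ G                                     # estimated unvoiced component U
```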

The synthesis processing unit 56 generates the acoustic signal SB by combining the voiced component V separated by the voiced separation unit 52 with the unvoiced component U separated by the unvoiced separation unit 54. Specifically, the synthesis processing unit 56 generates the time-domain acoustic signal SB by arranging on the time axis the voiced components V generated by the voiced separation unit 52 for each voiced section QV and the unvoiced components U generated by the unvoiced separation unit 54 for each unvoiced section QU. The acoustic signal SB thus selectively extracts the singing sound from the acoustic signal SA of the mixture of the target song's singing sound and accompaniment. The acoustic signal SB generated by the synthesis processing unit 56 is supplied to the sound emission device 14 and emitted as sound waves.
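The synthesis step can be pictured as in this small sketch, which writes the per-section time-domain components back at their section offsets on a silent timeline; the section bookkeeping is an assumed representation, not taken from the patent.

```python
# Hypothetical sketch of the synthesis step: per-section components are
# placed at their section offsets on a silent timeline of the song.
import numpy as np

def synthesize(total_len, voiced_parts, unvoiced_parts):
    """voiced_parts / unvoiced_parts: lists of (start_sample, waveform)
    for each section QV / QU. Returns the time-domain signal SB."""
    sb = np.zeros(total_len)
    for start, wav in list(voiced_parts) + list(unvoiced_parts):
        end = min(start + len(wav), total_len)
        sb[start:end] += wav[:end - start]   # sections do not overlap by design
    return sb
```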

FIG. 3 is a flowchart of the operation executed by the processing unit 22. The processing of FIG. 3 is executed repeatedly for each unit section into which the reference acoustic signal SREF and the acoustic signal SA are divided on the time axis. When the processing of FIG. 3 starts, the processing unit 22 (voice analysis unit 32) executes voice analysis PA, which identifies the voiced sections QV and unvoiced sections QU in the unit section of the reference acoustic signal SREF. Specifically, the processing unit 22 (section identification unit 42) identifies the voiced sections QV and unvoiced sections QU0 in the unit section of the reference acoustic signal SREF (PA1). The processing unit 22 (lyrics recognition unit 44) then identifies, by lyrics recognition of the unit section of the reference acoustic signal SREF, the unvoiced sections QU within the unvoiced sections QU0 and the consonants C (PA2).

After executing the voice analysis PA, the processing unit 22 (signal processing unit 34) executes signal processing PB, which generates the acoustic signal SB from the unit section of the acoustic signal SA using the voiced sections QV, unvoiced sections QU, and consonants C identified by the voice analysis PA. Specifically, the processing unit 22 separates the voiced component V from the voiced sections QV in the unit section (PB1, voiced separation unit 52) and separates the unvoiced component U from the unvoiced sections QU in the unit section (PB2, unvoiced separation unit 54); the consonant C is applied to the separation of the unvoiced component U. The processing unit 22 (synthesis processing unit 56) then generates the acoustic signal SB by combining the voiced component V and the unvoiced component U in the unit section (PB3).

As described above, in the first embodiment the voiced sections QV and unvoiced sections QU are identified from the reference acoustic signal SREF of the user singing the target song, the voiced component V is separated from the voiced sections QV of the acoustic signal SA, and the unvoiced component U is separated from its unvoiced sections QU. That is, the reference acoustic signal SREF is applied as auxiliary information to the source separation of the voiced component V and unvoiced component U. Compared with a configuration that separates the voiced and unvoiced components from the acoustic signal SA alone, without the reference acoustic signal SREF, the singing sound (voiced component V and unvoiced component U) can therefore be separated from the target song's acoustic signal SA with high accuracy.

In the first embodiment, the unvoiced sections (consonant sections) QU corresponding to sung consonants are identified by lyrics recognition within the unvoiced sections QU0 of the reference acoustic signal SREF, and the unvoiced component U is separated from the unvoiced sections QU of the acoustic signal SA. Even when the unvoiced sections QU0 of the reference acoustic signal SREF contain unvoiced sounds other than sung consonants (for example, percussion sounds), only the sung consonants are selectively separated as the unvoiced component U. The effect of separating the singing sound of the acoustic signal SA with high accuracy is therefore especially pronounced.

In the first embodiment, among the plurality of basis matrices M stored in the storage device 24, the basis matrix M corresponding to the consonant C identified by lyrics recognition of the unvoiced sections QU0 of the reference acoustic signal SREF is applied to the separation of the unvoiced component U by the unvoiced separation unit 54 (supervised non-negative matrix factorization). Sung consonants in the acoustic signal SA can therefore be separated as the unvoiced component U with high accuracy (and thus the singing sound of the acoustic signal SA can be separated with high accuracy).

<Second Embodiment>
A second embodiment of the present invention will now be described. In each configuration exemplified below, elements whose operations and functions are the same as in the first embodiment reuse the reference signs from the description of the first embodiment, and their detailed descriptions are omitted as appropriate.

FIG. 4 is a configuration diagram of the sound processing apparatus 100 in the second embodiment. As illustrated in FIG. 4, the processing unit 22 of the second embodiment functions as a learning processing unit 36 in addition to the same elements as in the first embodiment (voice analysis unit 32, signal processing unit 34). The learning processing unit 36 successively generates basis matrices M by a learning process that applies the unvoiced sections QU identified by the voice analysis unit 32 (section identification unit 42) in the reference acoustic signal SREF supplied from the sound collection device 16. Any known machine learning technique may be used for the learning process. The unvoiced separation unit 54 of the signal processing unit 34 separates the unvoiced component U from the unvoiced sections QU, identified in the acoustic signal SA by the voice analysis unit 32 (lyrics recognition unit 44), by supervised non-negative matrix factorization using the basis matrices M successively generated by the learning processing unit 36 as teacher information.
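A minimal sketch of such a learning process is given below: a basis matrix M is learned by multiplicative-update NMF restricted to the spectrogram frames of the identified unvoiced sections. The rank and iteration budget are assumptions for illustration.

```python
# Hypothetical sketch of the learning processing unit 36: a basis matrix
# M is learned from the frames of unvoiced sections QU of SREF only,
# with KL multiplicative-update NMF.
import numpy as np

eps = 1e-9

def learn_basis_from_sections(spectrogram, sections, rank=10, n_iter=150):
    """sections: list of (start_frame, end_frame) for unvoiced sections QU.
    Returns a basis matrix M learned from those frames only."""
    frames = np.concatenate([spectrogram[:, s:e] for s, e in sections], axis=1)
    rng = np.random.default_rng(0)
    M = rng.random((frames.shape[0], rank)) + eps
    G = rng.random((rank, frames.shape[1])) + eps
    ones = np.ones_like(frames)
    for _ in range(n_iter):
        R = frames / (M @ G + eps)
        M *= (R @ G.T) / (ones @ G.T + eps)
        R = frames / (M @ G + eps)
        G *= (M.T @ R) / (M.T @ ones + eps)
    return M / (M.sum(axis=0, keepdims=True) + eps)
```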

In the first embodiment, the user's singing (generation of the reference acoustic signal SREF) and the generation of the acoustic signal SB ran in parallel in real time. In the second embodiment, a configuration is preferable in which the acoustic signal SB is generated by applying each basis matrix M after those matrices have been generated using the reference acoustic signal SREF (that is, the acoustic signal SB is generated from the acoustic signal SA after the user has sung the target song).

The second embodiment achieves the same effects as the first. Moreover, in the second embodiment the basis matrices M applied to the separation of the unvoiced component U are generated by learning from the reference acoustic signal SREF, so there is no need to prepare the basis matrices M in advance. In the example above, the unvoiced component U was separated from the unvoiced sections QU of the acoustic signal SA identified by the lyrics recognition unit 44, but it may instead be separated from the unvoiced sections QU0 identified by the section identification unit 42; the lyrics recognition unit 44 can therefore be omitted in the second embodiment.

<Modifications>
Each of the embodiments exemplified above may be modified in various ways. Specific modifications are exemplified below; two or more aspects chosen arbitrarily from the following examples may be combined as appropriate.

(1) In the first embodiment, the unvoiced component U was separated from the unvoiced sections QU of the target song's acoustic signal SA identified by the lyrics recognition unit 44, but the unvoiced separation unit 54 may instead separate the unvoiced component U from the unvoiced sections QU0 identified by the section identification unit 42; that is, the processing in which the lyrics recognition unit 44 identifies the unvoiced sections QU corresponding to sung consonants may be omitted. As this makes clear, the unvoiced sections that the voice analysis unit 32 of each embodiment identifies from the reference acoustic signal SREF encompass both the unvoiced sections QU0, the portions of the voice sections of the reference acoustic signal SREF other than the voiced sections QV, and the unvoiced sections QU that are a subset of QU0 (the consonant sections of QU0 corresponding to sung consonants).

(2) The information extracted from the reference acoustic signal SREF to assist the source separation by the signal processing unit 34 is not limited to the information exemplified above (voiced sections QV, unvoiced sections QU, consonants C). For example, the pitch extracted from the reference acoustic signal SREF may be used in the separation of the voiced component V by the voiced separation unit 52. If, for example, acoustic components of the target song's acoustic signal SA lying within a prescribed range of the pitch extracted from the reference acoustic signal SREF are extracted as candidates for the voiced component V, the voiced component V can be separated with higher accuracy than in a configuration that does not use the pitch of the reference acoustic signal SREF. Any known pitch estimation technique may be used to estimate the pitch of the reference acoustic signal SREF.
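One way to picture this modification is the sketch below: a binary mask retains only the bins of the acoustic signal SA that lie near harmonics of the per-frame pitch estimated from the reference acoustic signal SREF, yielding voiced-component candidates. The tolerance and the mask formulation are assumptions for illustration.

```python
# Hypothetical sketch of modification (2): keep only spectrogram bins
# of SA near harmonics of the pitch estimated from SREF.
import numpy as np

def harmonic_candidate_mask(freqs, f0_track, tol_cents=50.0):
    """freqs: bin center frequencies (F,). f0_track: per-frame pitch of
    SREF in Hz (T,), 0 where unvoiced. Returns boolean mask (F, T)."""
    F, T = len(freqs), len(f0_track)
    mask = np.zeros((F, T), dtype=bool)
    ratio = 2.0 ** (tol_cents / 1200.0)      # +/- tolerance as a ratio
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:
            continue                          # no pitch: no voiced candidates
        for h in range(1, int(freqs[-1] // f0) + 1):
            lo, hi = h * f0 / ratio, h * f0 * ratio
            mask[(freqs >= lo) & (freqs <= hi), t] = True
    return mask

# candidates = np.abs(SA_spectrogram) * harmonic_candidate_mask(freqs, f0)
```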

(3) If the user sings poorly, the voiced sections QV and unvoiced sections QU may not coincide between the reference acoustic signal SREF and the acoustic signal SA. A configuration that adjusts the reference acoustic signal SREF before identifying the voiced sections QV and unvoiced sections QU is therefore preferable. For example, the voice analysis unit 32 (section identification unit 42) adjusts (aligns) the reference acoustic signal SREF on the time axis so that each point on its time axis coincides with the corresponding point in the target song, and then identifies the voiced sections QV and unvoiced sections QU (QU0). With this configuration, the singing sound of the target song can be separated with high accuracy even when the user sings poorly.
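Such an alignment might be realized with dynamic time warping over per-frame features, as in the following sketch; the feature representation and the plain DTW recursion are assumptions, since the patent only requires that corresponding time points be matched.

```python
# Hypothetical sketch of modification (3): DTW aligns per-frame features
# of SREF to those of SA before section boundaries are read off.
import numpy as np

def dtw_path(A, B):
    """A: (Ta, d), B: (Tb, d) feature sequences. Returns list of aligned
    frame index pairs (i, j) from a standard DTW on Euclidean cost."""
    Ta, Tb = len(A), len(B)
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                               D[i, j - 1],      # deletion
                                               D[i - 1, j - 1])  # match
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:                    # backtrace the optimal path
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0: i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else: j -= 1
    return path[::-1]
```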

(4) In each embodiment above, the separation of the voiced component V by the voiced separation unit 52 and the separation of the unvoiced component U by the unvoiced separation unit 54 were executed individually, but a configuration that separates both the voiced component V and the unvoiced component U from the acoustic signal SA at once may also be adopted; that is, the voiced separation unit 52 and the unvoiced separation unit 54 may be understood as a single element.

(5) In each embodiment above, the singing sound was extracted from the acoustic signal SA of the mixture of singing sound and accompaniment, but the accompaniment can also be extracted from the acoustic signal SA. For example, by subtracting the acoustic signal SB generated in each embodiment above from the acoustic signal SA, an acoustic signal in which the target song's accompaniment is separated (emphasized or extracted) can be generated.

(6) The sound processing apparatus 100 may also be realized by a server apparatus communicating with a terminal device such as a mobile phone. For example, the sound processing apparatus 100 generates the acoustic signal SB from the acoustic signal SA using a reference acoustic signal SREF received from the terminal device over a communication network, and transmits the acoustic signal SB to the terminal device. The acoustic signal SA to be processed is either a signal supplied from the signal supply device 12 connected to the sound processing apparatus 100 or a signal the sound processing apparatus 100 receives from the terminal device over the communication network.

DESCRIPTION OF REFERENCE SIGNS: 100: sound processing apparatus; 12: signal supply device; 14: sound emission device; 16: sound collection device; 22: processing unit; 24: storage device; 32: voice analysis unit; 34: signal processing unit; 36: learning processing unit; 42: section identification unit; 44: lyrics recognition unit; 52: voiced separation unit; 54: unvoiced separation unit; 56: synthesis processing unit.

Claims (6)

1. A sound processing apparatus comprising:
voice analysis means for identifying voiced sections and unvoiced sections of a reference acoustic signal representing a singing sound of a user singing a song;
voiced separation means for separating a voiced component of the singing sound, in the voiced sections identified by the voice analysis means, from an acoustic signal of a mixture of the singing sound of the song and an accompaniment sound;
unvoiced separation means for separating an unvoiced component of the singing sound from the acoustic signal in the unvoiced sections identified by the voice analysis means; and
synthesis processing means for combining the voiced component separated by the voiced separation means and the unvoiced component separated by the unvoiced separation means.

2. The sound processing apparatus according to claim 1, wherein the voice analysis means includes:
section identification means for identifying the voiced sections and the unvoiced sections of the reference acoustic signal; and
lyrics recognition means for identifying, by lyrics recognition applied to the reference acoustic signal, consonant sections among the unvoiced sections that correspond to consonants of the singing sound,
and wherein the unvoiced separation means separates the unvoiced component of the singing sound from the acoustic signal in the consonant sections identified by the lyrics recognition means.

3. The sound processing apparatus according to claim 2, wherein the lyrics recognition means identifies a consonant of the reference acoustic signal by the lyrics recognition, and the unvoiced separation means separates the unvoiced component by supervised non-negative matrix factorization using, as teacher information, the basis matrix of the identified consonant selected from a plurality of basis matrices representing frequency characteristics of different consonants.

4. The sound processing apparatus according to claim 3, further comprising learning processing means for generating the basis matrix by a learning process using the unvoiced sections of the reference acoustic signal identified by the section identification means.

5. The sound processing apparatus according to claim 3, comprising learning processing means for generating the basis matrix by a learning process using the unvoiced sections of the reference acoustic signal identified by the section identification means, wherein the unvoiced separation means separates the unvoiced component by supervised non-negative matrix factorization using the basis matrix generated by the learning processing means as teacher information.

6. A sound processing method, wherein a computer:
identifies voiced sections and unvoiced sections of a reference acoustic signal representing a singing sound of a user singing a song;
separates a voiced component of the singing sound, in the voiced sections, from an acoustic signal of a mixture of the singing sound of the song and an accompaniment sound;
separates an unvoiced component of the singing sound from the acoustic signal in the unvoiced sections; and
combines the voiced component and the unvoiced component.
JP2013216141A 2013-10-17 2013-10-17 Sound processing apparatus and sound processing method Expired - Fee Related JP6299140B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2013216141A JP6299140B2 (en) 2013-10-17 2013-10-17 Sound processing apparatus and sound processing method


Publications (2)

Publication Number Publication Date
JP2015079122A JP2015079122A (en) 2015-04-23
JP6299140B2 (en) 2018-03-28

Family

ID=53010582

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2013216141A Expired - Fee Related JP6299140B2 (en) 2013-10-17 2013-10-17 Sound processing apparatus and sound processing method

Country Status (1)

Country Link
JP (1) JP6299140B2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6766374B2 (en) * 2016-02-26 2020-10-14 沖電気工業株式会社 Classification device, classification method, program, and parameter generator
CN113393857A (en) * 2021-06-10 2021-09-14 腾讯音乐娱乐科技(深圳)有限公司 Method, device and medium for eliminating human voice of music signal

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7567900B2 (en) * 2003-06-11 2009-07-28 Panasonic Corporation Harmonic structure based acoustic speech interval detection method and device
JP5046211B2 (en) * 2008-02-05 2012-10-10 独立行政法人産業技術総合研究所 System and method for automatically associating music acoustic signal and lyrics with time
JP5662276B2 (en) * 2011-08-05 2015-01-28 株式会社東芝 Acoustic signal processing apparatus and acoustic signal processing method

Also Published As

Publication number Publication date
JP2015079122A (en) 2015-04-23


Legal Events

RD04  Notification of resignation of power of attorney (JAPANESE INTERMEDIATE CODE: A7424); effective date: 2015-04-10
A621  Written request for application examination (JAPANESE INTERMEDIATE CODE: A621); effective date: 2016-08-23
A977  Report on retrieval (JAPANESE INTERMEDIATE CODE: A971007); effective date: 2017-07-18
A131  Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131); effective date: 2017-07-25
A521  Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523); effective date: 2017-08-15
TRDD  Decision of grant or rejection written
A01   Written decision to grant a patent or to grant a registration (utility model) (JAPANESE INTERMEDIATE CODE: A01); effective date: 2018-01-30
A61   First payment of annual fees (during grant procedure) (JAPANESE INTERMEDIATE CODE: A61); effective date: 2018-02-12
R151  Written notification of patent or utility model registration (ref document number: 6299140; country: JP; JAPANESE INTERMEDIATE CODE: R151)
LAPS  Cancellation because of no payment of annual fees