JP2008134606A

JP2008134606A - Automatic system and method for temporal alignment of music audio signal with lyric

Info

Publication number: JP2008134606A
Application number: JP2007233682A
Authority: JP
Inventors: Hiromasa Fujiwara (藤原 弘将); Hiroshi Okuno (奥乃 博); Masataka Goto (後藤 真孝)
Original assignee: National Institute of Advanced Industrial Science and Technology AIST; Kyoto University NUC
Current assignee: National Institute of Advanced Industrial Science and Technology AIST; Kyoto University NUC
Priority date: 2006-10-24
Filing date: 2007-09-10
Publication date: 2008-06-12
Anticipated expiration: 2027-09-10
Also published as: JP5131904B2

Abstract

PROBLEM TO BE SOLVED: To provide an automatic system for temporal alignment between a music audio signal and lyrics, capable of preventing the accuracy of temporal alignment from being lowered by the influence of non-vocal sections.

SOLUTION: An alignment means 17 includes a phone model 15 for singing voice that estimates phonemes corresponding to temporal-alignment features. The alignment means 17 receives temporal-alignment features output from a temporal-alignment feature extraction means 11, information on the vocal and non-vocal sections output from a vocal section estimation means 9, and a phoneme network SN, and performs an alignment operation on the condition that no phoneme exists at least in non-vocal sections.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a system and a method for automatically performing temporal alignment between lyrics and the music audio signal of a musical piece that contains a singing voice and accompaniment sounds, and to a program used in the system.

When digital music data (a music audio signal) recorded on a recording medium such as a compact disc (CD) is played back, and in particular when that data consists of a human voice (for example, a singing voice) and sounds other than the human voice (for example, accompaniment sounds), techniques for visually displaying the uttered content (that is, the lyrics) in temporal synchronization with the accompaniment are widely used, for example in karaoke machines.

In a conventional karaoke machine, however, the accompaniment and the singer's voice are not accurately synchronized; the lyrics are merely displayed on the screen one after another at the tempo specified in the score. As a result, the actual timing of the utterance often deviates considerably from what is shown on the screen. Moreover, synchronizing the accompaniment with the singing voice is done by hand and requires considerable human labor.

Techniques for analyzing the content of human utterances, typified by speech recognition, are also known. Such a technique recognizes the uttered content (lyrics) from digital music data of a singing voice without accompaniment (referred to as "solo singing"), and several research results have been reported. However, it is extremely difficult to apply speech recognition that takes no account of accompaniment directly to digital music data on commercially available compact discs (CDs) or distributed over telecommunication lines such as the Internet.

As research on singing with accompaniment, a technique is known that learns the duration of each phoneme and allocates the singing voice to a plurality of sections (see Non-Patent Document 1 below). This technique uses higher-level information such as beat tracking and chorus detection. However, it takes no account of phonological features (for example, vowels and consonants), so its accuracy is not very high. In addition, because it imposes strong constraints on meter, tempo, and the like, it cannot be applied to many kinds of music.

Japanese Patent Laid-Open No. 2001-117582 (Patent Document 1) discloses a technique, for a karaoke machine, that uses an alignment means to associate the phoneme sequence of a singer's (user's) singing voice with the phoneme sequence of a particular singer's singing voice. However, this publication discloses nothing about temporally aligning a music audio signal with lyrics.

Japanese Patent Laid-Open No. 2001-125562 (Patent Document 2) discloses a technique that extracts a dominant sound audio signal from the music audio signal of a mixture containing a singing voice and accompaniment sounds by estimating, at each time, the pitch of the most dominant sound including the singing voice. With this technique, a dominant sound audio signal in which the accompaniment is suppressed can be extracted from the music audio signal.

The technique for suppressing accompaniment sounds shown in Patent Document 2 is also disclosed in a paper by Hiromasa Fujiwara, Hiroshi Okuno, Masataka Goto, et al. entitled "伴奏音抑制と高信頼度フレーム選択に基づく楽曲の歌手名同定手法" (a singer identification method for musical pieces based on accompaniment sound suppression and reliable frame selection) [Journal of the Information Processing Society of Japan, Vol. 47, No. 6 (published: 2006.6)] (Non-Patent Document 2). This paper also discloses a technique for detecting vocal and non-vocal sections from the dominant sound audio signal using two Gaussian mixture models (GMMs), one trained on vocal sounds and one on non-vocal sounds. The paper further discloses the use of LPC mel-cepstral coefficients as features of the singing voice.

Non-Patent Document 1: Ye Wang, et al., "LyricAlly: Automatic Synchronization of Acoustic Musical Signals and Textual Lyrics," Proceedings of the 12th ACM International Conference on Multimedia, October 10-15, 2004.
Non-Patent Document 2: Hiromasa Fujiwara, Hiroshi Okuno, Masataka Goto, et al., "伴奏音抑制と高信頼度フレーム選択に基づく楽曲の歌手名同定手法," Journal of the Information Processing Society of Japan, Vol. 47, No. 6 (published: 2006.6).
Patent Document 1: JP 2001-117582 A
Patent Document 2: JP 2001-125562 A

To display lyrics in faithful synchronization with the accompaniment, starting from lyric information and a music audio signal composed of a human voice (for example, a singing voice) and other sounds (for example, accompaniment sounds), it is necessary to obtain lyrics with time information, that is, lyrics accompanied by information indicating how many seconds after the start of the performance each lyric is uttered (referred to in this specification as "time tag information").

The lyrics themselves are easy to obtain as text data (digital information in text form). There is a strong demand for a technique that fully automates, with practically usable accuracy, the generation of "lyrics with time tags" from this "lyric text data" and "a music audio signal (digital music data) containing a singing voice that utters those lyrics".

Speech recognition is a useful technique for temporally aligning lyrics with a music audio signal that contains accompaniment. However, the inventors' research has shown that sections in which no singing voice is present at all (referred to in this specification as "non-utterance sections" or "non-vocal sections") drastically lower the accuracy of the temporal alignment.

An object of the present invention is to provide a system and a method for automatically performing temporal alignment between a music audio signal and lyrics that can keep the accuracy of the alignment from being lowered by the influence of non-vocal sections, and a program for use in the system.

The system of the present invention for automatically performing temporal alignment between a music audio signal and lyrics comprises a dominant sound audio signal extraction means, a vocal-section feature extraction means, a vocal section estimation means, a temporal-alignment feature extraction means, a phoneme network storage means, and an alignment means.

The dominant sound audio signal extraction means extracts, from the music audio signal of a musical piece containing a singing voice and accompaniment sounds, the dominant sound audio signal of the most dominant sound including the singing voice at each time (for example, every 10 msec). This extraction technique is the same as that used in Patent Document 2 and Non-Patent Document 2 mentioned above.

The vocal-section feature extraction means extracts, from the dominant sound audio signal at each time (for example, every 10 msec), vocal-section features that can be used to estimate vocal sections, which contain the singing voice, and non-vocal sections, which do not. In a specific embodiment, the vocal-section features are 13-dimensional. More specifically, LPC mel-cepstral coefficients and ΔF0, the derivative of the fundamental frequency F0, can be used as spectral features for discriminating between the vocal state and the non-vocal state.

The vocal section estimation means estimates the vocal and non-vocal sections on the basis of a plurality of vocal-section features and outputs information on the vocal and non-vocal sections.

The temporal-alignment feature extraction means extracts, from the dominant sound audio signal at each time, temporal-alignment features suited to establishing the temporal correspondence between the lyrics of the singing voice and the dominant sound audio signal. In a specific embodiment, 25-dimensional features, such as the resonance characteristics of phonemes, are extracted as the temporal-alignment features.

The results extracted by the vocal-section feature extraction means and the temporal-alignment feature extraction means may be kept in a storage unit provided in each means, holding at least one song's worth of data, for use in later processing.

The phoneme network storage means stores a phoneme network composed of a plurality of phonemes and short pauses for the lyrics of the musical piece corresponding to the music audio signal. Such a phoneme network is obtained, for example, by converting the lyrics into a phoneme sequence and then converting each phrase boundary into a plurality of short pauses and each word boundary into a single short pause. For Japanese lyrics, the network is preferably built from a phoneme sequence consisting only of vowels and short pauses. For English lyrics, it is preferably built from a phoneme sequence consisting only of English phonemes and short pauses.
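The conversion described above can be illustrated with a minimal Python sketch. It assumes the Japanese lyrics are already romanized and split into phrases and words; the function name `lyrics_to_phoneme_sequence`, the pause symbol `sp`, and the choice of two short pauses per phrase boundary are hypothetical choices for illustration, not values given in this specification.

```python
VOWELS = set("aiueo")
SHORT_PAUSE = "sp"  # hypothetical symbol for a short pause


def lyrics_to_phoneme_sequence(phrases, pauses_per_phrase_boundary=2):
    """Flatten romanized lyrics into a sequence of vowels and short pauses.

    `phrases` is a list of phrases, each phrase being a list of romanized words.
    The number of pauses per phrase boundary is an assumed value; the text only
    says "a plurality".
    """
    sequence = []
    for phrase_index, words in enumerate(phrases):
        if phrase_index > 0:
            # Phrase boundary: several short pauses.
            sequence.extend([SHORT_PAUSE] * pauses_per_phrase_boundary)
        for word_index, word in enumerate(words):
            if word_index > 0:
                # Word boundary: exactly one short pause.
                sequence.append(SHORT_PAUSE)
            # Keep only the vowels of the word.
            sequence.extend(ch for ch in word.lower() if ch in VOWELS)
    return sequence


if __name__ == "__main__":
    print(lyrics_to_phoneme_sequence([["tachi", "domaru"], ["toki"]]))
```

A real phoneme network would additionally let each pause be skipped or lengthened during alignment; this sketch only produces the flat phoneme string from which such a network could be built.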

The alignment means includes a phone model for singing voice that estimates, from the temporal-alignment features, the phonemes corresponding to those features. The alignment means performs an alignment operation that temporally aligns the plurality of phonemes in the phoneme network with the dominant sound audio signal. Specifically, the alignment means receives as inputs the temporal-alignment features output from the temporal-alignment feature extraction means, the information on the vocal and non-vocal sections, and the phoneme network, and, using the phone model for singing voice, performs the alignment on the condition that no phoneme exists at least in the non-vocal sections, thereby automatically aligning the music audio signal with the lyrics in time.

According to the present invention, features suited to estimating vocal and non-vocal sections (vocal-section features) and features suited to temporal alignment with the lyrics (temporal-alignment features) are extracted separately from the dominant sound audio signal, so both the accuracy of vocal and non-vocal section estimation and the accuracy of temporal alignment can be increased. In particular, the alignment means uses not a phone model for speech but a phone model for singing voice that estimates the phonemes corresponding to the temporal-alignment features, so phonemes can be estimated more accurately, taking into account the characteristics of singing voice that differ from speech. Furthermore, because the alignment means performs the alignment operation on the condition that no phoneme exists at least in the non-vocal sections, the phonemes in the phoneme network can be temporally aligned with the dominant sound audio signal at each time with the influence of non-vocal sections eliminated as far as possible. The output of the alignment means can therefore be used to obtain, automatically, time-tagged lyric data synchronized with the music audio signal.

The vocal section estimation means may have any configuration as long as its estimation accuracy is high. For example, the vocal section estimation means may be provided with a Gaussian distribution storage means that stores a plurality of Gaussian mixture models (GMMs) for vocal and non-vocal sounds obtained in advance by training on a plurality of training songs. The vocal section estimation means can then be configured to estimate the vocal and non-vocal sections on the basis of the plurality of vocal-section features obtained from the music audio signal and the plurality of GMMs. Estimating the vocal and non-vocal sections from GMMs obtained by prior training in this way allows the sections to be estimated with high accuracy, which in turn increases the alignment accuracy of the alignment means.

Such a vocal section estimation means can be composed of a log likelihood calculation means, a log likelihood difference calculation means, a histogram creation means, a bias adjustment value determination means, an estimation parameter determination means, a weighting means, and a maximum likelihood path calculation means.

The log likelihood calculation means calculates the vocal log likelihood and the non-vocal log likelihood at each time on the basis of the vocal-section features at each time from the beginning to the end of the music audio signal and the GMMs stored in advance. The log likelihood difference calculation means calculates the difference between the vocal log likelihood and the non-vocal log likelihood at each time. In preprocessing prior to estimation, the histogram creation means creates a histogram of the log likelihood differences obtained over the entire duration of the dominant sound audio signal. The bias adjustment value determination means determines the threshold that maximizes the between-class variance when the histogram is split into two song-dependent classes, the log likelihood differences in vocal sections and those in non-vocal sections, and sets this threshold as the song-dependent bias adjustment value. To correct the bias adjustment value (to increase the alignment accuracy or to widen the vocal sections), the estimation parameter determination means adds a task-dependent value to the bias adjustment value to determine the estimation parameter used in estimating the vocal sections. The weighting means weights the vocal log likelihood and the non-vocal log likelihood at each time using the estimation parameter. The log likelihoods used here may be those obtained during preprocessing or may, of course, be calculated anew. If the preprocessing results are reused, the log likelihood calculation means need only be given a storage function.

The maximum likelihood path calculation means regards the weighted vocal log likelihoods and the weighted non-vocal log likelihoods obtained over the entire duration of the music audio signal as the output probabilities of the vocal state (s_V) and the non-vocal state (s_N), respectively, of a hidden Markov model. It then calculates the maximum likelihood path through the vocal and non-vocal states over the entire duration of the music audio signal and determines from that path the information on the vocal and non-vocal sections. The log likelihood difference calculation means, histogram creation means, bias adjustment value determination means, and estimation parameter determination means are applied to the music audio signal in preprocessing, before the system estimates the vocal sections. Weighting the vocal and non-vocal log likelihoods at each time with the estimation parameter obtained in preprocessing allows the boundaries between vocal and non-vocal sections to be adjusted appropriately in the subsequent maximum likelihood path calculation. During the estimation operation, the vocal and non-vocal log likelihoods that the log likelihood calculation means computes from the vocal-section features output at each time by the vocal-section feature extraction means are weighted directly, and the maximum likelihood path is calculated.

Determining the bias adjustment value (threshold) for the vocal and non-vocal log likelihoods from the histogram of log likelihood differences in this preprocessing yields a bias adjustment value matched to the music audio signal. This bias adjustment value (threshold) determines the boundary between the vocal state and the non-vocal state. Weighting with the estimation parameter derived from the bias adjustment value adjusts the vocal and non-vocal log likelihoods around that boundary in accordance with the tendency of the vocal-section features, which varies with the acoustic characteristics of each song's music audio signal, so that the boundaries between vocal and non-vocal sections can be set appropriately for each individual song.
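The preprocessing described above (histogram of log likelihood differences, a threshold that maximizes the between-class variance, then addition of a task-dependent value) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the specification: the bin count, the function names `otsu_threshold` and `estimation_parameter`, and the default task-dependent value are all hypothetical.

```python
def otsu_threshold(values, num_bins=64):
    """Return the histogram threshold that maximizes the between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        counts[min(int((v - lo) / width), num_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(num_bins)]
    total = len(values)
    total_sum = sum(c * m for c, m in zip(counts, centers))

    best_threshold, best_variance = lo, -1.0
    w0 = 0       # number of samples in the lower class
    sum0 = 0.0   # sum of bin centers, weighted by count, in the lower class
    for i in range(num_bins - 1):
        w0 += counts[i]
        sum0 += counts[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        # Proportional to the between-class variance for this split.
        variance = w0 * w1 * (mean0 - mean1) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = lo + (i + 1) * width  # upper edge of bin i
    return best_threshold


def estimation_parameter(log_likelihood_diffs, task_dependent_value=0.0):
    """Song-dependent bias (threshold) plus an assumed task-dependent value."""
    return otsu_threshold(log_likelihood_diffs) + task_dependent_value


if __name__ == "__main__":
    # Toy log likelihood differences: non-vocal frames near -4.5, vocal near +4.5.
    diffs = [-5.0] * 50 + [-4.0] * 50 + [4.0] * 50 + [5.0] * 50
    print(otsu_threshold(diffs), estimation_parameter(diffs, 1.5))
```

In the toy data the returned threshold falls between the two clusters, which is the behavior the bias adjustment value is meant to provide for each song.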

The maximum likelihood path calculation means can calculate the maximum likelihood path as follows. The output log probability log p(x | s_V) of the vocal state (s_V) and the output log probability log p(x | s_N) of the non-vocal state (s_N) are approximated by the following equations.
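The equations themselves are not reproduced in this text. One form consistent with the symbols defined in the surrounding description, assuming that the estimation parameter η acts as a threshold on the log likelihood difference and is split symmetrically between the two states, would be:

```latex
\begin{align}
\log p(x \mid s_V) &= \log N_{\mathrm{GMM}}(x;\theta_V) - \tfrac{1}{2}\eta \\
\log p(x \mid s_N) &= \log N_{\mathrm{GMM}}(x;\theta_N) + \tfrac{1}{2}\eta
\end{align}
```

Under this assumed form, a frame favors the vocal state exactly when log N_GMM(x; θ_V) − log N_GMM(x; θ_N) exceeds η, which matches the role of the bias adjustment value as a threshold on the log likelihood difference.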

In the above equations, N_GMM(x; θ_V) denotes the probability density function of the Gaussian mixture model (GMM) for the singing voice, and N_GMM(x; θ_N) denotes the probability density function of the GMM for non-vocal sounds. θ_V and θ_N are parameters determined in advance by training on a plurality of training songs, and η is the estimation parameter. The maximum likelihood path can be calculated using the following equation.
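The path equation is likewise not reproduced in this text. A form consistent with the output and transition probabilities named in the description, writing the state sequence as S = {s_1, ..., s_t, ...} (a notational assumption), would be:

```latex
\hat{S} = \operatorname*{argmax}_{S} \sum_{t} \left\{ \log p(x \mid s_t) + \log p(s_{t+1} \mid s_t) \right\}
```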

In the above equation, p(x | s_t) denotes the output probability of state s_t, and p(s_{t+1} | s_t) denotes the transition probability from state s_t to state s_{t+1}.

Calculating the maximum likelihood path with the above equation yields more accurate information on the vocal and non-vocal sections over the entire duration of the music audio signal.
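The maximum likelihood path over the vocal and non-vocal states can be found with the Viterbi algorithm. The sketch below is illustrative only: it assumes the estimation parameter η is applied as −η/2 to the vocal log likelihood and +η/2 to the non-vocal log likelihood, and it uses a single hypothetical self-transition probability `stay_prob`, since this text gives no transition probabilities.

```python
import math


def vocal_sections(log_lik_vocal, log_lik_nonvocal, eta, stay_prob=0.9):
    """Label each frame 'V' (vocal) or 'N' (non-vocal) with a two-state Viterbi search."""
    n = len(log_lik_vocal)
    if n == 0:
        return []
    states = ("V", "N")
    # Weighted output log probabilities (the eta/2 split is an assumption).
    emit = {
        "V": [lv - 0.5 * eta for lv in log_lik_vocal],
        "N": [ln + 0.5 * eta for ln in log_lik_nonvocal],
    }
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    def trans(prev, cur):
        return log_stay if prev == cur else log_switch

    score = {s: emit[s][0] for s in states}
    backpointers = []
    for t in range(1, n):
        new_score, pointer = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: score[r] + trans(r, s))
            new_score[s] = score[best_prev] + trans(best_prev, s) + emit[s][t]
            pointer[s] = best_prev
        score = new_score
        backpointers.append(pointer)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    lv = [0.0, 0.0, 0.0, -5.0, -5.0, -5.0]
    ln = [-5.0, -5.0, -5.0, 0.0, 0.0, 0.0]
    print("".join(vocal_sections(lv, ln, eta=0.0)))
```

Because transitions are penalized, an isolated frame whose likelihoods briefly favor the other state tends to be absorbed into the surrounding section, and raising η shifts frames toward the non-vocal state.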

The alignment means can be configured to perform the alignment operation using Viterbi alignment. "Viterbi alignment" is known in the field of speech recognition; it is an optimal-solution search method that uses the Viterbi algorithm to find the maximum likelihood path between an audio signal and a grammar (the phoneme sequence for alignment). In performing Viterbi alignment, the "condition that no phoneme exists in the non-vocal sections" is implemented by requiring that at least the non-vocal sections be short pauses. In a short pause, the alignment operation is performed with the likelihood of all other phonemes set to zero. Because the likelihoods of the other phonemes are zero in the short-pause sections, the vocal section information can be exploited and the alignment can be performed with high accuracy.
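The constraint described above can be sketched as a small forced-alignment routine in Python. This is a simplified illustration, not the alignment means of the specification: it assumes a strictly left-to-right phoneme sequence with one state per phoneme, per-frame log likelihoods supplied as dictionaries, and no transition probabilities; the function name `align` and the symbol `sp` are hypothetical.

```python
NEG_INF = float("-inf")


def align(phonemes, frame_loglik, is_vocal, short_pause="sp"):
    """Return, for each frame, the index in `phonemes` that the frame is aligned to.

    frame_loglik[t][ph] is the log likelihood of phoneme `ph` at frame t.
    In non-vocal frames every phoneme except the short pause gets likelihood
    zero (log likelihood -inf), which forces those frames onto a short pause.
    """
    n_frames, n_units = len(frame_loglik), len(phonemes)

    def emit(t, j):
        ph = phonemes[j]
        if not is_vocal[t] and ph != short_pause:
            return NEG_INF
        return frame_loglik[t].get(ph, NEG_INF)

    score = [[NEG_INF] * n_units for _ in range(n_frames)]
    back = [[0] * n_units for _ in range(n_frames)]
    score[0][0] = emit(0, 0)  # the path must start at the first phoneme
    for t in range(1, n_frames):
        for j in range(n_units):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG_INF
            if move > stay:
                best, back[t][j] = move, j - 1
            else:
                best, back[t][j] = stay, j
            score[t][j] = best + emit(t, j)

    if score[-1][-1] == NEG_INF:
        raise ValueError("no alignment satisfies the constraints")

    # Trace back from the last phoneme at the last frame.
    path = [n_units - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    phonemes = ["a", "sp", "i"]
    is_vocal = [True, True, False, False, True, True]
    frame_loglik = [
        {"a": -1, "sp": -5, "i": -5},
        {"a": -1, "sp": -5, "i": -5},
        {"a": -1, "sp": -2, "i": -5},  # non-vocal frame where acoustics prefer "a"
        {"a": -5, "sp": -2, "i": -1},  # non-vocal frame where acoustics prefer "i"
        {"a": -5, "sp": -5, "i": -1},
        {"a": -5, "sp": -5, "i": -1},
    ]
    print(align(phonemes, frame_loglik, is_vocal))
```

In the toy input the two non-vocal frames are assigned to the short pause even though their acoustic scores favor a vowel, which is the effect the vocal section information is meant to have on the alignment.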

As the phone model for singing voice, a phone model can be used that is obtained by re-estimating (training) the parameters of a phone model for speech so that it can recognize the phonemes of the singing voice in a musical piece containing a singing voice and accompaniment sounds. Ideally, since the alignment is performed on the uttered content (lyrics) of a singing voice, the phone model for singing voice would be trained on a large amount of singing voice data. At present, however, no such database has been built. Using instead a phone model obtained by re-estimating the parameters of a phone model for speech in this way allows the phonemes of the singing voice to be recognized more accurately than when the phone model for speech is used as it is.

One such phone model for singing voice is a phone model for solo singing. It is obtained by re-estimating the parameters of the phone model for speech, using adaptation music audio signals of solo singing that contain only a singing voice together with adaptation phoneme labels for those signals, so that the phonemes of the singing voice can be recognized from the adaptation music audio signals. This phone model is suitable when there is no accompaniment or when the accompaniment is quiet relative to the singing voice.

Another phone model for singing voice is a phone model for segregated vocals. It is obtained by re-estimating the parameters of the phone model for solo singing described above, using dominant sound audio signals of the most dominant sound including the singing voice, extracted from adaptation music audio signals that contain accompaniment in addition to the singing voice, together with adaptation phoneme labels for those dominant sound audio signals, so that phonemes can be recognized from the dominant sound audio signals. This phone model is suitable when the accompaniment is as loud as the singing voice.

Yet another phone model for singing voice is a phone model for a specific singer. It is obtained by estimating the parameters of the phone model for segregated vocals described above, using the plurality of temporal-alignment features stored in the temporal-alignment feature storage means and the phoneme network stored in the phoneme network storage means, so that the phonemes of the specific singer who sings the musical piece of the music audio signal input to the dominant sound audio signal extraction means can be recognized. Because this phone model is specific to the singer, it gives the highest alignment accuracy.

When the system of the present invention is used in a music acoustic signal reproducing apparatus that reproduces a music acoustic signal while displaying, on a display screen, lyrics temporally associated with that signal, the lyrics shown on the screen can be displayed in synchronization with the music being reproduced.

In the method of the present invention for automatically performing temporal association between a music acoustic signal and lyrics, the association is performed as follows. First, dominant sound acoustic signal extraction means extracts, from a music acoustic signal of music including a singing voice and accompaniment sound, a dominant sound acoustic signal of the most dominant sound including the singing voice at each time (dominant sound acoustic signal extraction step). Next, singing voice section estimation feature quantity extraction means extracts, from the dominant sound acoustic signal at each time, singing voice section estimation feature quantities that can be used to estimate singing voice sections, which contain a singing voice, and non-singing voice sections, which do not (singing voice section estimation feature quantity extraction step). Singing voice section estimation means then estimates the singing voice sections and non-singing voice sections on the basis of the plurality of singing voice section estimation feature quantities and outputs information on those sections (singing voice section estimation step). In addition, temporal association feature quantity extraction means extracts, from the dominant sound acoustic signal at each time, temporal association feature quantities suitable for establishing the temporal correspondence between the lyrics of the singing voice and the music acoustic signal (temporal association feature quantity extraction step). Furthermore, a phoneme network, in which the plurality of phonemes of the lyrics of the music corresponding to the music acoustic signal are connected so that the time interval between any two adjacent phonemes is adjustable, is stored in phoneme network storage means (storage step). Alignment means, which includes a singing voice acoustic model that estimates the phoneme corresponding to a temporal association feature quantity on the basis of that feature quantity, then executes an alignment operation that temporally associates the plurality of phonemes in the phoneme network with the dominant sound acoustic signal (alignment step). In this alignment step, the alignment means receives as inputs the temporal association feature quantities obtained in the temporal association feature quantity extraction step, the information on the singing voice sections and non-singing voice sections, and the phoneme network, and executes the alignment operation using the singing voice acoustic model under the condition that no phoneme exists at least in the non-singing voice sections.

The present invention can also be specified as a program that, when a computer is used to perform temporal association between lyrics and a music acoustic signal of music including a singing voice and accompaniment sound, causes the computer to function as the dominant sound acoustic signal extraction means, the singing voice section estimation feature quantity extraction means, the singing voice section estimation means, the temporal association feature quantity extraction means, the phoneme network storage means, and the alignment means described above. This program may of course be recorded on a computer-readable recording medium.

The program according to the present invention for temporal association between a music acoustic signal and lyrics can be executed in a music acoustic signal reproducing apparatus that reproduces music digital data while displaying lyrics on a display screen. In this case, lyrics accompanied by time information are generated in advance and then displayed on the display screen. With the lyrics displayed, a portion of the displayed lyrics is selected with a pointer. The apparatus may be configured so that, on the basis of the time information corresponding to the selected portion of the lyrics, the music acoustic signal is reproduced from that portion. Lyrics accompanied by time information generated in advance by the system of the present invention may be stored in storage means such as a hard disk provided in the music acoustic signal reproducing apparatus, or may be stored on a server on a network. The lyrics accompanied by time information, whether stored in the storage means or acquired from the server on the network, may then be displayed on the display screen in synchronization with the reproduction of the music digital data by the music acoustic signal reproducing apparatus.

An example of an embodiment of the system and method of the present invention for automatically performing temporal association between a music acoustic signal and lyrics will be described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the function realizing means implemented in a computer when an embodiment of the system 1 for automatically performing temporal association between a music acoustic signal and lyrics is realized using a computer. FIG. 2 is a flowchart showing the steps performed when the embodiment of FIG. 1 is carried out by executing a program on a computer. The system 1 includes music acoustic signal storage means 3, dominant sound acoustic signal extraction means 5, singing voice section estimation feature quantity extraction means 7, singing voice section estimation means 9, temporal association feature quantity extraction means 11, phoneme network storage means 13, and alignment means 17 provided with a singing voice acoustic model 15.

As a basic approach to solving the above technical problem effectively, the present invention broadly executes the following three steps.

Step 1: Accompaniment sound suppression
Step 2: Singing voice section detection
Step 3: Alignment (temporal association)

To execute Step 1, the music acoustic signal storage means 3 stores music acoustic signals of a plurality of pieces of music, each including the target singing voice and accompaniment sound. Following the flowchart shown in FIG. 3, the dominant sound acoustic signal extraction means 5 extracts, from the music acoustic signal S1 of music including a singing voice and accompaniment sound, the dominant sound acoustic signal S2 of the most dominant sound including the singing voice at each time (specifically, every 10 msec). In this embodiment, the dominant sound acoustic signal can be regarded as a signal in which the accompaniment sound is suppressed. The technique for extracting the dominant sound acoustic signal is the same as that disclosed in Japanese Patent Laid-Open No. 2001-125562 (Patent Document 2) and Non-Patent Document 2. The music acoustic signal S1 of music including a singing voice and accompaniment sound has, for example, the waveform shown in FIG. 4(A), and the dominant sound acoustic signal S2 output by the dominant sound acoustic signal extraction means 5, in which the accompaniment sound is suppressed, has the waveform shown in FIG. 4(D). The method of extracting the dominant sound acoustic signal is described below.

To extract, from a music acoustic signal of music (a mixed sound) including a singing voice and accompaniment sound, the singing voice section estimation feature quantities and temporal association feature quantities described later (feature quantities representing the phonological characteristics of the melody (singing voice), and so on), it is first necessary to obtain a dominant sound acoustic signal in which the influence of the accompaniment sound is reduced. The dominant sound acoustic signal extraction means 5 therefore executes the following three steps shown in FIG. 3.

ST1: Estimate the fundamental frequency F0 of the melody (singing voice).

ST2: Extract the harmonic structure of the melody (singing voice) on the basis of the estimated fundamental frequency.

ST3: Resynthesize the extracted harmonic structure into the dominant sound acoustic signal.

Note that in sections such as interludes, the dominant sound acoustic signal may contain acoustic signals other than a singing voice (accompaniment sound or silence). For this reason, this embodiment speaks of "reduction" of the accompaniment sound rather than its "removal". Steps ST1 to ST3 are described below.

(ST1: F0 estimation)
Various methods are known for estimating the fundamental frequency of a melody (singing voice). For example, the fundamental frequency can be estimated with PreFEst, a pitch estimation method that does not assume the number of sound sources (see, for example, Masataka Goto, "Melody and bass pitch estimation for music acoustic signals", IEICE Transactions D-II, Vol. J84-D-II, No. 1, pp. 12-22, January 2001). PreFEst is known as a method for estimating the fundamental frequency F0 of the melody and the bass. Within a limited frequency band, it estimates the fundamental frequency F0 of the dominant sound, that is, the sound having the most dominant harmonic structure (the loudest sound) at each time. In PreFEst, a probability distribution representing the shape of a harmonic structure is prepared for every pitch (fundamental frequency), and the input frequency components are modeled as a mixture (a weighted sum) of these distributions.

In the middle and high frequency bands, the melody (singing voice) often has the most dominant harmonic structure at each time. The fundamental frequency F0 of the melody (singing voice) can therefore be estimated by limiting the frequency band appropriately. An outline of PreFEst is given below. In the following description, x is a frequency on the logarithmic frequency axis expressed in units of cent, and (t) denotes time. The cent is originally a measure of pitch difference (interval), but in this specification it is used as a unit of absolute pitch, with 440 × 2^{(3/12)−5} Hz as the reference, as in the following equation.
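The conversion between Hz and this absolute cent scale can be sketched as follows. This is a minimal illustration, assuming the usual 1200 cent per octave and the reference frequency 440 × 2^{(3/12)−5} Hz stated above; the function names `hz_to_cent` and `cent_to_hz` are hypothetical and do not come from the specification.

```python
import math

# Reference frequency stated in the text: 440 * 2^((3/12) - 5) Hz (about 16.35 Hz).
REF_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)


def hz_to_cent(f_hz: float) -> float:
    """Convert a frequency in Hz to absolute pitch in cent (assumes 1200 cent per octave)."""
    if f_hz <= 0.0:
        raise ValueError("frequency must be positive")
    return 1200.0 * math.log2(f_hz / REF_HZ)


def cent_to_hz(f_cent: float) -> float:
    """Inverse conversion: absolute pitch in cent back to Hz."""
    return REF_HZ * 2.0 ** (f_cent / 1200.0)
```

Under these assumptions 440 Hz maps to 5700 cent, and 4800 cent corresponds to roughly 261.6 Hz.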

A band-pass filter designed to pass most of the frequency components of the melody is applied to the power spectrum Ψ_p^(t)(x). For example, a filter that passes components at 4800 cent and above is preferable. The frequency components after passing through the filter are expressed as
BPF(x) · Ψ_p^(t)(x)
where BPF(x) is the frequency response of the filter. To enable the subsequent probabilistic processing, the filtered frequency components are expressed as a probability density function (PDF) as follows.
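A discrete sketch of this step: the filtered spectrum BPF(x) · Ψ_p^(t)(x) is divided by its total so that it can be treated as a probability distribution over frequency bins. The normalization by the sum is an assumption standing in for the integral of the continuous form, and the function name `filtered_pdf` is hypothetical.

```python
def filtered_pdf(power, bpf):
    """Weight a power spectrum by a filter response, bin by bin, and normalize to sum to 1.

    power: power spectrum values, one per frequency bin.
    bpf:   filter response values for the same bins.
    """
    if len(power) != len(bpf):
        raise ValueError("power and bpf must have the same length")
    weighted = [b * p for b, p in zip(bpf, power)]
    total = sum(weighted)
    if total <= 0.0:
        raise ValueError("no energy passes the filter")
    return [v / total for v in weighted]
```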

The probability density function (PDF) of the frequency components is then regarded as having been generated from a probability model consisting of a weighted sum of sound models (probability distributions) corresponding to all possible fundamental frequencies F0.

Here, p(x | F) is the sound model for each F0, Fh denotes the upper limit of the possible values of F0, and Fl denotes the lower limit. w^(t)(F) is the weight of the sound model, normalized so that the weights integrate to 1 over the range from Fl to Fh. A sound model is thus a probability distribution that represents a typical harmonic structure. The weight w^(t)(F) is estimated using the EM (Expectation Maximization) algorithm, and the estimated w^(t)(F) is interpreted as the probability density function (PDF) of the fundamental frequency F0. Finally, the trajectory of the dominant peak in w^(t)(F) is tracked using a multi-agent model to obtain the F0 sequence (F0 Estimation) of the melody (singing voice). FIG. 4 shows the F0 sequence (F0 Estimation) obtained in this way.
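The weight estimation can be illustrated with a plain EM iteration for mixture weights. The sketch below reads the model as a sum over F0 candidates of w(F) p(x | F) with fixed sound models, discretizes both x and F, and updates only the weights. It is a simplified illustration under those assumptions, not the procedure of the cited paper: any refinements of the actual method and the multi-agent peak tracking are omitted, and the name `em_weights` is hypothetical.

```python
def em_weights(observed_pdf, tone_models, n_iter=50):
    """Estimate mixture weights w(F) for fixed sound models p(x|F) by EM.

    observed_pdf: values over frequency bins x, summing to 1.
    tone_models:  one list per F0 candidate, each a distribution over the same bins.
    """
    n_f = len(tone_models)
    w = [1.0 / n_f] * n_f  # start from uniform weights
    for _ in range(n_iter):
        new_w = [0.0] * n_f
        for x, p_obs in enumerate(observed_pdf):
            mix = sum(w[f] * tone_models[f][x] for f in range(n_f))
            if mix <= 0.0:
                continue  # no sound model explains this bin
            for f in range(n_f):
                # responsibility of candidate f for bin x, weighted by the observed mass
                new_w[f] += p_obs * w[f] * tone_models[f][x] / mix
        total = sum(new_w)
        if total <= 0.0:
            raise ValueError("observed spectrum is not covered by any sound model")
        w = [v / total for v in new_w]
    return w
```

The F0 candidate with the largest estimated weight in a frame is then the natural starting point for tracking the dominant peak over time.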

(ST2: Harmonic structure extraction)
On the basis of the fundamental frequency F0 estimated in this way, the power of each overtone component of the harmonic structure of the melody is extracted. In extracting each frequency component, an error of r cent on either side is allowed, and the peak with the largest power within this range is extracted. The power A_l and frequency F_l of the l-th overtone (l = 1, ..., L) are expressed as follows.

Here, S(F) denotes the spectrum, and the symbol F with a bar above it denotes the fundamental frequency F0 estimated by PreFEst. In the inventors' experiments, the harmonic structure was extracted with r = 20, and the effect was confirmed as described later. FIG. 4(C) shows the extracted harmonic structure of each frequency component.
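The peak picking described above can be sketched as follows. The search range is taken here as ±r cent around l times the estimated F0, converted to a frequency ratio of 2^(r/1200); this reading of the tolerance, the spectrum representation as parallel lists, and the name `extract_harmonics` are assumptions made for illustration.

```python
def extract_harmonics(freqs_hz, powers, f0_hz, n_harmonics, r_cent=20.0):
    """For each overtone l = 1..n_harmonics, pick the largest-power bin within +/- r_cent of l * f0_hz.

    Returns a list of (F_l, A_l) pairs. An overtone with no bin of positive
    power in range yields (l * f0_hz, 0.0).
    """
    ratio = 2.0 ** (r_cent / 1200.0)  # r cent expressed as a frequency ratio
    result = []
    for l in range(1, n_harmonics + 1):
        center = l * f0_hz
        lo, hi = center / ratio, center * ratio
        best = (center, 0.0)
        for f, a in zip(freqs_hz, powers):
            if lo <= f <= hi and a > best[1]:
                best = (f, a)
        result.append(best)
    return result
```

With r = 20 the window around a 100 Hz component spans roughly 98.9 Hz to 101.2 Hz, so nearby accompaniment peaks outside that range are ignored.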

(ST3: Resynthesis)
The extracted harmonic structure is resynthesized on the basis of a sinusoidal superposition model to obtain the dominant sound acoustic signal of the most dominant sound including the singing voice at each time. Let F_l^(t) denote the frequency and A_l^(t) the amplitude of the l-th overtone at time t. The change in phase is approximated by a quadratic function so that the frequency changes linearly between frames (between time t and time t+1), and the change in amplitude between frames is approximated by a linear function. The resynthesized dominant sound acoustic signal s(k) is expressed as follows, where θ_l(k) is the phase of the l-th overtone at time k and s_l(k) is the waveform of the l-th overtone at time k.

Here, k denotes time (unit: seconds), with k = 0 at time t. K denotes the time difference between (t) and (t+1), that is, the frame shift in seconds.

θ_{l,0}^(t) denotes the initial value of the phase; in the first frame of the input signal, θ_{l,0}^(t) = 0. In subsequent frames, θ_{l,0}^(t) is given in terms of the frequency F_l^(t−1) of the l-th overtone in the previous frame and the initial phase θ_{l,0}^(t−1) of that frame.
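The resynthesis of one overtone over one frame can be sketched as below. The phase expression is derived here from the stated requirements (frequency linear in time, hence phase quadratic in time) and the next initial phase from phase continuity at the frame boundary; it is an illustrative derivation, not a transcription of the equations of the specification. The function name `synth_frame` and the `sample_rate` parameter are hypothetical.

```python
import math


def synth_frame(f_start, f_end, a_start, a_end, phase0, frame_shift, sample_rate):
    """Synthesize one frame of one overtone.

    Frequency and amplitude are interpolated linearly across the frame, so the
    phase is quadratic in time. Returns (samples, phase at the end of the frame);
    the returned phase serves as the initial phase of the next frame.
    """
    def phase(k):
        # integral of 2*pi*(f_start + (f_end - f_start) * k / frame_shift) from 0 to k
        return (math.pi * (f_end - f_start) * k * k / frame_shift
                + 2.0 * math.pi * f_start * k + phase0)

    n = int(round(frame_shift * sample_rate))
    samples = []
    for i in range(n):
        k = i / sample_rate  # time in seconds since the start of the frame
        amp = a_start + (a_end - a_start) * k / frame_shift
        samples.append(amp * math.sin(phase(k)))
    return samples, phase(frame_shift)
```

The full signal s(k) is then the sum of such waveforms over all overtones l = 1, ..., L, with each overtone carrying its own running phase from frame to frame.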

Returning to FIG. 1, the singing voice section estimation feature quantity extraction means 7 extracts, from the dominant sound acoustic signal at each time (specifically, every 10 msec), singing voice section estimation feature quantities that can be used to estimate singing voice sections, which contain a singing voice, and non-singing voice sections, which do not. In this embodiment, a 12-dimensional LPC mel cepstrum (LPMCC) and a one-dimensional derivative of the fundamental frequency F0 (ΔF0) are used as these feature quantities. That is, to discriminate between singing voice and non-singing voice, the singing voice section estimation feature quantity extraction means 7 extracts the following two kinds of feature quantities as singing voice section estimation feature quantities (spectral feature quantities).

・LPC mel cepstrum (LPMCC)
A 12-dimensional LPC mel cepstrum (LPMCC) is used as the first kind of spectral feature quantity. LPMCC is the set of mel cepstrum coefficients calculated from the LPC spectrum. The inventors' experiments confirmed that this feature quantity represents the characteristics of a singing voice better than mel-frequency cepstrum coefficients (MFCC). In this embodiment, the LPMCC is extracted by calculating mel-frequency cepstrum coefficients from the LPC spectrum.

・ΔF0_s
The derivative of the fundamental frequency F0 (ΔF0) is used as the second kind of spectral feature quantity. It helps to represent the dynamic nature of a singing voice. Compared with other sounds, a singing voice shows large temporal variation caused by vibrato and the like, so the derivative ΔF0, which represents the slope of the F0 trajectory, is considered well suited to discriminating between singing voice and non-singing voice.

ΔF0 was calculated using a regression coefficient over five frames, as in the following equation.

Here, f[t] is the frequency (unit: cent) at time t.
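A regression coefficient over five frames is commonly computed as the least-squares slope over the frames t−2 to t+2. The sketch below assumes that standard form; the function name `delta_f0` and the `half_width` parameter are hypothetical.

```python
def delta_f0(f, t, half_width=2):
    """Least-squares slope of the F0 track f (in cent) over frames t-half_width .. t+half_width."""
    if t - half_width < 0 or t + half_width >= len(f):
        raise IndexError("not enough frames around t")
    offsets = range(-half_width, half_width + 1)
    num = sum(k * f[t + k] for k in offsets)
    den = sum(k * k for k in offsets)
    return num / den
```

For a track that rises by a constant 10 cent per frame the function returns 10, and for a flat track it returns 0, which is the behaviour expected of a slope estimate.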

To execute Step 2 described above, the singing voice section estimation means 9 estimates the singing voice sections and non-singing voice sections on the basis of the plurality of singing voice section estimation feature quantities extracted at each time, and outputs information on those sections. The singing voice section estimation means 9 of this embodiment has the configuration shown in FIG. 5. As shown in FIG. 2, the singing voice section estimation means 9 of FIG. 5 includes Gaussian distribution storage means 91, which stores a plurality of Gaussian mixture distributions for singing voice and non-singing voice obtained in advance by training on a plurality of training songs 8. Over the entire period of the music acoustic signal S1 of one piece of music, the singing voice section estimation means 9 estimates the singing voice sections and non-singing voice sections on the basis of the plurality of singing voice section estimation feature quantities and the plurality of Gaussian mixture distributions, and outputs that information. To this end, the singing voice section estimation means 9 further includes log likelihood calculation means 92, log likelihood difference calculation means 93, histogram creation means 94, bias adjustment value determination means 95, estimation parameter determination means 96, weighting means 97, and maximum likelihood path calculation means 98. The log likelihood difference calculation means 93, the histogram creation means 94, the bias adjustment value determination means 95, and the estimation parameter determination means 96 are used in the preprocessing performed before the singing voice sections are estimated. FIG. 6 is a flowchart for the case where the singing voice section estimation means 9 shown in FIG. 5 is implemented on a computer by a program. FIG. 7 is a flowchart for implementing the detection of singing voice sections by a program, and corresponds to the details of steps ST11 and ST16 of FIG. 6.

At each time throughout the period from the beginning to the end of the music acoustic signal S1, the log likelihood calculation means 92 calculates the singing voice log likelihood and the non-singing voice log likelihood on the basis of the singing voice section estimation feature quantities extracted by the singing voice section estimation feature quantity extraction means 7 (step ST11) and the Gaussian mixture distributions stored beforehand in the Gaussian distribution storage means 91 by the preprocessing.

The log likelihood difference calculation means 93 then calculates the difference between the singing voice log likelihood and the non-singing voice log likelihood at each time (step ST12). For the singing voice section estimation feature quantities (feature vector sequence) extracted from the input music acoustic signal, the log likelihood difference l(x) between the singing voice log likelihood and the non-singing voice log likelihood is calculated as in the following equation.
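A form consistent with this description, written with the notation N_GMM(x; θ) for the GMM probability density function and θ_V, θ_N for the singing voice and non-singing voice GMM parameters that the description introduces further on, would be the following. It is a reconstruction from the surrounding definitions and is offered as an assumption rather than as the original equation.

```latex
l(x) = \log \mathcal{N}_{\mathrm{GMM}}(x;\, \theta_V) - \log \mathcal{N}_{\mathrm{GMM}}(x;\, \theta_N)
```

A positive l(x) then indicates that the frame is better explained by the singing voice GMM than by the non-singing voice GMM.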

The first function in the above equation is the singing voice log likelihood and the second is the non-singing voice log likelihood. The histogram creation means 94 creates a histogram of the plurality of log likelihood differences obtained from the dominant sound acoustic signal extracted over the entire period of the music acoustic signal (step ST13). FIG. 6 shows an example of a histogram created by the histogram creation means 94.

The bias adjustment value determination means 95 then determines the threshold that maximizes the between-class variance when the histogram is divided into two music-dependent classes, namely the class of log likelihood differences in singing voice sections and the class of log likelihood differences in non-singing voice sections, and sets this threshold as the music-dependent bias adjustment value η_dyn. (step ST14). FIG. 6 also shows this threshold. To correct the bias adjustment value η_dyn. (to raise the alignment accuracy or to widen the singing voice sections), the estimation parameter determination means 96 adds a task-dependent value η_fixed to η_dyn. and thereby determines the estimation parameter η (= η_dyn. + η_fixed) used in estimating the singing voice sections (step ST15). Because the likelihood of a Gaussian mixture distribution (GMM) is biased differently for each piece of music, it is difficult to fix a single estimation parameter η that is appropriate for all music. In this embodiment, therefore, the estimation parameter η is divided into the bias adjustment value η_dyn. and the task-dependent value η_fixed. The task-dependent value η_fixed is set manually in advance, taking the type of music and the like into account. The bias adjustment value η_dyn., on the other hand, may be set automatically for each piece of music through the steps described above or by a known automatic threshold setting method, or may be set in advance on the basis of a representative training music acoustic signal according to the type of music.
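Choosing the threshold that maximizes the between-class variance of a two-class split is the criterion commonly known as Otsu's method. A minimal sketch of that criterion applied to a list of log likelihood differences follows; the histogram resolution `n_bins` and the function name `otsu_threshold` are assumptions for illustration.

```python
def otsu_threshold(values, n_bins=64):
    """Return the histogram threshold that maximizes the between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    total = len(values)
    total_sum = sum(c * m for c, m in zip(counts, centers))
    best_score, best_thr = -1.0, lo
    w0 = 0      # number of samples in the lower class
    sum0 = 0.0  # sum of bin centers over the lower class
    for i in range(n_bins - 1):
        w0 += counts[i]
        sum0 += counts[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (total_sum - sum0) / w1
        # proportional to the between-class variance (constant factor 1/total^2 dropped)
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_thr = score, lo + (i + 1) * width
    return best_thr
```

With `diffs` holding the per-frame log likelihood differences of one piece of music, the estimation parameter would then be obtained as `eta = otsu_threshold(diffs) + eta_fixed`, where `eta_fixed` is the manually chosen task-dependent value.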

The weighting means 97 then weights the singing voice log likelihood and the non-singing voice log likelihood at each time using the estimation parameter η (step ST16A in FIG. 7). In this example, the log likelihoods used here are those calculated during the preprocessing. Specifically, the weighting means 97 approximates the output probabilities corresponding to the singing voice log likelihood and the non-singing voice log likelihood as in the following equations.

Here, N_GMM(x; θ) denotes the probability density function of a Gaussian mixture distribution (GMM), and η is the estimation parameter that adjusts the trade-off between the hit rate and the rejection rate. The parameters θ_V of the singing voice GMM and θ_N of the non-singing voice GMM are trained on the singing voice sections and non-singing voice sections of the training data, respectively. In the inventors' experiments, GMMs with 64 mixture components were used, and the effect was confirmed as described later.
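One way to write such a weighting, assuming that η acts as a threshold on the log likelihood difference and that the offset is split equally between the two states, is shown below. The symbols s_V and s_N denote the singing voice and non-singing voice states. The equal split is an assumption made for illustration; only the roles of N_GMM, θ_V, θ_N and η are taken from the description.

```latex
\log p(x \mid s_V) = \log \mathcal{N}_{\mathrm{GMM}}(x;\, \theta_V) - \tfrac{1}{2}\,\eta,
\qquad
\log p(x \mid s_N) = \log \mathcal{N}_{\mathrm{GMM}}(x;\, \theta_N) + \tfrac{1}{2}\,\eta
```

With this form, the two weighted log likelihoods are equal exactly when the log likelihood difference equals η, so raising η makes the singing voice state harder to enter and lowering it widens the singing voice sections.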

The maximum likelihood path calculation means 98 regards the plurality of weighted singing voice log likelihoods and the plurality of weighted non-singing voice log likelihoods obtained over the entire period of the music acoustic signal as the output probabilities of the singing voice state (S_V) and the non-singing voice state (S_N), respectively, of a hidden Markov model (step ST16B in FIG. 7). The maximum likelihood path calculation means 98 then calculates the maximum likelihood path through the singing voice and non-singing voice states over the entire period of the music acoustic signal (step ST16C in FIG. 7), and determines from that path the information on the singing voice sections and non-singing voice sections over the entire period. In other words, as shown in FIG. 8, a hidden Markov model (HMM) that moves back and forth between the singing voice state (S_V) and the non-singing voice state (S_N) is used to detect the singing voice. The singing voice state literally represents a state in which a singing voice is present, and the non-singing voice state represents a state in which no singing voice is present. For the feature vector sequence extracted from the input acoustic signal, the maximum likelihood path calculation means 98 searches for the maximum likelihood path through the singing voice and non-singing voice states, as in the following equation.

In the above equation, p(x | s_t) denotes the output probability of state s_t, and p(s_{t+1} | s_t) denotes the transition probability from state s_t to state s_{t+1}.
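The search for the maximum likelihood path through a two-state HMM is typically carried out with the Viterbi algorithm. The sketch below assumes that algorithm and assumes that η is applied as −η/2 to the singing voice state and +η/2 to the non-singing voice state; the function name `viterbi_vocal`, the state numbering, and the parameters `log_trans` and `log_init` are hypothetical.

```python
def viterbi_vocal(loglik_v, loglik_n, eta, log_trans, log_init=(0.0, 0.0)):
    """Most likely state sequence of a two-state HMM (0 = singing voice, 1 = non-singing voice).

    loglik_v, loglik_n: per-frame GMM log likelihoods for the two classes.
    eta:                estimation parameter, assumed split as -eta/2 and +eta/2.
    log_trans[i][j]:    log transition probability from state i to state j.
    """
    n = len(loglik_v)
    if n == 0:
        return []
    # weighted output log probabilities for each frame: (singing, non-singing)
    emit = [(lv - eta / 2.0, ln + eta / 2.0) for lv, ln in zip(loglik_v, loglik_n)]
    score = [log_init[s] + emit[0][s] for s in (0, 1)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, n):
        prev, new_score = [], []
        for s in (0, 1):
            cands = [score[p] + log_trans[p][s] for p in (0, 1)]
            best_p = 0 if cands[0] >= cands[1] else 1
            prev.append(best_p)
            new_score.append(cands[best_p] + emit[t][s])
        back.append(prev)
        score = new_score
    # trace the best path backwards from the final frame
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path
```

Runs of 0 in the returned path correspond to singing voice sections and runs of 1 to non-singing voice sections. Transition probabilities that favour staying in the same state keep a single weak frame from splitting a section.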

In normal estimation operation other than the preprocessing, the singing voice section estimation means 9 applies the weighting directly to the singing voice log likelihood and non-singing voice log likelihood that the log likelihood calculation means 92 computes from the singing voice section estimation features output at each time by the singing voice section estimation feature extraction means 7, and then calculates the maximum likelihood path. When the bias adjustment value η_dyn (threshold) for the singing voice and non-singing voice log likelihoods is determined in the preprocessing from the histogram of log likelihood differences, a value of η_dyn suited to the music acoustic signal is obtained. Weighting with the estimation parameter η defined from η_dyn then adjusts the singing voice and non-singing voice log likelihoods around the boundary between the two states, in line with the tendency of the singing voice section estimation features that arises from the acoustic characteristics of each song, so that the boundaries between singing voice sections and non-singing voice sections can be adjusted appropriately for the song.

Returning to FIG. 1, the temporal-alignment feature extraction means 11 extracts, from the dominant-sound acoustic signal at each time, temporal-alignment features suited to establishing the temporal correspondence between the lyrics of the singing voice and the dominant-sound acoustic signal. In this specific embodiment, 25-dimensional features, such as the resonance characteristics of phonemes, are extracted as the temporal-alignment features. This is preprocessing required for the subsequent alignment. Details are given later with reference to the Viterbi alignment analysis conditions shown in FIG. 9; the features extracted in this embodiment are 12-dimensional MFCCs, 12-dimensional ΔMFCCs, and Δpower, 25 dimensions in total.
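The 25-dimensional layout (12 MFCC + 12 ΔMFCC + 1 Δpower) can be illustrated as follows. This sketch assumes the 12 MFCCs and the per-frame log power have already been computed elsewhere, and it uses the common regression formula for deltas over ±2 frames; the text does not state how the deltas are obtained, so the window width and edge handling are assumptions.

```python
import numpy as np

def delta(feat, width=2):
    """Regression-based delta over +/- `width` frames (edge frames are repeated)."""
    feat = np.asarray(feat, dtype=float)
    T = feat.shape[0]
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(feat)
    for k in range(1, width + 1):
        ahead = padded[width + k : width + k + T]    # frame t + k
        behind = padded[width - k : width - k + T]   # frame t - k
        out += k * (ahead - behind)
    return out / denom

def alignment_features(mfcc, log_power):
    """Stack 12 MFCC + 12 delta-MFCC + 1 delta-power into a (T, 25) array."""
    mfcc = np.asarray(mfcc, dtype=float)
    log_power = np.asarray(log_power, dtype=float).reshape(-1, 1)
    return np.hstack([mfcc, delta(mfcc), delta(log_power)])
```

Note that only the delta of the power is kept, not the power itself, which is why the total is 25 rather than 26 dimensions.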

The phoneme network storage means 13 stores a phoneme network SN composed of a plurality of phonemes for the lyrics of the song corresponding to the music acoustic signal. For Japanese lyrics, for example, such a phoneme network SN is preferably built by converting the lyrics into a phoneme sequence and then converting each phrase boundary into a plurality of short pauses and each word boundary into a single short pause, so that the network is a phoneme sequence consisting only of vowels and short pauses. From the text data of the given lyrics, a grammar used for alignment (defined here as the "phoneme sequence for alignment") is created.

The phoneme sequence for alignment for Japanese lyrics consists only of short pauses (sp), that is, blanks, together with vowels and consonants. The reasons given include that unvoiced consonants have no harmonic structure and cannot be extracted by the accompaniment sound suppression method, and that voiced consonants are short in duration, which makes it difficult to estimate the fundamental frequency F0 stably. As concrete processing, the lyrics are first converted directly into a phoneme sequence (in effect, the same task as writing out the lyrics as read aloud in Roman letters) and are then converted into the phoneme sequence for alignment according to the following two rules (a grammar for Japanese).

Rule 1: Convert each sentence or phrase boundary in the lyrics into multiple short pauses (sp).

Rule 2: Convert each word boundary into a single short pause.

FIG. 10 shows an example of converting Japanese lyrics into a phoneme sequence for alignment (phoneme network). First, text data A representing a phrase of the original lyrics is converted into a sequence of phonemes B. Applying the above "grammar" to phoneme sequence B converts it into the "phoneme sequence for alignment" C, which consists only of vowels, consonants, and short pauses (sp).

In this example, the Japanese lyrics A, 「立ち止まる時またふと振り返る」, are converted into the phoneme sequence B, "tachidomaru toki mata futo furikaeru", and then into the phoneme sequence for alignment C, which consists of phonemes (vowels and consonants) and short pauses (sp). This phoneme sequence for alignment C is the phoneme network SN.
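The two rules can be sketched in a few lines. The example below is only an illustration under stated assumptions: the input is already romanized with spaces between words, one list element corresponds to one phrase, digraphs such as "ch" and "sh" are treated as single phonemes, and `PHRASE_SP = 3` is a hypothetical value, since the text says only that a phrase boundary becomes "multiple" short pauses.

```python
import re

# Assumption: these romaji digraphs count as single consonant phonemes.
_PHONEME = re.compile(r"ch|sh|ts|ky|gy|ny|hy|by|py|my|ry|[a-z]")
PHRASE_SP = 3  # hypothetical; the source only says "multiple" short pauses

def word_to_phonemes(word):
    """Split one romanized word into phoneme symbols."""
    return _PHONEME.findall(word.lower())

def lyrics_to_alignment_sequence(phrases, phrase_sp=PHRASE_SP):
    """phrases: list of romanized phrases with spaces between words."""
    seq = []
    for p_idx, phrase in enumerate(phrases):
        if p_idx > 0:
            seq.extend(["sp"] * phrase_sp)   # Rule 1: phrase boundary
        for w_idx, word in enumerate(phrase.split()):
            if w_idx > 0:
                seq.append("sp")             # Rule 2: word boundary
            seq.extend(word_to_phonemes(word))
    return seq
```

For the phrase in the example, `lyrics_to_alignment_sequence(["tachidomaru toki mata futo furikaeru"])` yields `t a ch i d o m a r u sp t o k i sp m a t a sp f u t o sp f u r i k a e r u`; the exact phoneme inventory shown in FIG. 10 may differ.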

Returning to FIG. 1, in order to execute step 3 described above, the alignment means 17 includes an acoustic model 15 for singing voice that estimates, from the temporal-alignment features, the phonemes corresponding to those features. The alignment means 17 executes an alignment operation that temporally associates the phonemes in the phoneme network with the dominant-sound acoustic signal. Specifically, the alignment means 17 receives as input the temporal-alignment features from the temporal-alignment feature extraction means 11, the information on singing voice sections and non-singing voice sections from the singing voice section estimation means 9, and the phoneme network from the phoneme network storage means 13. Using the acoustic model 15 for singing voice, it executes the alignment under the condition that no phoneme exists at least in the non-singing voice sections, and thereby automatically performs the temporal alignment of the music acoustic signal with the lyrics.

The alignment means 17 of this embodiment is configured to execute the alignment operation using Viterbi alignment. "Viterbi alignment" is a technique known in the field of speech recognition: it is one of the optimal-solution search methods based on the Viterbi algorithm, and it searches for the maximum likelihood path between an acoustic signal and a grammar (the phoneme sequence for alignment, i.e. the phoneme network). In executing the Viterbi alignment, the condition that no phoneme exists in the non-singing voice sections is expressed by requiring that at least the non-singing voice sections be short pauses (sp). In those short pause (sp) sections, the alignment operation is executed with the likelihoods of all other phonemes set to zero. Because the likelihoods of the other phonemes are zero in the short pause (sp) sections, the singing voice section information is exploited and highly accurate alignment can be performed.

FIG. 11 is a flowchart showing the algorithm of a program that implements the alignment means 17 on a computer using the form of Viterbi alignment called "frame-synchronous Viterbi search." The following description of the alignment operation takes Japanese lyrics as an example. In step ST101, t = 1 denotes the frame in which the first temporal-alignment feature (referred to simply as a "feature" in the description of FIG. 11) is input. In step ST102, an empty hypothesis with a score of 0 is created. Here a "hypothesis" means the sequence of phonemes up to the current time, so creating an empty hypothesis means starting with no phonemes at all.

Next, in step ST103, loop 1 processes every hypothesis currently held. Loop 1 performs the score calculation for each hypothesis held at the point when processing of the previous frame finished. Suppose, for example, that the signal is to be aligned with the phoneme network "a-i-sp-u-e...". The possible hypotheses (phoneme sequences) on reaching the sixth frame then include "a a a a a a", "a a a i i i", "a a i i sp u", and many others. During the search, these hypotheses are all held simultaneously while the calculation proceeds, and each has its own score. With six frames processed, the score is the likelihood (log likelihood) that the features of frames 1 to 6 correspond to a given phoneme sequence, for example "a a a i i i", calculated by comparing the features with the acoustic model. For example, when processing of the sixth frame (t = 6) is finished and processing of the seventh frame begins, the calculation is carried out for every hypothesis currently held. This is the processing of loop 1.

Next, in step ST104, each hypothesis is "expanded by one frame" on the basis of the phoneme network. "Expanding by one frame" means extending the length of the hypothesis by one frame. Because the expansion takes the frame at the next time into account, one hypothesis may be followed by different phonemes and so give rise to several new hypotheses. The phoneme network is consulted to find the phonemes that may come next. For the hypothesis "a a a i i i", for example, the phoneme network allows two new hypotheses for the next frame: "a a a i i i i", in which "i" continues, and "a a a i i i sp", in which the sequence moves on to the short pause sp. In this case, expanding one hypothesis by one frame produces two new hypotheses that extend to the next frame. In step ST105, loop 2 calculates the score of every new hypothesis generated by expanding all held hypotheses by one frame. The score is calculated in the same way as in loop 1. Since several new hypotheses are expanded from each held hypothesis, loop 2 is the loop that performs the score calculation for each of the newly expanded hypotheses.

Next, in step ST106, the singing voice section information from the singing voice section estimation means 9 is used to determine whether the t-th frame is in a singing voice section or the phoneme is a short pause (sp). Suppose, for example, that the singing voice section information states that the seventh frame is in a non-singing voice section. When the hypotheses are expanded at the seventh frame, the hypothesis "a a a i i i sp" is then possible but the hypothesis "a a a i i i i" is not. Such impossible hypotheses are rejected in step ST107. With singing voice section information available, impossible hypotheses can be rejected through steps ST106 and ST107, which makes the alignment easier. If the determination in step ST106 is Yes, processing proceeds to step ST108.

In step ST108, the acoustic score of the t-th feature is calculated from the input feature and the acoustic model and is added to the score of the hypothesis. That is, the t-th feature is compared with the acoustic model to calculate a log likelihood (score), and the calculated score is added to the hypothesis score. In effect, the score calculation measures how closely the feature resembles the information on the phonemes held in the acoustic model. Because the score is a logarithm, its value is −∞ when there is no resemblance at all. In step ST108, scores are calculated for all hypotheses. When the calculation in step ST108 is finished, processing proceeds to step ST109, where the hypothesis and its score are retained. Loop 2, which corresponds to step ST105, ends in step ST110, and loop 1, which corresponds to step ST103, ends in step ST111. Then, in step ST112, the current processing time is incremented by one (t + 1) to move to the next frame. In step ST113, it is determined whether the frame is the end of the input feature sequence. Steps ST103 to ST112 are repeated until all features have been input.

When all features have been processed, processing proceeds to step ST114. At this point, the comparison of the features with the acoustic model has reached the end of the phoneme network. From the hypotheses that have reached the end of the phoneme network, the hypothesis (phoneme sequence) with the highest total score is selected as the final hypothesis. This final hypothesis, that is, the phoneme sequence, is defined with respect to features that each correspond to a time, so it is a phoneme sequence synchronized with the music acoustic signal. The lyrics data displayed on the basis of this final phoneme sequence are therefore time-tagged lyrics, i.e. lyrics carrying the time information needed to synchronize them with the music acoustic signal.
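The flow of FIG. 11 can be condensed into a small dynamic-programming sketch. It is not the flowchart's hypothesis-list implementation: instead of holding explicit hypotheses it keeps only the best score per phoneme position, which finds the same best path for a linear phoneme sequence. The function name, the `(T, N)` log likelihood layout, and the simplification that every phoneme in the sequence (including each sp) must be visited in order are assumptions of this example.

```python
import numpy as np

def align(phonemes, loglik, is_vocal):
    """Frame-synchronous Viterbi alignment of a linear phoneme sequence.

    phonemes: list of N labels, e.g. ["a", "i", "sp", "u"]
    loglik:   (T, N) array; loglik[t, n] is the acoustic log likelihood of
              phonemes[n] at frame t (assumed to come from an acoustic model)
    is_vocal: length-T booleans from singing voice section estimation
    Returns one phoneme label per frame.
    """
    T, N = loglik.shape
    ll = np.array(loglik, dtype=float)
    not_sp = np.array([p != "sp" for p in phonemes])
    for t in range(T):
        if not is_vocal[t]:
            ll[t, not_sp] = -np.inf      # reject non-sp hypotheses (ST106/ST107)
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0, 0] = ll[0, 0]               # the path must start at the first phoneme
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1, n]                          # same phoneme continues
            move = score[t - 1, n - 1] if n > 0 else -np.inf  # advance to next phoneme
            if move > stay:
                best, back[t, n] = move, n - 1
            else:
                best, back[t, n] = stay, n
            score[t, n] = best + ll[t, n]
    if not np.isfinite(score[-1, -1]):
        raise ValueError("no path satisfies the constraints")
    # Backtrace from the end of the phoneme network (ST114)
    n = N - 1
    labels = [None] * T
    for t in range(T - 1, -1, -1):
        labels[t] = phonemes[n]
        n = back[t, n]
    return labels
```

Setting the masked entries to −∞ in the log domain corresponds to setting the likelihood of every phoneme other than sp to zero in non-singing frames.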

FIG. 12(A) shows how Viterbi alignment temporally associates the phoneme network (grammar) with the signal waveform S′ of the dominant-sound acoustic signal extracted from the music acoustic signal at each time (the waveform of the acoustic signal in which the accompaniment has been suppressed). After the alignment is complete, converting the phoneme sequence for alignment (grammar), which now carries time information, back into the lyrics finally yields "time-tagged lyrics data" containing the time information. For simplicity of illustration, FIG. 12(A) shows only the vowels.

FIG. 12(B) shows the state after the alignment is complete, in which converting the phoneme sequence (grammar) back into the lyrics has completed the temporal alignment of the lyrics with the music acoustic signal S, the mixed sound that includes the accompaniment. PA to PD are phrases of the lyrics.
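Turning a per-frame phoneme sequence into time tags amounts to collapsing runs of identical labels into segments. The sketch below assumes a frame shift of 10 ms, which is a placeholder rather than a value taken from the analysis conditions of FIG. 9, and the function name is hypothetical.

```python
def frames_to_segments(labels, frame_shift=0.01):
    """Collapse per-frame labels into (label, start_sec, end_sec) runs.

    frame_shift is the hop between frames in seconds (assumed value).
    """
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current run at the end of input or when the label changes
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((labels[start], start * frame_shift, t * frame_shift))
            start = t
    return segments
```

One limitation of working from labels alone is that two adjacent occurrences of the same phoneme in the network are merged into one segment; keeping the phoneme index from the alignment instead of the label would avoid that.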

Next, the acoustic model 15 for singing voice used by the alignment means 17 is described. Because the alignment is performed on the uttered content (lyrics) of a singing voice, the ideal acoustic model 15 would be one trained on a large amount of singing voice data. At present, however, no such database has been built. This embodiment therefore uses an acoustic model obtained by re-estimating (training) the parameters of an acoustic model for speech so that it can recognize the phonemes of the singing voice in a song containing singing voice and accompaniment.

The method of creating an acoustic model for singing voice from an acoustic model for speech (adaptation) has the following three stages. A step of preparing the "acoustic model for speech" is required beforehand, but as this is well known, its description is omitted.

(1) The acoustic model for speech is adapted to solo singing voice.

(2) The acoustic model for solo singing is adapted to separated singing voice extracted by the accompaniment sound suppression method.

(3) The acoustic model for separated singing voice is adapted to the specific song (specific singer) being input.

Stages (1) to (3) all correspond to the "training" processing in FIG. 2 and are carried out before run time.

In the adaptation of stage (1), as shown in FIG. 2, the acoustic model 101 for speech is adapted to the phoneme labels 102 (teacher information) and to singing voice 103 without accompaniment, i.e. solo singing, to generate the acoustic model 104 for solo singing. In the adaptation of stage (2), the acoustic model 104 for solo singing is adapted to the singing voice data 105, which consist of dominant-sound acoustic signals extracted by the accompaniment sound suppression method, and to the phoneme labels 102 (teacher information), to generate the acoustic model 106 for separated singing voice. In the adaptation of stage (3), the acoustic model 106 for separated singing voice is adapted to the phoneme labels (phoneme network) and the features of the specific input song to generate the acoustic model 107 for the specific singer. In the example of FIG. 2, the acoustic model 107 for the specific singer is used as the acoustic model 15 for singing voice of FIG. 1.

Stages (1) to (3) need not all be carried out. The acoustic model may be adapted with one stage or with an appropriate combination of stages: for example, stage (1) alone ("one-stage adaptation"), stages (1) and (2) ("two-stage adaptation"), or all of stages (1) to (3) ("three-stage adaptation").

Here, teacher information refers to the time information for each phoneme (its start time and end time). Accordingly, when the acoustic model for speech is adapted using teacher information such as the solo singing data 103 and the phoneme labels 102, the adaptation uses phoneme data that have been accurately segmented by that time information.

FIG. 13 shows an example of the adaptation phoneme labels 102 for Japanese lyrics with time information. The phoneme labels 102 in FIG. 13 were assigned manually. For parameter estimation during adaptation, maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) estimation can be combined. Combining MLLR and MAP here means that the result obtained by MLLR adaptation is used as the prior distribution (something like an initial value) in MAP estimation.
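The idea of using the MLLR result as the MAP prior can be illustrated for a single Gaussian mean. The formulas below are the common textbook forms (an affine transform of the mean for MLLR and a count-weighted interpolation for MAP), not equations taken from this document. The estimation of the MLLR transform `A`, `b` is omitted and assumed to be given, and the prior weight `tau` is a hypothetical value.

```python
import numpy as np

def mllr_mean(mu, A, b):
    """MLLR-style affine transform of a Gaussian mean: A @ mu + b."""
    return A @ mu + b

def map_mean(prior_mean, frames, gamma, tau=10.0):
    """MAP update of a Gaussian mean, with the MLLR result as prior_mean.

    frames: (T, D) adaptation vectors assigned to this Gaussian
    gamma:  (T,) occupation weights of those vectors
    tau:    weight of the prior (hypothetical value)
    """
    frames = np.asarray(frames, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    return (tau * prior_mean + gamma @ frames) / (tau + gamma.sum())
```

With little adaptation data the result stays close to the MLLR-transformed mean, and with more data it moves toward the sample mean of the adaptation frames.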

Specific techniques for adapting the acoustic model are described further below. FIG. 14 is a flowchart showing the details of the one-stage adaptation described above. In one-stage adaptation, solo singing data containing only singing voice, i.e. the adaptation music acoustic signal 103, are divided phoneme by phoneme according to the adaptation phoneme labels 102 for that signal. Using the data divided by phoneme, the parameters of the acoustic model 101 for speech are then re-estimated so that the phonemes of the singing voice can be recognized from the adaptation music acoustic signal 103, giving the acoustic model 104 for solo singing, which serves as the acoustic model 15 for singing voice. This acoustic model 104 is suitable when there is no accompaniment or when the accompaniment is quiet compared with the singing voice.

FIG. 15 is a flowchart showing the details of the two-stage adaptation described above. In two-stage adaptation, the dominant-sound acoustic signal 105, i.e. the most dominant sound including the singing voice, extracted from an adaptation music acoustic signal that contains accompaniment in addition to singing voice, is divided phoneme by phoneme according to the adaptation phoneme labels 102. Using the data divided by phoneme, the parameters of the acoustic model 104 for solo singing are then re-estimated so that the phonemes of the singing voice can be recognized from the dominant-sound acoustic signal 105, giving the acoustic model 106 for separated singing voice. This acoustic model 106 is suitable when the accompaniment is as loud as the singing voice.

FIG. 16 is a flowchart showing the details of the three-stage adaptation described above. Three-stage adaptation uses the dominant-sound acoustic signal S2 obtained by suppressing the accompaniment, with the accompaniment sound suppression method, in the music acoustic signal S1 containing singing voice and accompaniment that is input when the system is run. Using the temporal-alignment features extracted by the temporal-alignment feature extraction means 11 from the dominant-sound acoustic signal S2 (the most dominant sound including the singing voice, extracted from the music acoustic signal input to the system) together with the phoneme network SN for the input music acoustic signal, the parameters of the acoustic model 106 for separated singing voice are estimated so that the phonemes of the specific singer who sings the song can be recognized, giving the acoustic model 107 for the specific singer. Because this acoustic model 107 is specific to the singer, it gives the highest alignment accuracy.

In a music acoustic signal playback apparatus that plays back a music acoustic signal while displaying, on a display screen, lyrics temporally aligned with that signal, using the system of the present invention to display the aligned lyrics allows the music being played and the lyrics shown on the screen to be displayed in synchronization.

The method of the present invention for automatically performing temporal alignment of a music acoustic signal with lyrics is described with reference to FIGS. 1 and 2. First, the dominant-sound acoustic signal extraction means 5 extracts, from the music acoustic signal S1 of a song containing singing voice and accompaniment, the dominant-sound acoustic signal S2 of the most dominant sound, including the singing voice, at each time (dominant-sound acoustic signal extraction step). Next, the singing voice section estimation feature extraction means 7 extracts, from the dominant-sound acoustic signal S2 at each time, singing voice section estimation features that can be used to estimate singing voice sections, which contain singing voice, and non-singing voice sections, which do not (singing voice section estimation feature extraction step). The singing voice section estimation means 9 then estimates the singing voice sections and non-singing voice sections from the singing voice section estimation features and outputs information on those sections (singing voice section estimation step). The temporal-alignment feature extraction means 11 extracts, from the dominant-sound acoustic signal S2 at each time, temporal-alignment features suited to establishing the temporal correspondence between the lyrics of the singing voice and the music acoustic signal (temporal-alignment feature extraction step). Further, a phoneme network SN, in which the phonemes of the lyrics of the song corresponding to the music acoustic signal S1 are connected so that the time interval between any two adjacent phonemes is adjustable, is stored in the phoneme network storage means 13 (storage step). The alignment means 17, which includes the acoustic model 15 for singing voice that estimates from the temporal-alignment features the phonemes corresponding to those features, then executes the alignment operation that temporally associates the phonemes in the phoneme network SN with the dominant-sound acoustic signal S2 (alignment step). In this alignment step, the alignment means 17 receives as input the temporal-alignment features obtained in the temporal-alignment feature extraction step, the information on singing voice sections and non-singing voice sections, and the phoneme network SN, and, using the acoustic model 15 for singing voice, executes the alignment operation under the condition that no phoneme exists at least in the non-singing voice sections.

In general, singing voice detection is evaluated by the hit rate and the correct rejection rate. The hit rate is the proportion of the regions that actually contain singing voice that were correctly detected as singing voice sections, and the correct rejection rate is the proportion of the regions that actually contain no singing voice that were correctly rejected as non-singing voice sections. The singing voice section estimation means 9 adopted in the above embodiment has a mechanism for adjusting the balance between the hit rate and the correct rejection rate. Such a mechanism is needed because the two rates are in a trade-off relationship, and the appropriate balance depends on, for example, the application. Because singing voice section estimation serves as preprocessing for the Viterbi alignment, it is generally desirable to keep the hit rate reasonably high so that any region with even a small possibility of containing singing voice is detected without omission. On the other hand, for applications such as singer identification, the correct rejection rate should be kept high so that only portions that certainly contain singing voice are detected. Incidentally, none of the conventional techniques for singing voice detection allowed the balance between the hit rate and the correct rejection rate to be adjusted.
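The two rates and their trade-off can be made concrete with a short Python sketch. It assumes frame-wise boolean labels and, purely for illustration, a detector that thresholds a per-frame score directly; the embodiment uses a hidden Markov model rather than a frame-wise threshold. The names `hit_and_rejection_rates` and `detect_vocal` and all numeric values are hypothetical.

```python
def hit_and_rejection_rates(reference, detected):
    """Return (hit rate, correct rejection rate) for frame-wise labels.

    reference: list of bool, True where singing voice is actually present
    detected:  list of bool, True where singing voice was detected
    """
    if len(reference) != len(detected):
        raise ValueError("label sequences must have the same length")
    vocal_total = sum(1 for r in reference if r)
    nonvocal_total = len(reference) - vocal_total
    hits = sum(1 for r, d in zip(reference, detected) if r and d)
    rejections = sum(1 for r, d in zip(reference, detected) if not r and not d)
    hit_rate = hits / vocal_total if vocal_total else float("nan")
    rejection_rate = rejections / nonvocal_total if nonvocal_total else float("nan")
    return hit_rate, rejection_rate


def detect_vocal(scores, threshold):
    """Simplified detector: a frame is vocal when its score exceeds the threshold."""
    return [s > threshold for s in scores]


# Hypothetical per-frame scores (larger = more singing-like) and ground truth.
scores = [2.0, 0.5, -0.2, -1.0]
reference = [True, True, False, False]

# A low threshold favours the hit rate, a high one the correct rejection rate.
low = hit_and_rejection_rates(reference, detect_vocal(scores, -0.5))
high = hit_and_rejection_rates(reference, detect_vocal(scores, 1.0))
```

With these values the low threshold detects every vocal frame but rejects only half of the non-vocal frames, and the high threshold does the opposite, which is the trade-off described above.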

Next, the evaluation results of an embodiment to which the present invention is applied will be described.

The method according to the present invention was applied to commercially available digital music data and lyrics data, and the display of lyrics synchronized with playback was verified experimentally. The results confirmed that the method can robustly associate lyrics in time with real-world music acoustic signals containing a variety of accompaniment sounds. The method of the evaluation experiments is described below.

(Experimental method)
Ten songs by ten singers (five songs by male singers and five by female singers) were randomly selected from the popular music database (RWC-MDB-P-2001) registered in the RWC Music Database, one of the publicly available music databases for research.

Most of each song is sung in Japanese, but some parts are sung in English. In this experiment, English phonemes were approximated using the acoustic models of similar Japanese phonemes. The songs were evaluated by 5-fold cross-validation within each gender. That is, when a song sung by a given singer was evaluated, the acoustic model was adapted using the other songs sung by singers of the same gender.
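The gender-wise cross-validation can be written down as a simple data split. The Python sketch below assumes five songs per gender, each by a different singer, so that holding out one song also holds out its singer. The song identifiers are placeholders, not the actual RWC song numbers, and the function name is hypothetical.

```python
def genderwise_folds(songs):
    """Build one fold per song.

    songs: list of (song_id, gender) pairs
    Returns a list of (test_song, adaptation_songs), where the adaptation
    songs are all other songs sung by singers of the same gender.
    """
    folds = []
    for song_id, gender in songs:
        adaptation = [s for s, g in songs if g == gender and s != song_id]
        folds.append((song_id, adaptation))
    return folds


# Placeholder identifiers: five songs by male and five by female singers.
songs = [("M%d" % i, "male") for i in range(1, 6)] + \
        [("F%d" % i, "female") for i in range(1, 6)]
folds = genderwise_folds(songs)
```

Each of the ten folds adapts the acoustic model on four songs and evaluates on the held-out fifth song of the same gender.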

As training data for the singing voice section detection method, 19 songs by 11 randomly selected singers were used. These songs were also taken from the "RWC Music Database: Popular Music" (RWC-MDB-P-2001).

Because these 11 singers were used for training, they are not among the 10 singers used for evaluation. The accompaniment sound suppression method was also applied to the training data for singing voice section detection. The value of η_fixed was set to 15.
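The claims describe the estimation parameter as a song-dependent bias adjustment value, chosen as the threshold that maximizes the between-class variance of the log likelihood differences, plus a task-dependent value. The Python sketch below follows that description under two assumptions that should be read as such: η_fixed is taken to be the task-dependent value, and an exhaustive search over the sorted samples stands in for the histogram-based search. The function names and sample values are hypothetical.

```python
def between_class_threshold(values):
    """Threshold that maximizes the between-class variance (Otsu-style).

    values: log likelihood differences collected over the whole song.
    Candidate thresholds are midpoints between consecutive distinct samples.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_var, best_thr = -1.0, xs[0]
    left_sum = 0.0
    for k in range(1, n):
        left_sum += xs[k - 1]
        if xs[k] == xs[k - 1]:
            continue  # no threshold fits between equal samples
        w0 = k / n                      # weight of the lower class
        w1 = 1.0 - w0                   # weight of the upper class
        m0 = left_sum / k               # mean of the lower class
        m1 = (total - left_sum) / (n - k)  # mean of the upper class
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_thr = var, (xs[k - 1] + xs[k]) / 2
    return best_thr


def estimation_parameter(values, eta_fixed=15.0):
    """Song-dependent bias adjustment value plus the task-dependent value."""
    return between_class_threshold(values) + eta_fixed


# Hypothetical log likelihood differences with two clear clusters.
diffs = [-3.0, -2.5, -2.0, 2.0, 2.5, 3.0]
```

For these samples the threshold falls midway between the two clusters, and the estimation parameter is that threshold shifted by the task-dependent value.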

FIG. 9, described above, shows the analysis conditions for the Viterbi alignment. As the initial acoustic model, the gender-independent monophone model in the CSRC software was used. To convert the lyrics into a phoneme sequence, the Japanese morphological analysis system ChaSen was run and the reading information it outputs was used. The Hidden Markov Model Toolkit (HTK) was used for acoustic model adaptation.

Evaluation was based on phrase-level alignment. In this experiment, a phrase means a passage delimited by spaces or line breaks in the original lyrics.

FIG. 17 is a diagram for explaining the evaluation criteria. As shown in FIG. 17, a "correct" region is a time span in which the ground-truth label and the output overlap, and everything else is "incorrect". The accuracy was defined as the ratio of the total length of the correct regions to the total length of the song (the sum of the lengths of the correct and incorrect regions): accuracy = (length of "correct" regions) / (total length of the song). In the example of FIG. 10, for instance, 「立ち止まる時」 ("when you stop") and 「またふと振り返る」 ("turn back again") each constitute one phrase.

As the overall evaluation criterion, the proportion of the total length of the song in which the phrase-level labels were correct was calculated. When this accuracy exceeded 90%, the song was judged to be correctly aligned.
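The evaluation criterion can be sketched in a few lines of Python. The sketch assumes that each phrase is given as a (start, end) interval in seconds, that the reference and output lists contain the same phrases in the same order, and that only time covered by the same phrase in both counts as correct; whether silent gaps between phrases also count as correct is not stated in the text and is left out here. The function names and the example times are hypothetical.

```python
def alignment_accuracy(reference, output, song_length):
    """Length of correctly labeled time divided by the total song length.

    reference, output: lists of (start, end) in seconds, one pair per phrase,
                       in the same phrase order
    song_length:       total length of the song in seconds
    """
    correct = 0.0
    for (ref_start, ref_end), (out_start, out_end) in zip(reference, output):
        overlap = min(ref_end, out_end) - max(ref_start, out_start)
        correct += max(0.0, overlap)  # non-overlapping phrases add nothing
    return correct / song_length


def is_correctly_aligned(accuracy, threshold=0.9):
    """A song counts as correctly aligned when accuracy exceeds 90%."""
    return accuracy > threshold


# Hypothetical two-phrase song of 10 seconds.
reference = [(0.0, 4.0), (4.0, 10.0)]
output = [(0.25, 4.0), (4.25, 10.0)]
accuracy = alignment_accuracy(reference, output, 10.0)
```

In this example each phrase starts a quarter of a second late in the output, so 9.5 of the 10 seconds are labeled correctly and the song passes the 90% criterion.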

(Evaluation of the entire system)
To evaluate the performance of the proposed method as a whole, an experiment was conducted using the method according to the invention.

FIGS. 18(A) and 18(B) show the results of the evaluation experiment for confirming the effect of the present invention. As shown in FIG. 18(A), an alignment accuracy of 90% or higher was achieved for 8 of the 10 songs, the exceptions being #007 and #013. FIG. 18(B) is a table listing the average error of the phrase start times for each song.

These results show that the method can estimate the temporal correspondence with sufficient accuracy for 8 of the 10 songs. It can also be seen that the accuracy for male voices is higher than that for female voices. This is because female voices generally have a higher F0 than male voices, which makes it difficult to extract spectral features such as MFCCs. Typical errors occurred in parts where humming or the like that is not written in the lyrics was sung.

(Confirmation of the effect of acoustic model adaptation)
To confirm the effect of adapting the acoustic model, alignment experiments were performed under the following four conditions.

(i) No adaptation: no acoustic model adaptation was performed.

(ii) One-stage adaptation: the acoustic model for speech was adapted directly to separated singing voice. No unsupervised adaptation to the specific singer was performed.

(iii) Two-stage adaptation: the acoustic model for speech was first adapted to solo singing voice and then adapted to separated singing voice. No unsupervised adaptation to the specific singer was performed.

(iv) Three-stage adaptation (proposed method): the acoustic model for speech was first adapted to solo singing voice and then adapted to separated singing voice. Finally, unsupervised adaptation to the specific singer of the input acoustic signal was performed.

In this experiment, accompaniment sound suppression (step 1) and singing voice section detection (step 2) were applied under all of conditions (i) to (iv).

FIGS. 19(A) and 19(B) show the results of the experiments under conditions (i) to (iv). FIG. 19(A) shows the alignment accuracy for each song under each condition, and FIG. 19(B) lists those accuracies numerically in a table.

These results show that adaptation has a certain effect for all songs. In particular, condition (iv) gives the highest accuracy. In this sense, condition (iv) can be said to be the best mode for carrying out the invention.

(Evaluation of singing voice section detection)
Next, to confirm the effectiveness of the singing voice section detection described in step 2, the hit rate and the correct rejection rate of singing voice section detection were examined for each song.

In addition, the contribution of singing voice section detection itself was evaluated by running the experiment under two conditions, with and without singing voice section detection. In this experiment, the three-stage adaptation method (steps 1 to 3) was used for all adaptation processing.

FIG. 20(A) shows the hit rate and the correct rejection rate of singing voice section detection for each song. FIG. 20(B) compares the alignment accuracy for each song with and without singing voice section detection.

From these results, it can be concluded that, on average, applying singing voice section detection improved the alignment accuracy. As is clear from FIG. 20(B), the improvement is particularly large when singing voice section detection is applied to songs whose accuracy is relatively low. However, for #007 and #013, singing voice section detection had little effect even though it was applied to songs whose accuracy was low to begin with. The likely reason is that, as seen in FIG. 20(A), the correct rejection rate of singing voice section detection was not high for these songs, so the non-singing voice sections could not be removed sufficiently.

Conversely, for songs whose alignment accuracy was high to begin with, such as #012 and #037, applying singing voice section detection slightly lowered the accuracy. This is presumably because a singing voice section erroneously removed (rejected) by singing voice section detection is always incorrect at alignment time.

As described above, the operation of the present invention was confirmed by experiments using songs with Japanese lyrics. For English songs, it was also confirmed that temporal association can be estimated with relatively high accuracy by converting each English phoneme to the Japanese phoneme closest in pronunciation when creating the phoneme network. If an acoustic model and acoustic model adaptation data appropriate to the language of the target song can be prepared, temporal association can be estimated with higher accuracy for songs in other languages, including English.

Furthermore, using higher-level musical structure information, such as partially repeated sections and tempo, is expected to enable more advanced temporal association between music and lyrics.

At present, the method for temporally associating a music acoustic signal with lyrics according to the present invention is made up of independent programs for the individual steps, distributed in forms such as toolkits, but with appropriate programming for the intended use it could also be implemented as a single computer program. The following are conceivable as specific applications of the present invention.

(Application example 1) Display of lyrics synchronized with playback
This application displays lyrics in synchronization with playback. The present inventors also developed software for playing digital music data that changes the color of the lyrics in temporal synchronization with music playback, based on time-tagged lyrics. With it, they succeeded in changing the color of the lyrics in synchronization with the singing voice being played, and confirmed that the alignment accuracy was as described above.

Lyrics that are displayed on the screen and change color along with the singing voice look, at first glance, like so-called karaoke, but the lyrics followed the phrases very accurately, giving the impression that listening to the song becomes more fulfilling. Moreover, this differs fundamentally from conventional displays in that the association is made automatically by a program, without human intervention.

(Application example 2) Cueing a song using lyrics
When time information for the lyrics has been obtained by the method according to the present invention, it is also possible to program a player so that the lyrics are displayed in advance and, when part of the lyrics is clicked, playback starts from that point.

By adding a function to the digital music data playback software mentioned above, the present inventors succeeded in starting playback from the point where the lyrics are clicked. This function had not been realized before, and it can be said to provide a new way of listening to music in that the user can enjoy a song while actively selecting favorite parts.
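Both application examples reduce to lookups in time-tagged lyrics. The Python sketch below is a minimal illustration, not the inventors' playback software: it assumes the alignment result is available as a list of phrase start times in ascending order, and the start times attached to the two example phrases are invented for illustration. The function names are hypothetical.

```python
import bisect


def phrase_at(start_times, playback_time):
    """Index of the phrase being sung at playback_time (application example 1).

    start_times: phrase start times in seconds, in ascending order
    Returns None before the first phrase begins.
    """
    i = bisect.bisect_right(start_times, playback_time) - 1
    return i if i >= 0 else None


def seek_time(start_times, phrase_index):
    """Playback position for a clicked phrase (application example 2)."""
    return start_times[phrase_index]


# Hypothetical time tags for the two phrases used as examples in the text.
lyrics = [(12.0, "立ち止まる時"), (15.5, "またふと振り返る")]
start_times = [start for start, _ in lyrics]

current = phrase_at(start_times, 13.0)  # phrase to highlight during playback
target = seek_time(start_times, 1)      # position to jump to after a click
```

A player would call `phrase_at` on every display update to decide which phrase to highlight, and `seek_time` when the user clicks a phrase.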

Although application examples 1 and 2 above use digital music data playback software developed by the present inventors, the invention is not limited to this, and other digital music data playback software may of course be used.

The present invention is expected to be applied in industrial fields such as music listening support technology and retrieval technology, and its importance is considered to be growing further with the recent spread of digital music data distribution services.

FIG. 1 is a block diagram showing the configuration of the function implementation means realized in a computer when an embodiment of the system for automatically performing temporal association between a music acoustic signal and lyrics is implemented using the computer.

FIG. 2 is a flowchart showing the steps when the embodiment of FIG. 1 is implemented by executing a program on a computer.

FIG. 3 is a diagram showing the procedure of the accompaniment sound suppression processing.

FIGS. 4(A) to 4(D) are waveform diagrams used to explain how the dominant sound acoustic signal is extracted from the music acoustic signal.

FIG. 5 is a block diagram showing a specific configuration of the singing voice section estimation means.

FIG. 6 is a flowchart for the case where the singing voice section estimation means shown in FIG. 5 is implemented by a program.

FIG. 7 is a flowchart for the case where the detection of singing voice sections is implemented by a program.

FIG. 8 is a diagram used to explain the use of a hidden Markov model (HMM) that moves back and forth between a singing voice state (s_V) and a non-singing voice state (s_N).

FIG. 9 is a diagram showing the analysis conditions for the Viterbi alignment.

FIG. 10 is a diagram showing an example of the conversion from lyrics to a phoneme sequence for alignment.

FIG. 11 is a flowchart showing the algorithm of the program when the alignment means is implemented on a computer by a program.

FIG. 12(A) is a diagram showing how the phoneme network is temporally associated, using Viterbi alignment, with the signal waveform of the dominant sound acoustic signal extracted from the music acoustic signal at each time, and FIG. 12(B) is a diagram showing how, after the alignment is completed, the temporal association between the lyrics and the music acoustic signal of the mixed sound including accompaniment sounds is completed by converting the phoneme sequence back to the lyrics.

FIG. 13 is a diagram showing an example of an adaptation phoneme label with time information.

FIG. 14 is a flowchart showing the flow of creating an acoustic model.

FIG. 15 is a flowchart showing the flow of creating an acoustic model.

FIG. 16 is a flowchart showing the flow of creating an acoustic model.

FIG. 17 is a diagram for explaining the evaluation criteria.

FIGS. 18(A) and 18(B) show the results of the evaluation experiment for confirming the effect of the present invention.

FIGS. 19(A) and 19(B) show the results of the experiments under conditions (i) to (iv). FIG. 19(A) shows the alignment accuracy for each song under each condition, and FIG. 19(B) lists those accuracies numerically in a table.

FIG. 20(A) shows the hit rate and the correct rejection rate of singing voice section detection for each song, and FIG. 20(B) compares the alignment accuracy for each song with and without singing voice section detection.

Explanation of symbols

1 System for automatically performing temporal association between a music acoustic signal and lyrics
3 Music acoustic signal storage means
5 Dominant sound acoustic signal extraction means
7 Feature extraction means for singing voice section estimation
9 Singing voice section estimation means
11 Temporal association feature extraction means
13 Phoneme network storage means
15 Singing voice acoustic model
17 Alignment means

Claims

A system for automatically performing temporal association between a music acoustic signal and lyrics, comprising:
dominant sound acoustic signal extraction means for extracting, from a music acoustic signal of a song including a singing voice and accompaniment sounds, a dominant sound acoustic signal of the most dominant sound including the singing voice at each time;
feature extraction means for singing voice section estimation for extracting, from the dominant sound acoustic signal at each time, singing voice section estimation features usable for estimating singing voice sections that include the singing voice and non-singing voice sections that do not include the singing voice;
singing voice section estimation means for estimating the singing voice sections and the non-singing voice sections based on a plurality of the singing voice section estimation features, and outputting information on the singing voice sections and the non-singing voice sections;
temporal association feature extraction means for extracting, from the dominant sound acoustic signal at each time, temporal association features suitable for temporally associating the lyrics of the singing voice with the music acoustic signal;
phoneme network storage means for storing a phoneme network composed of a plurality of phonemes and short pauses for the lyrics of the song corresponding to the music acoustic signal; and
alignment means that has a singing voice acoustic model for estimating, based on the temporal association features, the phonemes corresponding to those features, and that executes an alignment operation for temporally associating the plurality of phonemes in the phoneme network with the dominant sound acoustic signal,
wherein the alignment means receives as inputs the temporal association features output from the temporal association feature extraction means, the information on the singing voice sections and the non-singing voice sections, and the phoneme network, and executes the alignment operation under the condition that no phoneme exists at least in the non-singing voice sections.

The system for automatically performing temporal association between a music acoustic signal and lyrics according to claim 1, wherein the singing voice section estimation means includes Gaussian distribution storage means for storing a plurality of Gaussian mixture distributions of singing voice and non-singing voice obtained in advance by training on a plurality of training songs, and
the singing voice section estimation means is configured to estimate the singing voice sections and the non-singing voice sections based on the plurality of singing voice section estimation features and the plurality of Gaussian mixture distributions.

The system for automatically performing temporal association between a music acoustic signal and lyrics according to claim 2, wherein the singing voice section estimation means comprises:
log likelihood calculation means for calculating, at each time, a singing voice log likelihood and a non-singing voice log likelihood based on the singing voice section estimation features at each time and the Gaussian mixture distributions;
log likelihood difference calculation means for calculating, at each time, the log likelihood difference between the singing voice log likelihood and the non-singing voice log likelihood;
histogram creation means for creating a histogram of the plurality of log likelihood differences obtained over the whole period of the music acoustic signal;
bias adjustment value determination means for determining, as a song-dependent bias adjustment value, the threshold that maximizes the between-class variance when the histogram is divided into two song-dependent classes, namely the log likelihood differences in the singing voice sections and the log likelihood differences in the non-singing voice sections;
estimation parameter determination means for determining an estimation parameter used for estimating the singing voice sections by adding a task-dependent value to the bias adjustment value in order to correct the bias adjustment value;
weighting means for weighting the singing voice log likelihood and the non-singing voice log likelihood at each time using the estimation parameter; and
maximum likelihood path calculation means for regarding the plurality of weighted singing voice log likelihoods and the plurality of weighted non-singing voice log likelihoods obtained over the whole period of the music acoustic signal as the output probabilities of the singing voice state (s_V) and the non-singing voice state (s_N) of a hidden Markov model, respectively, calculating the maximum likelihood path of the singing voice state and the non-singing voice state over the whole period of the music acoustic signal, and determining from the maximum likelihood path the information on the singing voice sections and the non-singing voice sections over the whole period of the music acoustic signal.

The system for automatically performing temporal association between a music acoustic signal and lyrics described above, wherein the weighting means approximates the output probability log p(x | s_V) of the singing voice state (s_V) and the output probability log p(x | s_N) of the non-singing voice state (s_N) by the following equation:
[Equation not reproduced in the source text]
In the above equation, N_GMM(x; θ_V) represents the probability density function of the Gaussian mixture model (GMM) of singing voice, N_GMM(x; θ_N) represents the probability density function of the Gaussian mixture model (GMM) of non-singing voice, θ_V and θ_N are parameters determined in advance by training on the plurality of training songs, and η is the estimation parameter,
and the maximum likelihood path calculation means calculates the maximum likelihood path using the following equation:
[Equation not reproduced in the source text]
In the above equation, p(x | s_t) represents the output probability of the state s_t, and p(s_{t+1} | s_t) represents the transition probability from the state s_t to the state s_{t+1}.

The system for automatically performing temporal association between a music acoustic signal and lyrics according to claim 1, wherein the alignment means is configured to execute the alignment operation using Viterbi alignment, and
in executing the Viterbi alignment, a condition that at least the non-singing voice sections are treated as short pauses is imposed as the condition that no phoneme exists in the non-singing voice sections, and the alignment operation is executed with the likelihood of the other phonemes set to zero in the short pauses.

The system for automatically performing temporal association between a music acoustic signal and lyrics described above, wherein the singing voice acoustic model is an acoustic model obtained by re-estimating the parameters of an acoustic model for speech so that the phonemes of the singing voice in a song including a singing voice and accompaniment sounds can be recognized.

The system for automatically performing temporal association between a music acoustic signal and lyrics according to claim 6, wherein the acoustic model is an acoustic model for solo singing obtained by re-estimating the parameters of the acoustic model for speech, using an adaptation music acoustic signal of solo singing that includes only a singing voice and adaptation phoneme labels for that adaptation music acoustic signal, so that the phonemes of the singing voice can be recognized from the adaptation music acoustic signal.

The system for automatically performing temporal association between a music acoustic signal and lyrics according to claim 6, wherein the acoustic model is an acoustic model for separated singing voice obtained by:
preparing an acoustic model for solo singing obtained by re-estimating the parameters of the acoustic model for speech, using an adaptation music acoustic signal of solo singing that includes only a singing voice and adaptation phoneme labels for that adaptation music acoustic signal, so that the phonemes of the singing voice can be recognized from the adaptation music acoustic signal; and
re-estimating the parameters of the acoustic model for solo singing, using the dominant sound acoustic signal of the most dominant sound including the singing voice extracted from an adaptation music acoustic signal that includes accompaniment sounds in addition to the singing voice and adaptation phoneme labels for that dominant sound acoustic signal, so that the phonemes of the singing voice can be recognized from the dominant sound acoustic signal.

The system for automatically performing temporal association between a music acoustic signal and lyrics described above, wherein the acoustic model is an acoustic model for a specific singer obtained by:
preparing an acoustic model for solo singing obtained by re-estimating the parameters of the acoustic model for speech, using an adaptation music acoustic signal of solo singing that includes only a singing voice and adaptation phoneme labels for that adaptation music acoustic signal, so that the phonemes of the singing voice can be recognized from the adaptation music acoustic signal;
next, preparing an acoustic model for separated singing voice obtained by re-estimating the parameters of the acoustic model for solo singing, using the dominant sound acoustic signal of the most dominant sound including the singing voice extracted from an adaptation music acoustic signal that includes accompaniment sounds in addition to the singing voice and adaptation phoneme labels for that dominant sound acoustic signal, so that the phonemes of the singing voice can be recognized from the dominant sound acoustic signal; and
next, estimating the parameters of the acoustic model for separated singing voice, using the plurality of temporal association features stored in temporal association feature storage means and the phoneme network stored in the phoneme network storage means, so that the phonemes of the specific singer who sings the song of the music acoustic signal input to the dominant sound acoustic signal extraction means can be recognized.

A music acoustic signal reproducing apparatus that reproduces a music acoustic signal while displaying, on a display screen, lyrics temporally associated with the music acoustic signal,
wherein the lyrics temporally associated with the music acoustic signal by the system according to claim 1 are displayed on the display screen.

A method for automatically performing temporal association between a music acoustic signal and lyrics, comprising:
a dominant sound acoustic signal extraction step in which dominant sound acoustic signal extraction means extracts, from the music acoustic signal of music including a singing voice and accompaniment sound, a dominant sound acoustic signal of the most dominant sound including the singing voice at each time;
a singing voice section estimation feature extraction step in which singing voice section estimation feature extraction means extracts, from the dominant sound acoustic signal at each time, singing voice section estimation features that can be used to estimate a singing voice section that includes the singing voice and a non-singing voice section that does not include the singing voice;
a singing voice section estimation step in which singing voice section estimation means estimates the singing voice section and the non-singing voice section based on a plurality of the singing voice section estimation features, and outputs information on the singing voice section and the non-singing voice section;
a temporal association feature extraction step in which temporal association feature quantity extraction means extracts, from the dominant sound acoustic signal at each time, temporal association feature quantities suitable for establishing temporal correspondence between the lyrics of the singing voice and the music acoustic signal;
a storing step of storing, in phoneme network storage means, a phoneme network composed of a plurality of phonemes and short pauses for the lyrics of the music corresponding to the music acoustic signal; and
an alignment step in which alignment means, having a singing voice acoustic model that estimates the phoneme corresponding to each temporal association feature quantity based on that feature quantity, executes an alignment operation that temporally associates the plurality of phonemes in the phoneme network with the dominant sound acoustic signal,
wherein, in the alignment step, the alignment means receives as input the temporal association feature quantities obtained in the temporal association feature extraction step, the information on the singing voice section and the non-singing voice section, and the phoneme network, and executes the alignment operation under the condition that no phoneme exists at least in the non-singing voice section.
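
The condition that no phoneme exists in a non-singing voice section can be illustrated with a small Viterbi forced alignment. This is only a sketch under stated assumptions, not the claimed implementation: the function name `constrained_align`, the pause symbol `"sp"`, the per-frame log-likelihood dictionaries, and the simple stay-or-advance topology over a linear phoneme network are all hypothetical choices made for illustration.

```python
NEG_INF = float("-inf")


def constrained_align(frame_scores, network, is_vocal, pause="sp"):
    """Align frames to a linear phoneme network with Viterbi decoding.

    frame_scores : list of dicts, symbol -> log-likelihood, one per frame
    network      : list of symbols, e.g. ["sp", "a", "sp", "i", "sp"]
    is_vocal     : list of bools, True where the frame is a singing voice frame
    Returns the symbol assigned to each frame.
    """
    n_frames, n_states = len(frame_scores), len(network)

    def emit(t, j):
        sym = network[j]
        # Constraint: in a non-singing frame only the short pause is allowed.
        if not is_vocal[t] and sym != pause:
            return NEG_INF
        return frame_scores[t].get(sym, NEG_INF)

    best = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    best[0][0] = emit(0, 0)  # the path must start in the first state

    for t in range(1, n_frames):
        for j in range(n_states):
            stay = best[t - 1][j]
            move = best[t - 1][j - 1] if j > 0 else NEG_INF
            prev, score = (j - 1, move) if move > stay else (j, stay)
            best[t][j] = score + emit(t, j)
            back[t][j] = prev

    if best[-1][-1] == NEG_INF:
        raise ValueError("no alignment satisfies the constraints")

    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [network[j] for j in path]
```

In a toy case where an accompaniment-only frame happens to score well for a vowel, passing the estimated `is_vocal` flags forces that frame onto the short pause, whereas passing all-True flags lets the vowel extend into it. This is the kind of error in non-vocal sections that the constraint is meant to prevent.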

A program for temporally associating a music acoustic signal with lyrics, the program causing a computer, in order to establish temporal correspondence between the music acoustic signal of music including a singing voice and accompaniment sound and the lyrics, to function as:
dominant sound acoustic signal extraction means for extracting, from the music acoustic signal, a dominant sound acoustic signal of the most dominant sound including the singing voice at each time;
singing voice section estimation feature extraction means for extracting, from the dominant sound acoustic signal at each time, singing voice section estimation features that can be used to estimate a singing voice section that includes the singing voice and a non-singing voice section that does not include the singing voice;
singing voice section estimation means for estimating the singing voice section and the non-singing voice section based on a plurality of the singing voice section estimation features, and outputting information on the singing voice section and the non-singing voice section;
temporal association feature quantity extraction means for extracting, from the dominant sound acoustic signal at each time, temporal association feature quantities suitable for establishing temporal correspondence between the lyrics of the singing voice and the dominant sound acoustic signal;
phoneme network storage means for storing a phoneme network composed of a plurality of phonemes and short pauses for the lyrics of the music corresponding to the music acoustic signal; and
alignment means, having a singing voice acoustic model that estimates the phoneme corresponding to each temporal association feature quantity based on that feature quantity, for executing an alignment operation that temporally associates the plurality of phonemes in the phoneme network with the dominant sound acoustic signal,
wherein the program causes the alignment means to receive as input the temporal association feature quantities output from the temporal association feature quantity extraction means, the information on the singing voice section and the non-singing voice section, and the phoneme network, and to execute the alignment operation under the condition that no phoneme exists at least in the non-singing voice section.

A computer-readable recording medium on which the program according to claim 12 is recorded.