JPH037996A

JPH037996A - Generating device for singing voice synthetic data

Info

Publication number: JPH037996A
Application number: JP1142404A
Authority: JP
Inventors: Kanji Kunisawa; 国澤　寛治; Noboru Uechi; 上地　登; Akira Yamamura; 山村　彰; Junko Omukai; 大向　順子
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 1989-06-05
Filing date: 1989-06-05
Publication date: 1991-01-16

Abstract

PURPOSE: To synthesize a singing voice with accurate pitch intervals by correcting the pitch of pitch data extracted from a singing voice to the pitch of the closest scale note. CONSTITUTION: A pitch data correction part 7 corrects the pitch of the pitch data found from an actual singing voice to the closest pitch in a pitch-scale table 7a. Correcting the pitch data extracted from the singing voice in this way gives the same effect as inputting the pitch data from a score, so a singing voice with accurate pitch intervals can be synthesized.

Description

[Detailed Description of the Invention]
[Industrial Field of Application] The present invention relates to a device for creating singing-voice synthesis data that is input to a speech synthesis circuit, which performs rule-based synthesis of a singing voice from speech data consisting of phonological information and prosodic information.

[Prior Art] In general, rule-based speech synthesis generates speech according to phonetic and linguistic rules, taking as input phonological parameters, which control phonological information, and prosodic parameters, which control prosodic information. Although the amount of information it handles (externally input information and internally stored information) is very small, it can output synthesized speech for a large number of words and sentences.

Text-to-speech synthesis, on the other hand, is the ultimate form of speech synthesis in that it synthesizes speech from character strings alone. It is realized by automatically generating phonological and prosodic information from a character string and then performing rule-based synthesis from that information. Although it is currently the subject of active research, the quality of the synthesized speech is still low.

One conceivable approach is to divide text-to-speech synthesis into two parts: a part that automatically generates phonological and prosodic information from character strings and stores it as speech data on a storage medium or outputs it to a transmission line, and a part that retrieves the speech data from the storage medium or transmission line and outputs speech by rule-based synthesis. This makes the part that outputs the synthesized speech compact, that is, smaller, lighter, and cheaper (see Japanese Patent Application No. Sho 62-263298).

Turning to the synthesis of singing voices, the ultimate form would be a system that outputs a singing voice when a musical score is input. Compared with text-to-speech synthesis, a musical score contains not only the character string (the lyrics) but also other information such as pitch, which makes prosodic information easier to generate. A simpler configuration can therefore yield higher quality than text-to-speech synthesis (see Japanese Patent Application No. Sho 63-177314).

[Problem to Be Solved by the Invention] However, when a musical score is used as input as described above, pitch information must be entered in addition to the lyrics, for example a CV syllable string and the pitch of each CV syllable. If this score data were to be entered from a printed score with an optical reader, the technology for reading printed characters is nearly established, but the technology for reading musical notes is not. At least the pitch data must therefore be entered by hand from a keyboard, which makes input laborious. Furthermore, although a score carries more information than a character string alone, only pitch is uniquely determined by the notes. For length, each note gives only its overall duration, not how that duration is divided between consonant and vowel, so the split must be decided by rules. Intensity is not given as a concrete numerical value either, so it too must be expressed numerically by rules. At the current level of technology, only low-quality speech can be obtained this way.

The present invention has been made in view of the above. Its object is to provide a device for creating singing-voice synthesis data in which only the character string is entered from the musical score, simplifying data entry, and which produces, as the singing-voice synthesis data input to a rule-based speech synthesis circuit, speech data derived from a natural singing voice, so that higher-quality singing voices can be synthesized with accurate pitch.

[Means for Solving the Problems] The device of the present invention creates singing-voice synthesis data to be input to a speech synthesis circuit that performs rule-based synthesis of a singing voice from speech data consisting of phonological information and prosodic information. It takes a musical score and a singing voice as input and comprises: musical score information extraction means for extracting character string data, which is phonological information, from the input score; singing voice information extraction means for extracting pitch data, intensity data, and length data, which are prosodic information, from the input singing voice; and a pitch data correction section that corrects the pitch of the extracted pitch data to the pitch of the nearest musical scale note. The character string data and the corrected pitch data, intensity data, and length data are output as the speech data.
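As a worked formula, the correction can be written as follows. This is a sketch under assumptions the patent does not state: twelve-tone equal temperament with A4 = 440 Hz, a MIDI-style note number n, a measured pitch f_0, and "nearest" measured in log-frequency (semitones). With a table covering every semitone in range, the arg min reduces to rounding.

```latex
\begin{aligned}
f_n       &= 440 \cdot 2^{(n-69)/12}\ \text{Hz} \\
\hat{n}   &= \arg\min_{n} \left| \log_2 \frac{f_0}{f_n} \right|
           = \operatorname{round}\!\left( 69 + 12 \log_2 \frac{f_0}{440} \right) \\
\hat{f}_0 &= f_{\hat{n}}
\end{aligned}
```

For example, a measured f_0 of 258 Hz gives 69 + 12 log2(258/440) ≈ 59.76, which rounds to n = 60 (C4, about 261.63 Hz), a correction of roughly 24 cents upward.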

[Operation] With the configuration described above, in a device for creating singing-voice synthesis data to be input to a speech synthesis circuit that performs rule-based synthesis of a singing voice from speech data, only the character string data (the phonological information) is extracted from the musical score, so pitch data need not be typed in from a keyboard and data entry is easy. Because the prosodic information is extracted from an actual singing voice, the speech synthesis circuit can output a higher-quality singing voice. Moreover, because the pitch of the pitch data extracted from the singing voice is corrected against the pitches in a pitch-scale table, a singing voice with accurate pitch can be synthesized.

[Embodiment] FIG. 1 shows an embodiment of the present invention. A device Y for creating singing-voice synthesis data to be input to a speech synthesis circuit X, which performs rule-based synthesis of a singing voice from speech data consisting of phonological information and prosodic information, takes a musical score and a singing voice as input. It comprises musical score information extraction means for extracting character string data, which is phonological information, from the input score; singing voice information extraction means for extracting pitch data, intensity data, and length data, which are prosodic information, from the input singing voice; and a pitch data correction section 7 that corrects the pitch of the extracted pitch data to the pitch of the nearest musical scale note. The character string data and the corrected pitch data, intensity data, and length data are output as speech data.

In this embodiment, the musical score information extraction means consists of a musical score input section 1, an optical score reader that automatically reads the character string data of the score, and a phonological information extraction section 2. The singing voice information extraction means consists of a singing voice input section 3 such as a microphone; a feature extraction section 4 that extracts features of the input singing voice; a segmentation processing section 5 that divides the feature-extracted singing voice into individual phonemes; a prosodic information extraction section 6 that extracts the pitch data, intensity data, and length data of each segmented phoneme; and a pitch data correction section 7 that corrects the extracted pitch data on the basis of a pitch-scale table 7a. The character string data, intensity data, length data, and corrected pitch data are given a predetermined encoding in an encoding section 8 and output as the speech data input to the speech synthesis circuit X.

Thus, in this embodiment, a musical score and a singing voice are the inputs, and speech data (phonological information and prosodic information) for the rule-based speech synthesis circuit X is the output. The character string (phoneme sequence) data is extracted automatically from the score by the optical reader. Alternatively, a person may read the score, convert it into data, and enter it from a keyboard.
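The embodiment merges two sources into one stream of speech data: the character string read from the score and the prosody measured from the singing voice. A minimal Python sketch of that pairing follows. The record layout, the names SyllableRecord and assemble_speech_data, and the assumption that segmentation yields exactly one measured unit per score syllable are all hypothetical illustrations, not the patent's data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class SyllableRecord:
    """Hypothetical speech-data unit passed to synthesis circuit X."""
    text: str                  # phonological information read from the score
    pitch_hz: Optional[float]  # corrected pitch data (None if unvoiced)
    intensity_db: float        # intensity data measured from the singing voice
    length_ms: float           # length data measured from the singing voice


def assemble_speech_data(
    score_syllables: Sequence[str],
    prosody: Sequence[Tuple[Optional[float], float, float]],
) -> List[SyllableRecord]:
    """Pair each score syllable with (pitch_hz, intensity_db, length_ms)
    measured from the voice. Assumes one measured unit per syllable."""
    if len(score_syllables) != len(prosody):
        # zip() would silently drop the extras, so fail loudly instead.
        raise ValueError(
            f"{len(score_syllables)} score syllables but {len(prosody)} measured units"
        )
    return [
        SyllableRecord(text, pitch, intensity, length)
        for text, (pitch, intensity, length) in zip(score_syllables, prosody)
    ]
```

Raising on a count mismatch is a deliberate choice in this sketch: a mismatch usually means the segmentation and the score disagree, and silently truncating would misalign every following syllable.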

From the input singing voice, on the other hand, pitch data, intensity data, and length data are extracted. This requires dividing the singing voice into individual phonemes (segmentation), which is easy if the phoneme sequence entered from the musical score is used (Proceedings of the Acoustical Society of Japan, March 1988, 2-4-1, "Automatic Creation of a Phoneme Dictionary by Top-Down Phoneme Segmentation"). Based on the segmentation result, the pitch data and intensity data of each mora and the length data of each phoneme are extracted.
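Once segment boundaries are known, per-unit prosody can be measured from frame-level tracks. The Python sketch below assumes a fundamental-frequency (F0) track in Hz with 0 marking unvoiced frames, an RMS amplitude track on a 0 to 1 scale, a 10 ms frame period, and segments given as (phoneme, start_frame, end_frame) with an exclusive end. It simplifies the text by measuring all three quantities per segment, whereas the patent takes pitch and intensity per mora and length per phoneme. The function name and the use of the median are illustrative choices, not the patent's method.

```python
import math
from statistics import median


def extract_prosody(segments, f0_track, rms_track, frame_ms=10.0):
    """Return one dict per (phoneme, start, end) segment with its length (ms),
    median F0 over voiced frames (Hz), and mean RMS level in dB."""
    results = []
    for phoneme, start, end in segments:
        voiced = [f for f in f0_track[start:end] if f > 0]  # drop unvoiced frames
        rms = rms_track[start:end]
        mean_rms = sum(rms) / len(rms) if rms else 0.0
        results.append({
            "phoneme": phoneme,
            "length_ms": (end - start) * frame_ms,
            # Median is less sensitive to isolated F0 tracking errors than the mean.
            "f0_hz": median(voiced) if voiced else None,
            # RMS amplitude converted to dB relative to full scale (1.0).
            "intensity_db": 20.0 * math.log10(mean_rms) if mean_rms > 0 else float("-inf"),
        })
    return results
```

For an unvoiced consonant such as "k" the sketch reports no pitch, which matches the idea that pitch is a property of the voiced part of each mora.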

Next, the pitch data correction section 7 corrects the pitch of the pitch data obtained from the actual singing voice to the closest pitch in the pitch-scale table 7a shown in FIG. 2; that is, it replaces the measured pitch with the pitch of the scale note considered to correspond to the point at which the pitch was measured. Correcting the pitch data extracted from the singing voice in this way gives the same result as entering the pitch data from the musical score.
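The correction step can be sketched in Python as a lookup against a pitch-scale table. FIG. 2 is not reproduced in this text, so the table here is an assumption: equal-tempered notes from C3 to C6 with A4 = 440 Hz. "Closest" is taken to mean closest in log-frequency (semitones), which the patent does not specify, and the function names are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def build_pitch_scale_table(low_midi=48, high_midi=84, a4_hz=440.0):
    """Hypothetical stand-in for table 7a: (note_name, frequency_hz) pairs
    for MIDI notes low_midi..high_midi in equal temperament."""
    table = []
    for midi in range(low_midi, high_midi + 1):
        name = f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"  # MIDI 60 -> "C4"
        freq = a4_hz * 2.0 ** ((midi - 69) / 12.0)
        table.append((name, freq))
    return table


def correct_pitch(f0_hz, table):
    """Snap a measured F0 to the table entry closest in log-frequency."""
    if f0_hz <= 0:
        raise ValueError("F0 must be positive; skip unvoiced frames before correcting")
    return min(table, key=lambda entry: abs(math.log2(f0_hz / entry[1])))
```

For example, a measured 258 Hz snaps to C4 (about 261.63 Hz) and 450 Hz snaps to A4 (440 Hz). Measuring distance in Hz instead would slightly favour the lower neighbour near the midpoint between two notes, because the arithmetic midpoint of two scale pitches lies above their geometric midpoint.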

The data obtained as described above is encoded by the encoding section 8, stored on a storage medium or output to a transmission line, and used as speech data for the speech synthesis circuit X, which synthesizes a natural singing voice with accurate pitch. In this case only the character string data is read from the musical score, so pitch data need not be entered from a keyboard and data entry is easy. Because the prosodic information is extracted from an actual singing voice, a higher-quality singing voice can be output from the speech synthesis circuit, and because the pitch of the pitch data extracted from the singing voice is corrected against the pitches in the pitch-scale table, a singing voice with accurate pitch can be synthesized.
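The patent says only that the data receive "a predetermined encoding". Purely as a hypothetical illustration, the Python sketch below packs each unit into a compact byte record with the standard struct module: a length-prefixed UTF-8 syllable, the corrected pitch stored as a scale-note number, a signed intensity in whole dB, and the length in ms. Because corrected pitches fall on the scale, storing a note index rather than a frequency is one natural choice; the field layout and function names are assumptions, not the patent's format.

```python
import struct

FIELDS = "<BbH"  # little-endian: uint8 note number, int8 dB, uint16 ms (no padding)


def encode_records(records):
    """records: iterable of (syllable_text, note_number, intensity_db, length_ms)."""
    out = bytearray()
    for text, note, intensity_db, length_ms in records:
        raw = text.encode("utf-8")
        if len(raw) > 255:
            raise ValueError("syllable text too long for a 1-byte length prefix")
        out += struct.pack("<B", len(raw)) + raw
        out += struct.pack(FIELDS, note, round(intensity_db), round(length_ms))
    return bytes(out)


def decode_records(blob):
    """Inverse of encode_records, e.g. on the synthesis-circuit side."""
    records, pos = [], 0
    while pos < len(blob):
        (n,) = struct.unpack_from("<B", blob, pos)
        pos += 1
        text = blob[pos:pos + n].decode("utf-8")
        pos += n
        note, intensity_db, length_ms = struct.unpack_from(FIELDS, blob, pos)
        pos += struct.calcsize(FIELDS)
        records.append((text, note, intensity_db, length_ms))
    return records
```

Each record costs 5 bytes plus the syllable text, which reflects the motivation given earlier for keeping the synthesis side small and cheap.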

[Effects of the Invention] With the configuration described above, in a device for creating singing-voice synthesis data to be input to a speech synthesis circuit that performs rule-based synthesis of a singing voice from speech data, only the character string data (the phonological information) is extracted from the musical score, so pitch data need not be typed in from a keyboard and data entry is easy. Because the prosodic information is extracted from an actual singing voice, the speech synthesis circuit can output a higher-quality singing voice, and because the pitch of the pitch data extracted from the singing voice is corrected against the pitches in the pitch-scale table, a singing voice with accurate pitch can be synthesized.

[Brief Description of the Drawings]

FIG. 1 is a block circuit diagram showing the schematic configuration of an embodiment of the present invention, and FIG. 2 is an explanatory diagram of its operation. X is a speech synthesis circuit, Y is a singing-voice synthesis data creation device, 1 is a musical score input section, 2 is a phonological information extraction section, 3 is a singing voice input section, 4 is a feature extraction section, 5 is a segmentation processing section, 6 is a prosodic information extraction section, 7 is a pitch data correction section, 7a is a pitch-scale table, and 8 is an encoding section.

Claims

[Claims]

(1) A device for creating singing-voice synthesis data to be input to a speech synthesis circuit that performs rule-based synthesis of a singing voice from speech data consisting of phonological information and prosodic information, the device taking a musical score and a singing voice as input and comprising: musical score information extraction means for extracting character string data, as phonological information, from the input musical score; singing voice information extraction means for extracting prosodic information, namely pitch data, intensity data, and length data, from the input singing voice; and a pitch data correction section for correcting the pitch of the extracted pitch data to the pitch of the nearest musical scale note, wherein the character string data and the corrected pitch data, intensity data, and length data are output as speech data.