JPH037995A

JPH037995A - Generating device for singing voice synthetic data

Info

Publication number: JPH037995A
Application number: JP1142403A
Authority: JP
Inventors: Kanji Kunisawa; 国澤　寛治; Noboru Uechi; 上地　登; Akira Yamamura; 山村　彰; Junko Omukai; 大向　順子
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 1989-06-05
Filing date: 1989-06-05
Publication date: 1991-01-16

Abstract

PURPOSE: To improve the quality of the data by segmenting a singing voice using the character string, pitch, strength, and length data generated from a score as reference data, and then extracting pitch, strength, and length data. CONSTITUTION: A reference data generating means, consisting of a score input part 1 and a prosodic information generation part 2, generates phonological information and prosodic information from the score as reference voice data. The segmentation processing part 5 of a prosodic information extracting means then segments the input singing voice by referring to this reference voice data. From the segmentation result, the length data, and then the strength and pitch data, of the voice data corresponding to the input singing voice are obtained. The prosodic information obtained from the input singing voice and the phonological information obtained from the score are stored as voice data on a storage medium or output to a transmission line, and are used as the voice data of a speech synthesis circuit X.

Description

[Detailed Description of the Invention] [Industrial Application Field] The present invention relates to a device for creating singing voice synthesis data that is input to a speech synthesis circuit which performs rule-based synthesis of singing voice based on voice data consisting of phonological information and prosodic information.

[Prior Art] In general, rule-based speech synthesis takes as input phonological parameters, which control phonological information, and prosodic parameters, which control prosodic information, and generates speech according to phonetic and linguistic rules. Although the amount of information it processes (external input information and internally stored information) is very small, it can output synthesized speech for a large number of words and sentences.

Text-to-speech synthesis, on the other hand, is the ultimate form of speech synthesis in that it synthesizes speech from a character string alone. It is realized by automatically creating phonological information and prosodic information from the character string and performing rule-based synthesis on that information. Although it is currently the subject of active research, the quality of the synthesized speech is still low.

Text-to-speech synthesis can be divided into two parts: one that automatically creates phonological information and prosodic information from the character string and, treating them as voice data, stores them on a storage medium or outputs them to a transmission line; and one that retrieves the voice data from the storage medium or transmission line and outputs speech by rule-based synthesis. With this division, the part that outputs the synthesized speech becomes compact; that is, it can be made smaller, lighter, and less expensive (see Japanese Patent Application No. 62-263298).

Turning to the synthesis of singing voice, the ultimate form is a system that outputs a singing voice when a musical score is input. Compared with text-to-speech synthesis, a musical score contains not only the character string (lyrics) but also other information, so prosodic information is easier to generate, and higher quality can be obtained with a simpler configuration than text-to-speech synthesis (see Japanese Patent Application No. 63-177314).

In this case too, just as for text-to-speech synthesis, the system can be divided into two parts: one that automatically creates phonological information and prosodic information from the score and, treating them as voice data, stores them on a storage medium or outputs them to a transmission line; and one that retrieves the voice data from the storage medium or transmission line and outputs speech by rule-based synthesis.

A system that outputs a singing voice from an input musical score has been described as the ultimate form, but it is not necessarily the best form in every case.

That is, a professional singer sometimes deviates slightly from the score, and this can make the song more human and more artistic. It is therefore conceivable to create the voice data from an actual singing voice as input rather than from the score, by applying the "voice code creation method" of Japanese Patent Application No. 60-138517.

[Problems to be Solved by the Invention] To create voice data from input speech, pitch data, strength data, and length data are needed for each sound or syllable (see FIG. 3), so the input speech must first be segmented. Because the character string (phoneme string) is also input, this is segmentation with known phonemes, which is easier than the segmentation with unknown phonemes performed in speech recognition and similar tasks. Even so, at the current level of technology the segmentation is not always correct, and as a result low-quality voice data is created. In addition, no pitch extraction algorithm used to obtain the pitch data is 100% accurate; the extracted value is sometimes half the true value (a half-pitch error) or twice the true value (a double-pitch error), and this too results in low-quality voice data.

The present invention has been made in view of the above points. Its object is to provide a device for creating singing voice synthesis data that segments the actual singing voice with high accuracy and, as a result, creates voice data that lets a rule-based singing voice synthesizer output a singing voice of higher quality.

[Means for Solving the Problems] The device of the present invention for creating singing voice synthesis data, which is input to a speech synthesis circuit that performs rule-based synthesis of singing voice based on voice data consisting of phonological information and prosodic information, comprises: reference data creation means that creates, from an input musical score, a character string and, by rule, pitch, strength, and length data; and prosodic information extraction means that segments the input singing voice using these data as reference data and extracts pitch, strength, and length data. The data extracted by the prosodic information extraction means are output as voice data.

[Function] The present invention is configured as described above. In a device for creating singing voice synthesis data that is input to a speech synthesis circuit which performs rule-based synthesis of singing voice based on voice data consisting of phonological information and prosodic information, a character string and, by rule, pitch, strength, and length data are created from an input musical score; the input singing voice is segmented using these data as reference data, and pitch, strength, and length data are extracted; and the extracted data are output as voice data. Because the strength data and length data are extracted from an actual singing voice, voice data corresponding to a natural singing voice is obtained as the singing voice synthesis data input to the rule-based speech synthesis circuit. Moreover, because the singing voice is segmented, and the voice data extracted, with reference to the voice data created from the score, voice data from which a high-quality singing voice can be synthesized is obtained inexpensively with a simple circuit configuration.

[Embodiment] FIG. 1 shows one embodiment of the present invention. A device for creating singing voice synthesis data, which is input to a speech synthesis circuit that performs rule-based synthesis of singing voice based on voice data consisting of phonological information and prosodic information, is provided with reference data creation means that creates, from an input musical score, a character string and, by rule, pitch, strength, and length data, and with prosodic information extraction means that segments the input singing voice using these data as reference data and extracts pitch, strength, and length data. The data extracted by the prosodic information extraction means are output as voice data. Here, the reference data creation means is composed of a score input section 1, consisting of a key input device or an optical score reader through which the character string data and pitch data of the score are input, and a prosodic information creation section 2 that creates prosodic information by rule from the input score data.
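The rule-based creation of reference pitch and length data from the score can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the rules used by the prosodic information creation section 2: the score format (syllable, MIDI note number, beats), the `ReferenceNote` record, and the function names are all hypothetical, and equal temperament with A4 = 440 Hz is assumed for the pitch rule.

```python
from dataclasses import dataclass


@dataclass
class ReferenceNote:
    syllable: str    # character-string (lyric) data for this note
    pitch_hz: float  # reference pitch derived from the written note
    length_s: float  # reference length derived from note value and tempo


def midi_to_hz(midi_note: int) -> float:
    """Equal-tempered pitch, assuming A4 (MIDI note 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def make_reference(score, tempo_bpm: float):
    """Build reference data from a list of (syllable, midi_note, beats) tuples.

    The tuple format is a hypothetical stand-in for the output of the
    score input section.
    """
    seconds_per_beat = 60.0 / tempo_bpm
    return [
        ReferenceNote(syllable, midi_to_hz(note), beats * seconds_per_beat)
        for syllable, note, beats in score
    ]


reference = make_reference(
    [("sa", 69, 1.0), ("ku", 71, 1.0), ("ra", 72, 2.0)], tempo_bpm=120
)
```

No strength field is included in the sketch, matching the statement in this embodiment that strength data is not needed in the reference data.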

The prosodic information extraction means, in turn, is composed of a singing voice input section 3 such as a microphone, a feature extraction section 4 that extracts features of the input singing voice, a segmentation processing section 5 that divides the feature-extracted singing voice into phonemes by referring to the data created by the reference data creation means, and a prosodic information extraction section 6 that extracts the pitch, strength, and length data of each divided phoneme.

The character string data and the pitch, strength, and length data are output as voice data via an encoding section 7 that encodes the data.

The reference data creation means first creates phonological information and prosodic information from the score as voice data. In this embodiment, strength data is not needed for the prosodic information, and only pitch and length data are created as the reference voice data. Next, the segmentation processing section 5 of the prosodic information extraction means segments the input singing voice by referring to the reference voice data. This processing is performed, for example, as follows.

First, the boundary point of each phoneme is calculated from the length data in the created reference voice data. The boundary points are then stretched or compressed so that the resulting total length matches the length of the input singing voice, and each boundary point is moved accordingly. The neighborhood of each boundary point obtained in this way is taken as the range in which the corresponding boundary point of the input singing voice lies (see FIG. 2), and the boundary point is determined within that range by considering the properties of the phonemes before and after it. FIG. 2(a) shows the phoneme boundary points given by the voice data created from the score, FIG. 2(b) shows the boundary points after linear stretching to match the length of the input singing voice, FIG. 2(c) shows the energy, and FIG. 2(d) shows the formant frequencies.
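The boundary procedure described above can be sketched in Python as follows. This is a minimal illustration under assumptions: the function names, the fixed `margin` that defines the "neighborhood", and the frame-based energy list are hypothetical, and the decision rule in `pick_boundary` (largest frame-to-frame energy change) is a simplified stand-in for the consideration of neighboring phoneme properties, which the specification does not spell out.

```python
def reference_boundaries(lengths):
    """Cumulative end time (seconds) of each phoneme from reference lengths."""
    boundaries, t = [], 0.0
    for length in lengths:
        t += length
        boundaries.append(t)
    return boundaries


def stretch_boundaries(boundaries, input_length):
    """Linearly scale boundaries so the last one equals the input length."""
    scale = input_length / boundaries[-1]
    return [b * scale for b in boundaries]


def search_windows(boundaries, margin, input_length):
    """Range around each interior boundary in which the real boundary is sought."""
    return [
        (max(0.0, b - margin), min(input_length, b + margin))
        for b in boundaries[:-1]  # the final boundary is the end of the input
    ]


def pick_boundary(energy, frame_s, window):
    """Choose the frame inside the window with the largest energy change.

    Assumed criterion for illustration only; energy[i] is the energy of the
    frame starting at time i * frame_s.
    """
    lo = max(1, round(window[0] / frame_s))
    hi = min(len(energy) - 1, round(window[1] / frame_s))
    best = max(range(lo, hi + 1), key=lambda i: abs(energy[i] - energy[i - 1]))
    return best * frame_s
```

Restricting the decision to a window around the stretched reference boundary is what keeps the search from drifting to an unrelated energy change elsewhere in the recording.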

Next, from the segmentation result obtained in this way, the length data (for example, the length of each vowel and consonant) and the strength data (for example, the power at the representative point of each mora) of the voice data corresponding to the input singing voice are obtained.
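As a rough illustration of this step, the Python sketch below derives segment lengths from the decided boundary points and measures power in one frame per segment. The choice of the segment midpoint as the "representative point", the 256-sample frame, and the function names are assumptions made for the example.

```python
def segment_lengths(boundaries):
    """Length of each segment, given segment end times; the first starts at 0."""
    starts = [0.0] + boundaries[:-1]
    return [end - start for start, end in zip(starts, boundaries)]


def power_at_midpoint(samples, sample_rate, start_s, end_s, frame_len=256):
    """Mean squared amplitude of one frame centred on the segment midpoint.

    Using the midpoint as the representative point is an assumption.
    """
    centre = int((start_s + end_s) / 2 * sample_rate)
    lo = max(0, centre - frame_len // 2)
    frame = samples[lo:lo + frame_len]
    if not frame:
        return 0.0  # segment lies beyond the end of the recording
    return sum(x * x for x in frame) / len(frame)
```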

Next, the pitch data is obtained. If, for example, this data is the pitch at the representative point of each mora, then when the pitch at that point is obtained, the pitch search range is set to the neighborhood of the value given by the voice data derived from the score, and the pitch is sought within that range. Because the search range is narrow, accuracy is high (half-pitch and double-pitch errors are eliminated) and computation is faster.
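The effect of narrowing the search range can be illustrated with the Python sketch below. The specification does not name a pitch extraction algorithm; plain autocorrelation is used here as an assumed example, and the function name and the 20% `tolerance` are hypothetical. With that tolerance the candidate lags cover only 0.8 to 1.2 times the score pitch, so lags corresponding to half or double the true pitch are never examined.

```python
import math


def pitch_near_reference(frame, sample_rate, ref_hz, tolerance=0.2):
    """Autocorrelation pitch estimate restricted to ref_hz * (1 +/- tolerance).

    Assumes the frame is longer than the largest lag in the search range.
    """
    min_lag = int(sample_rate / (ref_hz * (1 + tolerance)))
    max_lag = math.ceil(sample_rate / (ref_hz * (1 - tolerance)))
    max_lag = min(max_lag, len(frame) - 1)

    def autocorr(lag):
        return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

    # Only lags near the score-derived period are candidates.
    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag
```

The narrow lag range also means fewer autocorrelation sums are computed per frame, which corresponds to the faster computation noted above.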

The prosodic information obtained in this way from the input singing voice and the phonological information obtained from the score are treated as voice data, which is stored on a storage medium or output to a transmission line and used as the voice data of the speech synthesis circuit X.

By inputting voice data created in this way from an actual singing voice to the speech synthesis circuit X, a natural singing voice is obtained.

In addition, because the singing voice is segmented, and the voice data extracted, with reference to the voice data created from the score, voice data from which a high-quality singing voice can be synthesized is obtained inexpensively with a simple circuit configuration.

[Effects of the Invention] The present invention is configured as described above. In a device for creating singing voice synthesis data that is input to a speech synthesis circuit which performs rule-based synthesis of singing voice based on voice data consisting of phonological information and prosodic information, a character string and, by rule, pitch, strength, and length data are created from an input musical score; the input singing voice is segmented using these data as reference data, and pitch, strength, and length data are extracted; and the extracted data are output as voice data. Because the strength data and length data are extracted from an actual singing voice, voice data corresponding to a natural singing voice is obtained as the singing voice synthesis data input to the rule-based speech synthesis circuit. Moreover, because the singing voice is segmented, and the voice data extracted, with reference to the voice data created from the score, the invention has the effect that voice data from which a high-quality singing voice can be synthesized is obtained inexpensively with a simple circuit configuration.

[Brief Description of the Drawings]

FIG. 1 shows an example of the device for creating voice data of singing voice according to the present invention, and FIGS. 2 and 3 are diagrams explaining its operation. X is a speech synthesis circuit, Y is a singing voice synthesis data creation device, 1 is a score input section, 2 is a prosodic information creation section, 3 is a singing voice input section, 4 is a feature extraction section, 5 is a segmentation processing section, 6 is a prosodic information extraction section, and 7 is an encoding section.

Claims

[Claims]

(1) A device for creating singing voice synthesis data that is input to a speech synthesis circuit which performs rule-based synthesis of singing voice based on voice data consisting of phonological information and prosodic information, the device comprising: reference data creation means for creating, from an input musical score, a character string and, by rule, pitch, strength, and length data; and prosodic information extraction means for segmenting an input singing voice using these data as reference data and extracting pitch, strength, and length data, wherein the data extracted by the prosodic information extraction means are output as voice data.