JP4622356B2

JP4622356B2 - Script generator for speech synthesis and script generation program for speech synthesis

Info

Publication number: JP4622356B2
Application number: JP2004209635A
Authority: JP
Inventors: 毅彦川▲原▼
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2004-07-16
Filing date: 2004-07-16
Publication date: 2011-02-02
Anticipated expiration: 2024-07-16
Also published as: JP2006030610A

Description

The present invention relates to a speech synthesis script generation device and a speech synthesis script generation program that generate a script for speech synthesis.

Speech synthesizers are known that take as input a speech synthesis script containing a character string corresponding to the speech to be synthesized, together with symbols representing prosody such as pitch, length, and intensity, and that synthesize speech according to rules.
For example, Patent Document 1 describes a device for creating singing voice synthesis data. The device creates reference data consisting of a character string and pitch, intensity, and length data from a musical score, uses the reference data to segment an actual singing voice, and generates data for speech synthesis. In that device, the pitch data, which is one item of the reference data, is taken to be the pitch at a representative point of each mora.
Patent Document 1: JP-A-3-7995

As described above, there is prior art that analyzes actual speech data to create speech synthesis data in order to achieve natural-sounding synthesis. However, synthesis based on the generated data offers no freedom in specifying the pitch of each mora, so the synthesized speech lacked naturalness.

For example, when pitch is extracted from an actual singing voice, even a note that seems to be sung at a constant pitch shows a pitch transition from the preceding note, for instance at its onset.
FIG. 10 is a diagram showing an example of pitch change in a singing voice.
Suppose such a song is to be sung by speech synthesis, and the pitch (note) of this segment is determined with reference to the pitch S1 at the timing indicated by (1) in FIG. 10. Even if the subsequent pitch changes are described with pitch accent symbols that instruct the pitch to be raised or lowered, the result sounds out of tune whenever the step size of those symbols does not land on the note at which the pitch becomes nearly constant.
When making a synthesizer sing, this unnaturalness can be eliminated as follows. Within the pitch change of each character unit described in the text (syllable, mora, etc.), the note is output for the portion where the character is actually sung at its intended pitch, such as a constant pitch section or a vibrato section, and the pitch differences from that note are expressed with symbols that raise or lower the pitch. In the case of FIG. 10, good results are obtained by using the note of the pitch (S2) around timing (2) as the reference for this segment.

Accordingly, an object of the present invention is to provide a speech synthesis script generation device and a speech synthesis script generation program capable of generating a speech synthesis script that enables high-quality speech synthesis with a small amount of data.

To achieve the above object, the speech synthesis script generation device and the speech synthesis script generation program of the present invention are mainly characterized in that they examine the pitch transition information of the speech to be synthesized, detect a reference pitch section (a section in which the pitch is as constant as possible, or a section in which vibrato is applied), determine a reference pitch such as the average pitch in the reference pitch section, and generate a speech synthesis script that expresses the pitch transition information as changes relative to the reference pitch.

According to the speech synthesis script generation device or the speech synthesis script generation program of the present invention, a speech synthesis script that enables high-quality speech synthesis with a small amount of data can be generated.
In particular, because the pitch transition information is expressed as changes relative to the reference pitch of the reference pitch section, the invention is useful for singing synthesis, where pitch changes carry significant meaning.

FIG. 1 is a functional block diagram showing the configuration of an embodiment of the speech synthesis script generation device of the present invention.
In this figure, 1 is input data; 2 is a constant pitch section detection unit that detects a constant pitch section in which the pitch of the input data stays within a certain threshold; 3 is a vibrato section detection unit that detects a vibrato section in which the pitch oscillates; 4 is a scale extraction unit that, based on the outputs of the constant pitch section detection unit 2 and the vibrato section detection unit 3, outputs the scale information (note name) nearest to the reference pitch of the input data; and 5 is a synthesis script creation unit that creates a speech synthesis script based on the input data and the output of the scale extraction unit 4.
The speech synthesis script in the present invention is assumed to work as follows. The utterance length can be changed by, for example, the number of long-vowel marks "-". Writing a pitch accent symbol for each long-vowel mark applies the corresponding pitch change at the timing of that mark. The magnitude of the change can be set by writing a number immediately after the pitch accent symbol.
Each of these components can be implemented as a separate processing unit, or by program processing on a computer.

The input data 1 is obtained by analyzing an actual utterance or by direct entry. It consists of the character string to be synthesized (mora information), segment information giving the timing at which each mora is uttered (utterance length information, that is, the utterance start time and utterance end time of each phoneme section), pitch transition information (pitch sequence), and volume transition information (volume sequence). The volume transition information may be omitted.
FIG. 2 is a diagram showing an example of a configuration for obtaining such input data 1 by speech analysis. This configuration can also be implemented as a computer program.
In this figure, 11 is a segmentation unit. It receives an input phoneme symbol string (kana, alphabet, etc.) transcribing the utterance together with the speech waveform, performs segmentation to find the section (segment position) of the waveform in which each mora is uttered, and outputs the character string (kana, alphabet, etc.) and the segment positions. If a speech recognition unit is used instead of the segmentation unit, the character string and segment positions can be obtained without supplying the input phoneme symbol string.
12 is a pitch detection unit that determines the pitch of the input speech waveform using a method such as zero-crossing interval measurement and outputs a pitch sequence representing the pitch change over time.
13 is a volume detection unit that outputs a volume sequence representing the change in volume over time, using a method such as computing the average energy over fixed-length sections of the input speech waveform.
Analyzing an actual speech waveform in this way yields the input data 1.
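As an illustration only, the pitch and volume detection named above might be sketched in Python as follows. The function names, the 8 kHz sample rate, the test tone, and the choice to average the spacing of upward zero crossings are all assumptions made for this sketch; the description only names zero-crossing interval measurement and average energy as example methods.

```python
import math

def zero_cross_pitch(samples, sample_rate):
    """Estimate pitch in Hz from the mean spacing of upward zero crossings."""
    crossings = [n for n in range(1, len(samples))
                 if samples[n - 1] < 0.0 <= samples[n]]
    if len(crossings) < 2:
        return None  # too few crossings to measure an interval
    mean_period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate / mean_period

def mean_energy(samples):
    """Average energy (mean of squared samples) over one analysis section."""
    return sum(x * x for x in samples) / len(samples)

# Hypothetical test tone: a 200 Hz sine, 0.1 s long, sampled at 8 kHz.
RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / RATE + 0.3) for n in range(800)]
pitch_hz = zero_cross_pitch(tone, RATE)
energy = mean_energy(tone)
```

Applying both functions to successive short frames of a recording would give a pitch sequence and a volume sequence of the kind the input data 1 contains.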

FIG. 3 is a diagram showing an example of the input data 1.
Here, (b) shows an example of the input data obtained when "asa" (あさ) is pronounced with the pitch change and utterance length shown in (a).
As shown in this figure, the segment information for each character (mora), namely its utterance start position and utterance end position, is paired with that character's sequence of pitch values per unit time (for example, one pitch value every 10 msec).
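One possible in-memory layout for this input data is sketched below. The class name, field names, and all numeric values are hypothetical and are not taken from FIG. 3; only the pairing of segment positions with a per-frame pitch sequence follows the description.

```python
from dataclasses import dataclass
from typing import List

FRAME_MS = 10  # the 10 msec frame spacing mentioned as an example

@dataclass
class MoraSegment:
    mora: str              # the character (mora) being uttered
    start_ms: int          # utterance start position
    end_ms: int            # utterance end position
    pitch_hz: List[float]  # one pitch value per frame

    def frame_count(self) -> int:
        return len(self.pitch_hz)

# Made-up values for the two moras of "asa".
asa = [
    MoraSegment("あ", 0, 50, [220.0, 222.0, 225.0, 226.0, 226.0]),
    MoraSegment("さ", 50, 90, [226.0, 224.0, 220.0, 218.0]),
]
```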

Such input data 1 is input to a scale extraction processing section consisting of the constant pitch section detection unit 2, the vibrato section detection unit 3, and the scale extraction unit 4. From the pitch information of each syllable (mora), this processing section detects a constant pitch section, in which the pitch stays within a certain threshold, or a vibrato section, in which the pitch repeatedly crosses the same pitch. The longer of the constant pitch section and the vibrato section is taken as the reference pitch section, and the note name nearest to the average pitch in that reference pitch section is output as scale information. When neither a constant pitch section nor a vibrato section is detected, the note name nearest to the average of the pitch information is output as scale information.

The constant pitch section and the vibrato section will be described with reference to FIG. 4.
FIG. 4(a) is a diagram showing a constant pitch section. As illustrated, a section in which the pitch variation stays within a predetermined threshold (the constant section determination threshold) is a constant pitch section.
As shown in FIG. 4(b), a section in which the pitch repeatedly crosses the same pitch (a section in which the pitch oscillates) is a vibrato section.
From the pitch sequence in the input data, the constant pitch section detection unit 2 detects the constant pitch section shown in FIG. 4(a), and the vibrato section detection unit 3 detects the vibrato section shown in FIG. 4(b).

FIG. 5 is a flowchart showing the flow of the constant pitch section detection process executed in the constant pitch section detection unit 2.
First, the maximum constant section length (MaxLeglen) and the maximum constant section start position (MaxLegStartPos) are set to the initial value "0" (step S1). The constant section determination threshold (Δ) and the minimum length accepted as a constant section (MinimumLegLen) are each set to predetermined values, and the segment section length (SegmentLen), which is the time length of the section from the utterance start position to the utterance end position of each mora, is obtained from the input data 1.
Next, the search start point (StartPos) is set to "0" (step S2).
It is then determined whether the search start point plus the maximum constant section length (StartPos+MaxLeglen) is smaller than the segment section length (SegmentLen) (step S3). If it is smaller, the search point i is set to the search start point + 1 (StartPos+1) (step S4), and it is determined whether the difference between the pitch at the search point i (Pitch[i]) and the pitch at the search start point (Pitch[StartPos]) is greater than Δ (step S5). If the difference is greater than Δ, it is determined whether the distance from the search start point to i (i-StartPos) is greater than the maximum constant section length (MaxLeglen) (step S6). If it is greater, (i-StartPos) becomes the new maximum constant section length (MaxLeglen) and the search start point (StartPos) becomes the maximum constant section start position (MaxLegStartPos) (step S7). If (i-StartPos) is less than or equal to the maximum constant section length (MaxLeglen), nothing is changed. In either case, the search start point (StartPos) is then updated to (StartPos+1) (step S8), and the process returns to step S3.

On the other hand, if in step S5 the difference between the pitch at the search point i (Pitch[i]) and the pitch at the search start point (Pitch[StartPos]) is less than or equal to Δ, i is updated to (i+1) to advance the search point (step S9). If i is less than or equal to the segment section length (SegmentLen) (NO at step S10), the process returns to step S5.
If i exceeds the segment section length (SegmentLen) (YES at step S10), it is determined whether the distance between i and the search start point (i-StartPos) is greater than the maximum constant section length (MaxLeglen) (step S11). If it is greater, (i-StartPos) becomes the new maximum constant section length (MaxLeglen) and the search start point (StartPos) becomes the maximum constant section start position (MaxLegStartPos) (step S12); otherwise nothing is changed. The process then proceeds to step S13, where it is determined whether the maximum constant section length (MaxLeglen) exceeds the minimum length accepted as a constant section (MinimumLegLen). If the maximum constant section length (MaxLeglen) exceeds the minimum length (MinimumLegLen), the section from the maximum constant section start position (MaxLegStartPos) to (MaxLegStartPos+MaxLeglen) is determined to be the constant section (step S14). If it is less than or equal to the minimum length (MinimumLegLen), the result is "no constant section" (step S15).
In this way, the constant pitch section detection unit 2 determines the longest of the sections in which the pitch variation is less than or equal to the predetermined threshold (Δ) to be the constant pitch section.
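The search in steps S1 to S15 can be sketched in Python roughly as follows. This is a minimal reading of the flowchart text, not the patented implementation. The function name and return convention are assumptions, the variable names are shortened versions of MaxLeglen, MaxLegStartPos, and StartPos, and the loop bound `i < n` is an assumption that keeps the index inside a zero-based list (the text compares i against SegmentLen with "less than or equal").

```python
def find_constant_pitch_section(pitch, delta, min_len):
    """Return (start, length) of the longest run whose samples stay within
    delta of the run's first sample, or None if no run exceeds min_len."""
    n = len(pitch)                      # SegmentLen
    best_len, best_start = 0, 0         # S1
    start = 0                           # S2
    while start + best_len < n:         # S3
        i = start + 1                   # S4
        # S5, S9, S10: extend the run while the pitch stays within delta.
        while i < n and abs(pitch[i] - pitch[start]) <= delta:
            i += 1
        if i - start > best_len:        # S6/S7 and S11/S12
            best_len, best_start = i - start, start
        if i >= n:
            break                       # the run reached the segment end
        start += 1                      # S8
    if best_len > min_len:              # S13
        return best_start, best_len     # S14
    return None                         # S15: no constant section

# Hypothetical pitch values in Hz.
example = [100, 110, 120, 200, 201, 199, 200, 202, 150]
section = find_constant_pitch_section(example, delta=5, min_len=3)
```

Note that each run is compared against the pitch at its own start point, as in step S5, rather than against a running mean.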

FIG. 6 is a flowchart showing the flow of the vibrato section detection process executed in the vibrato section detection unit 3.
First, the maximum vibrato section length (MaxVibLen) and the maximum vibrato section start position (MaxVibStartPos) are both set to the initial value "0" (step S21), and the search start point (StartPos) is set to "0" (step S22). The minimum length accepted as a vibrato section (MinimumVibLen) is set to a predetermined value in advance, and the segment section length (SegmentLen) is obtained from the input data 1.
Next, it is determined whether the search start point plus the maximum vibrato section length (StartPos+MaxViblen) is smaller than the segment section length (SegmentLen) (step S23). If it is smaller, the search point i is set to the search start point + 1 (StartPos+1) (step S24), and the process proceeds to the cross point determination in step S25.

In step S25, the search point i is determined to be a cross point in either of two cases: the pitch at the search point i (Pitch[i]) is greater than or equal to the pitch at the search start point (Pitch[StartPos]) and the pitch at the next search point (Pitch[i+1]) is less than or equal to Pitch[StartPos]; or, conversely, Pitch[i] is less than or equal to Pitch[StartPos] and Pitch[i+1] is greater than or equal to Pitch[StartPos].
If the search point i is a cross point, it is determined whether the length from the search start point (StartPos) to the search point i (i-StartPos) is greater than the maximum vibrato section length (MaxViblen) (step S26). If it is greater, (i-StartPos) becomes the new maximum vibrato section length (MaxViblen) and the search start point (StartPos) becomes the maximum vibrato section start position (MaxVibStartPos) (step S27). Then i is set to i+1 to advance the search point (step S28).
If the point is determined not to be a cross point in step S25, or if the length is determined to be less than or equal to the maximum vibrato section length in step S26, the search point is simply advanced (step S28).

The processing from step S25 (cross point determination) onward is repeated until the search point position i exceeds the segment section length (SegmentLen) (step S29). When i exceeds the segment section length, the search start point (StartPos) is advanced by 1 (step S30) and the process returns to step S23. As long as the sum of the search start point and the maximum vibrato section length (StartPos+MaxViblen) is smaller than the segment section length, the processing from step S24 onward is repeated.
When the sum of the search start point and the maximum vibrato section length becomes greater than or equal to the segment section length (YES at step S23), the process proceeds to step S31, where it is determined whether the resulting maximum vibrato section length (MaxViblen) is greater than the minimum length accepted as a vibrato section (MinimumVibLen). If it is greater, the section from the maximum vibrato section start position (MaxVibStartPos) to the maximum vibrato section start position plus the maximum vibrato section length (MaxVibStartPos+MaxViblen) is determined to be the vibrato section (step S32); otherwise it is determined that there is no vibrato section (step S33).
In this way, the vibrato section detection unit 3 detects the vibrato section.
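A rough Python sketch of steps S21 to S33 follows. It is one reading of the flowchart text under stated assumptions: the function name and return convention are invented for the sketch, and the inner loop stops at `n - 1` so that `pitch[i + 1]` stays inside the list, a bound the text does not spell out.

```python
def find_vibrato_section(pitch, min_len):
    """Return (start, length) of the longest span that ends at a point where
    the pitch crosses the pitch of the span's first sample, or None."""
    n = len(pitch)                      # SegmentLen
    best_len, best_start = 0, 0         # S21
    start = 0                           # S22
    while start + best_len < n:         # S23
        ref = pitch[start]
        for i in range(start + 1, n - 1):          # S24, S28, S29
            # S25: the pitch passes through ref between samples i and i+1.
            crosses = (pitch[i] >= ref >= pitch[i + 1]
                       or pitch[i] <= ref <= pitch[i + 1])
            if crosses and i - start > best_len:   # S26
                best_len, best_start = i - start, start  # S27
        start += 1                      # S30
    if best_len > min_len:              # S31
        return best_start, best_len     # S32
    return None                         # S33: no vibrato section

# Hypothetical pitch values in Hz oscillating around 100.
wobble = [100, 104, 98, 103, 97, 104, 99, 102]
section = find_vibrato_section(wobble, min_len=3)
```

Because a single crossing already yields a candidate span, the MinimumVibLen check is what separates a real oscillation from an isolated pitch wobble.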

As described above, the scale extraction unit 4 takes the longer of the constant pitch section detected by the constant pitch section detection unit 2 and the vibrato section detected by the vibrato section detection unit 3 as the reference pitch section, takes the average pitch in that reference pitch section as the reference pitch, and outputs the note name whose frequency is closest to the reference pitch as scale information. When neither a constant pitch section nor a vibrato section is detected, it outputs the note name whose frequency is closest to the average pitch of the segment section.
The reference pitch is not limited to the average pitch of the reference pitch section; the lowest pitch, the highest pitch, or the like in the reference pitch section may be used as the reference pitch instead.
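The selection of the reference pitch and its mapping to a note name might look like the sketch below. Several points are assumptions not stated in the description: twelve-tone equal temperament with A4 = 440 Hz, "closest" measured in semitones rather than in Hz, sharps-only note spelling, and ties between the two sections resolved in favor of the constant pitch section. The function names are likewise hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def reference_pitch(pitch, constant, vibrato):
    """Average pitch over the longer detected section.

    constant and vibrato are (start, length) tuples or None. If both are
    None, the average over the whole segment is used instead.
    """
    candidates = [c for c in (constant, vibrato) if c is not None]
    if candidates:
        start, length = max(candidates, key=lambda c: c[1])
        section = pitch[start:start + length]
    else:
        section = pitch
    return sum(section) / len(section)

def nearest_note_name(freq_hz, a4_hz=440.0):
    """Nearest equal-tempered note name, for example 'D#4'."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

ref = reference_pitch([250.0, 311.0, 311.0, 311.0], (1, 3), None)
name = nearest_note_name(ref)
```

Swapping the average for `min(section)` or `max(section)` would give the lowest-pitch and highest-pitch variants that the description allows.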

The synthesis script creation unit 5 creates a speech synthesis script suited to the target speech synthesis system, based on the character (mora) string, segment position, pitch sequence, and volume sequence data supplied as the input data 1 and on the scale information extracted by the scale extraction unit 4.
As described above, the speech synthesis script in the present invention contains scale information representing the pitch of the sound, symbols representing the utterance length, and pitch accent symbols representing raising and lowering of the pitch.
The symbol representing the utterance length is, for example, "-" (the long-vowel mark), and each such symbol represents one predetermined unit of utterance length.
FIG. 7 is a diagram showing examples of pitch accent symbols. In this figure, (a) is the symbol "'" (apostrophe), which represents a pitch rise; (b) is the symbol "_" (underscore), which represents a pitch fall; and (c) is the symbol "$" (dollar sign), which represents a downward pitch shift.
In the example shown in FIG. 7, "'" indicates that the pitch rises by a predetermined value (P0) over a predetermined time unit T0 (for example, the same time unit as the utterance length unit of "-"), "_" indicates that the pitch falls by the predetermined value (P0), and "$" indicates that the pitch shifts downward by the predetermined value (P0). This change level (P0) is an amount of change relative to the reference pitch.

A pitch accent symbol/pitch conversion table that maps each pitch accent symbol to its corresponding relative pitch change can be stored, and the form of the relative pitch change produced by each symbol can then be selected by referring to this table. For example, multiple data sets corresponding to various pitch changes can be stored and selected for use, such as sets that differ in the magnitude of the pitch change level (P0) per time unit T0 in FIG. 7(a) to (c), or sets in which the pitch changes along a curve rather than linearly as in FIG. 7(a) and (b).

FIG. 8 is a diagram showing an example of a speech synthesis script created by the synthesis script creation unit 5. In this figure, (a) shows an example of a pitch sequence and (b) shows the speech synthesis script corresponding to that pitch sequence.
When the character (mora) "a" (あ) is input with the pitch sequence shown by the solid line in FIG. 8(a), the latter half of the pitch sequence is regarded as a vibrato section, and the scale information is determined to be "D#4" from the average pitch of that section.
Once the pitch accent symbol/pitch conversion table to be used in the synthesis script creation unit 5 is chosen, the utterance length of one long-vowel mark (T0) and the pitch change level of one accent symbol (P0) are determined.
First, the difference between the initial pitch value of this sounding unit (at t0), shown by the solid line in FIG. 8(a), and the scale information (D#4 in this case) is found in units of the pitch change level (-2×P0 in this case), and the accent symbol "$2", which shifts the pitch downward from the scale information by 2×P0, is chosen. Next, the following process is repeated: the value, in units of the pitch change amount (P0), that is closest to the input pitch after one utterance length time (T0) has elapsed is determined, and the corresponding pitch accent symbol is chosen. In FIG. 8(a), the pitch rises by 2×P0 between times t1 and t2, so the accent symbol "'2" and the long-vowel mark "-" are chosen to represent it. The pitch change from time t2 to t3 is a rise of P0, so "'-" is chosen, and the pitch falls by 2×P0 from time t3 to t4, so the accent symbol "_2-" is chosen. The speech synthesis script shown in FIG. 8(b) is created by continuing in the same way. The scale information (D#4) is written at the head of the script.
In this way, the pitch change shown by the solid line in FIG. 8(a) can be represented as shown by the broken line.
If a volume sequence (volume transition information) is also input, the volume transitions can likewise be expressed with symbols, using symbols that represent raising and lowering of the volume (for example, ">" and "<") in the same way as for the pitch sequence described above.
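The quantization just described can be sketched as a small Python encoder. This is only an illustration under assumptions: the token layout (symbol, optional count, then "-"), the function names, and the sample values are hypothetical, the mora character itself is left out because its position in the script is not specified here, and an initial pitch above the reference raises an error because the description defines only a downward shift symbol.

```python
def accent_token(symbol, steps, hold=True):
    """Build one token: symbol, a count if above 1, and '-' for one unit T0."""
    count = "" if steps == 1 else str(steps)
    return f"{symbol}{count}{'-' if hold else ''}"

def encode_pitch_track(note_name, reference, samples, p0):
    """Encode pitch samples taken at t0, t1, ... (spaced T0 apart) as tokens.

    reference is the reference pitch and p0 the change level of one symbol,
    both in the same unit as samples.
    """
    levels = [round((s - reference) / p0) for s in samples]
    tokens = [note_name]                      # scale information comes first
    if levels[0] > 0:
        raise ValueError("no upward shift symbol is defined in this sketch")
    if levels[0] < 0:
        tokens.append(accent_token("$", -levels[0], hold=False))
    for prev, cur in zip(levels, levels[1:]):
        step = cur - prev
        if step > 0:
            tokens.append(accent_token("'", step))
        elif step < 0:
            tokens.append(accent_token("_", -step))
        else:
            tokens.append("-")                # pitch held for one unit
    return tokens

# Made-up samples: start 2 steps low, hold, rise 2, rise 1, fall 2.
tokens = encode_pitch_track("D#4", 300, [280, 280, 300, 310, 290], 10)
```

With these made-up samples the encoder emits a "$2" shift followed by "-", "'2-", "'-", and "_2-", mirroring the sequence of decisions walked through for FIG. 8(a).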

Next, a speech synthesizer that synthesizes speech from a speech synthesis script created in this way will be described with reference to FIG. 9. This speech synthesizer can also be implemented by program processing on a computer.
In FIG. 9, the input speech synthesis script is analyzed by a script analysis unit 21 and separated into mora (character) information, pitch information (scale and pitch accent symbols), and volume information.
A phoneme data creation unit 22 retrieves from a phoneme database 23 the phoneme data matching the mora information (for example, the formant information or waveform data for "a").
A pitch sequence generation unit 24 uses the same pitch accent symbol/pitch conversion table 25 as the one used in the speech synthesis script generation device, and generates a pitch sequence (pitch transition information) by successively applying the pitch changes indicated by the pitch accent symbols to the pitch of the scale. Similarly, a volume sequence generation unit 26 generates a volume sequence (volume transition information) from the volume information.
A speech synthesis unit 27 applies pitch and volume to the phoneme data retrieved from the phoneme database 23, based on the pitch sequence and the volume sequence, and outputs speech.
Any synthesis method may be used, such as general formant synthesis or waveform synthesis, as long as it allows pitch and volume to be applied.
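The work of the pitch sequence generation unit 24 can be illustrated with a small decoder that turns accent tokens back into pitch values. As before, this is a sketch under assumptions: the token layout, the function name, and the numeric values are hypothetical, a fixed linear step of p0 per symbol stands in for the conversion table 25, and only the pitch at each T0 boundary is produced rather than a full contour.

```python
import re

# Hypothetical token layout: optional symbol plus count, then optional "-".
TOKEN = re.compile(r"(?:([$'_])(\d*))?(-?)")

def decode_pitch_track(reference, tokens, p0):
    """Return the pitch at t0, t1, ... implied by a list of accent tokens."""
    level = 0      # offset from the reference pitch, in units of p0
    track = []
    for tok in tokens:
        m = TOKEN.fullmatch(tok)
        if not tok or m is None:
            raise ValueError(f"unsupported token: {tok!r}")
        symbol, count, hold = m.groups()
        steps = int(count) if count else 1
        if hold and not track:
            track.append(reference + level * p0)   # pitch at t0
        if symbol == "'":
            level += steps                         # pitch rise
        elif symbol in ("_", "$"):
            level -= steps                         # pitch fall or down shift
        if hold:
            track.append(reference + level * p0)   # pitch after one unit T0
    return track

track = decode_pitch_track(300, ["$2", "-", "'2-", "'-", "_2-"], 10)
```

Because generator and synthesizer share the same table, decoding with the same p0 recovers the quantized contour that the script generation side chose.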

Note that the pitch accent symbols are not limited to those described above; any symbols may be used as long as they are defined in the grammar of the script.
In the above description, the reference pitch is represented by a note name (scale information), but it is not limited to this and may be represented in other ways, such as by the pitch frequency itself.
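The two representations of the reference pitch mentioned above, a note name or the pitch frequency itself, can be converted into each other. The following Python sketch assumes twelve-tone equal temperament, A4 = 440 Hz, and sharp-only note names with scientific octave numbers; none of these conventions is stated in this specification, and the function names are hypothetical.

```python
import math

# Assumed note naming: sharps only, octave numbers in scientific pitch notation.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_to_hz(name, octave, a4_hz=440.0):
    """Convert a note name and octave to a frequency in Hz (equal temperament)."""
    midi = 12 * (octave + 1) + NOTE_NAMES.index(name)  # A4 corresponds to 69
    return a4_hz * 2.0 ** ((midi - 69) / 12.0)


def nearest_note(hz, a4_hz=440.0):
    """Return the (name, octave) of the note closest to the given frequency."""
    midi = round(69 + 12 * math.log2(hz / a4_hz))
    return NOTE_NAMES[midi % 12], midi // 12 - 1


name, octave = nearest_note(450.0)  # a slightly sharp A4 still maps to ("A", 4)
```

With such a pair of functions, a script could carry either form of the reference pitch without changing the rest of the processing.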

FIG. 1 is a functional block diagram showing the configuration of an embodiment of the speech synthesis script generation device of the present invention.
FIG. 2 is a diagram showing an example of a configuration for obtaining input data by speech analysis.
FIG. 3 is a diagram showing an example of input data.
FIG. 4 is a diagram for explaining a constant pitch section and a vibrato section, in which (a) shows a constant pitch section and (b) shows a vibrato section.
FIG. 5 is a flowchart showing the flow of the constant pitch section detection process.
FIG. 6 is a flowchart showing the flow of the vibrato section detection process.
FIG. 7 is a diagram showing examples of pitch accent symbols, in which (a) shows a symbol indicating a pitch rise, (b) shows a symbol indicating a pitch fall, and (c) shows a symbol indicating a pitch shift.
FIG. 8 is a diagram showing an example of a speech synthesis script, in which (a) shows a pitch sequence and (b) shows the speech synthesis script corresponding to that pitch sequence.
FIG. 9 is a block diagram showing the configuration of a speech synthesizer that synthesizes speech using the created speech synthesis script.
FIG. 10 is a diagram showing an example of pitch in a singing voice.

Explanation of symbols

1: input data, 2: constant pitch section detection unit, 3: vibrato section detection unit, 4: scale extraction unit, 5: synthesis script creation unit, 11: segmentation unit, 12: pitch detection unit, 13: volume detection unit, 21: script analysis unit, 22: phoneme data creation unit, 23: phoneme database, 24: pitch sequence generation unit, 25: pitch accent symbol / pitch conversion table, 26: volume sequence generation unit

Claims

A speech synthesis script generation device that generates a speech synthesis script to be supplied to a speech synthesis device, comprising:
means for inputting mora information of speech to be synthesized, segment information indicating the timing at which each mora is uttered, and pitch transition information indicating the pitch change of each mora;
constant pitch section detection means for detecting, from the pitch transition information, a constant pitch section in which the variation in pitch is within a predetermined threshold;
vibrato section detection means for detecting, from the pitch transition information, a vibrato section in which the pitch oscillates;
scale extraction means for determining, as a reference pitch section, the longer of the constant pitch section detected by the constant pitch section detection means and the vibrato section detected by the vibrato section detection means, detecting a reference pitch in the reference pitch section, and outputting the scale information nearest to the reference pitch; and
synthesis script creation means for creating, based on the mora information, the segment information, the pitch transition information, and the scale information, a speech synthesis script that includes the mora information, the scale information, and pitch accent symbols representing the amount of pitch change relative to the reference pitch in predetermined time units.

The speech synthesis script generation device according to claim 1, further comprising means for extracting the mora information, the segment information, and the pitch transition information from speech waveform information.

The speech synthesis script generation device according to claim 1, wherein the reference pitch is the average pitch in the reference pitch section, the lowest pitch in the reference pitch section, or the highest pitch in the reference pitch section.

The speech synthesis script generation device according to claim 1, further comprising a table representing the correspondence between the pitch accent symbols and the pitch change patterns corresponding to those symbols, wherein the synthesis script creation means creates the speech synthesis script with reference to the table.

A speech synthesis script generation program that causes a computer to execute:
an input step of inputting mora information of speech to be synthesized, segment information indicating the timing at which each mora is uttered, and pitch transition information indicating the pitch change of each mora;
a constant pitch section detection step of detecting, from the pitch transition information, a constant pitch section in which the variation in pitch is within a predetermined threshold;
a vibrato section detection step of detecting, from the pitch transition information, a vibrato section in which the pitch oscillates;
a scale extraction step of determining, as a reference pitch section, the longer of the constant pitch section detected in the constant pitch section detection step and the vibrato section detected in the vibrato section detection step, detecting a reference pitch in the reference pitch section, and outputting the scale information nearest to the reference pitch; and
a synthesis script creation step of creating, based on the mora information, the segment information, the pitch transition information, and the scale information, a speech synthesis script that includes the mora information, the scale information, and pitch accent symbols representing the amount of pitch change relative to the reference pitch in predetermined time units.
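As an illustration only, the selection of the reference pitch section and the reference pitch recited in the claims can be sketched in Python as follows. The function names, the representation of a section as a (start, end) index pair with an exclusive end, and the sliding-window search are assumptions of this sketch rather than the detection procedure of the embodiment, and the vibrato section is taken as a given input instead of being detected.

```python
def longest_constant_section(pitches, threshold):
    """Return (start, end) of the longest run whose max - min <= threshold.

    The end index is exclusive. A simple sliding window is used here.
    """
    best = (0, 0)
    start = 0
    for end in range(1, len(pitches) + 1):
        # Shrink the window from the left until its spread fits the threshold.
        while max(pitches[start:end]) - min(pitches[start:end]) > threshold:
            start += 1
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


def reference_pitch(pitches, constant, vibrato, mode="average"):
    """Pick the longer section and return its average, lowest, or highest pitch."""
    # On a tie, max() keeps the first argument, i.e. the constant pitch section.
    start, end = max(constant, vibrato, key=lambda s: s[1] - s[0])
    values = pitches[start:end]
    if not values:
        raise ValueError("no reference pitch section found")
    if mode == "average":
        return sum(values) / len(values)
    if mode == "lowest":
        return min(values)
    if mode == "highest":
        return max(values)
    raise ValueError("unknown mode: " + mode)


pitches = [200, 230, 262, 261, 263, 262, 280, 300]  # made-up pitch values in Hz
constant = longest_constant_section(pitches, threshold=3)
ref = reference_pitch(pitches, constant, vibrato=(6, 8))
```

The resulting reference pitch would then be mapped to the nearest scale information before the script is created.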