JPH08160990A

JPH08160990A - Speech synthesizing device

Info

Publication number: JPH08160990A
Application number: JP6306165A
Authority: JP
Inventors: Kaoru Tsukamoto; 薫塚本
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1994-12-09
Filing date: 1994-12-09
Publication date: 1996-06-21

Abstract

PURPOSE: To provide a speech synthesizer that can generate more natural synthetic speech. CONSTITUTION: A text analysis unit 11 generates a phonological prosodic symbol string from input character information. A speech unit dictionary 16 records, for each speech unit stored in a speech segment data storage unit 14, its speech unit label and its storage position in the speech segment data storage unit 14. In a phoneme duration table 17, naturally uttered continuous speech is classified into phonemic environments of two or more sounds before and after, according to the frequency of use of each phoneme, and the phoneme durations in each classified environment are recorded separately for the silent part, the consonant part, and the vowel part. A synthesis parameter generation unit 13 follows the phonological prosodic symbol string and searches the phoneme duration table with the phonemic environment of each phoneme to determine its duration, thereby setting natural durations for the consonant part and the vowel part.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a speech synthesizer that synthesizes and outputs speech based on input character-string information.

[0002]

[Prior Art] A speech synthesizer that accepts character information, converts it into speech, and outputs it places no limit on the output vocabulary, so it is expected to replace record-and-playback speech technology in many fields. For example, it can read aloud text data created with a word processor, and because a response message can be created or changed simply by editing text, it can also be used in communication services such as telephone systems.

[0003]

FIG. 2 shows the configuration of a conventional speech synthesizer (Japanese text-to-speech converter) whose input is Japanese text written in mixed kanji and kana. The outline of the conventional speech synthesizer is described with reference to this figure.

[0004]

In FIG. 2, a text analysis unit (101) uses a pronunciation dictionary (102) to generate a phonological prosodic symbol string from a mixed kanji-kana sentence entered through a character information input unit (100). This symbol string, called an intermediate language, describes the reading, accent, intonation, and so on of the input sentence as a character string. The reading and accent of each word are registered in the pronunciation dictionary (102), and the text analysis unit (101) generates the phonological prosodic symbol string by consulting this dictionary.
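The dictionary lookup in [0004] can be pictured with a minimal sketch. It assumes greedy longest-match segmentation against a toy lexicon that maps surface words to katakana readings; the patent does not say how words are segmented, accents are omitted, and the names `LEXICON` and `to_reading` are hypothetical.

```python
# Hypothetical toy lexicon: surface form -> katakana reading (accents omitted).
LEXICON = {
    "これ": "コレ",
    "は": "ワ",        # the topic particle は is read as ワ
    "音声": "オンセー",
    "合成": "ゴーセー",
    "装置": "ソーチ",
    "です": "デス",
}


def to_reading(text: str, lexicon: dict) -> str:
    """Convert text to a reading by greedy longest-match dictionary lookup."""
    max_len = max(len(word) for word in lexicon)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if word in lexicon:
                out.append(lexicon[word])
                i += n
                break
        else:
            # No entry matched (e.g. punctuation): copy the character through.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_reading("これは、音声合成装置です", LEXICON))  # コレワ、オンセーゴーセーソーチデス
```

A real text analysis unit would also attach accent positions and phrase symbols; this sketch covers only the reading.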

[0005]

The synthesis parameter generation unit (103) extracts speech segments according to the phonological prosodic symbol string and generates speech synthesis parameters, such as phoneme durations and a fundamental frequency pattern, according to predetermined rules. Speech segments are generated by analyzing utterance data recorded when words and the like are pronounced; they are the basic units of speech for synthesis, and the synthetic waveform is generated by superposing them. In the following description, a combination of basic speech elements such as CV (consonant-vowel) or VCV (vowel-consonant-vowel) is itself called a speech unit, and an element that realizes the waveform of a speech unit is called a speech segment. Each speech unit corresponds, for example, to a set of several speech segments. The speech segment data are stored in a speech segment data storage unit (104), such as a ROM, and the synthesis parameter generation unit (103) identifies the speech units from the phonological prosodic symbol string and retrieves the corresponding speech segment data.

[0006]

The speech synthesis unit (105) generates a synthetic waveform (speech signal) from the synthesis parameters produced by the synthesis parameter generation unit (103). The synthetic speech signal is output as sound through a speaker (106) or transmitted to another device over a line.

[0007]

In the prior art described above, synthesis parameters such as phoneme duration are determined by predetermined rules. To make synthetic speech more natural, a second approach has been disclosed, for example in Japanese Patent Application Laid-Open No. 03-161800, in which the phonemes of real speech are analyzed for each preceding and following phonemic environment and the results are supplied through statistical processing.

[0008]

[Problems to Be Solved by the Invention] However, in the first prior art above, phoneme durations are assigned by predetermined rules from the phoneme symbol string into which the input text is converted. The resulting durations are monotonous compared with those of natural speech, and duration was often controlled only by stretching or compressing the steady-state portion of the vowel.

[0009]

The second prior art likewise considers only statistics for the individual phonemes immediately before and after the phoneme of interest, so an appropriate duration is sometimes not obtained. Furthermore, when the statistics are subdivided, it is difficult to collect statistics for every phonemic environment, and it is unclear how the duration table should be structured.

[0010]

Accordingly, the main object of the present invention is to provide a speech synthesizer that uses units such as VCV or CVC, which suffer relatively little distortion where units are joined, and that generates synthetic speech with more natural phoneme durations. To this end, a duration table is created from analysis data of natural speech, classified by environments of two or more morae before and after and subdivided efficiently according to frequency of use, and this table is consulted during synthesis.

[0011]

[Means for Solving the Problems] To solve the above problems, the present invention comprises: a text analysis unit that generates a phonological prosodic symbol string from input character information; a speech segment data storage unit that stores speech segment data; a speech unit dictionary that describes, for each speech unit stored in the speech segment data storage unit, its speech unit label, its storage position in the speech segment data storage unit, and the like; a phoneme duration table in which naturally uttered continuous speech is classified, according to the frequency of use of each phoneme, into phonemic environments of two or more sounds before and after, and in which the phoneme durations in each classified environment are recorded separately for the silent part, the consonant part, and the vowel part; and a synthesis parameter generation unit that, following the phonological prosodic symbol string, searches the phoneme duration table using the phonemic environment of each phoneme to determine its duration and sets natural durations separately for the consonant part and the vowel part.

[0012]

[Operation] In the speech synthesizer according to the present invention, the text analysis unit generates a phonological prosodic symbol string from input character information, and the speech segment data storage unit stores the speech segment data from which the synthetic speech signal is built. The speech unit dictionary describes, for each speech unit stored in the speech segment data storage unit, its speech unit label, its storage position in the speech segment data storage unit, and the like. The phoneme duration table classifies naturally uttered continuous speech, according to the frequency of use of each phoneme, into phonemic environments of two or more sounds before and after, and records the phoneme durations in each classified environment separately for the silent part, the consonant part, and the vowel part. The synthesis parameter generation unit, following the phonological prosodic symbol string, searches the phoneme duration table using the phonemic environment of each phoneme to determine its duration, and sets natural durations separately for the consonant part and the vowel part. Synthetic speech with natural phoneme durations can therefore be generated.

[0013]

[Embodiment] FIG. 1 is a functional block diagram showing the configuration of the speech synthesizer of the present invention, which comprises a character information input unit 10, a text analysis unit 11, a pronunciation dictionary 12, a synthesis parameter generation unit 13, a speech segment data storage unit 14, a speech synthesis unit 15, a speech unit dictionary 16, a duration table 17, and a speaker 18. Of these, the character information input unit 10, the pronunciation dictionary 12, the speech segment data storage unit 14, the speech synthesis unit 15, and the speaker 18 operate in the same way as the corresponding components of the conventional speech synthesizer of FIG. 2.

[0014]

The duration table 17 used by the synthesis parameter generation unit 13 in this embodiment stores durations obtained by analyzing naturally uttered speech data on the basis of a duration model.

[0015]

For each speech unit stored in the speech segment data storage unit 14, the speech unit dictionary 16 records its speech unit label (unit name), such as /aki/ or /eki/ when VCV units are used, together with its storage position in the speech segment data storage unit 14.

[0016]

The synthesis parameter generation unit 13 consults the speech unit dictionary 16 according to the phoneme symbol string, retrieves the speech segment data for each selected speech unit from the speech segment data storage unit 14, determines durations by consulting the duration table 17 with the phonemic environment and accent information of the text, and generates speech synthesis parameters such as power and the fundamental frequency pattern.

[0017]

The speech synthesizer of this embodiment, built from the units described above, operates as a whole as follows; the procedure is explained with reference to FIG. 3. First, character information (text data such as a mixed kanji-kana sentence) is input (step S201), analyzed, and converted phrase by phrase into a phonological prosodic symbol string (step S202).

[0018]

Next, following the phonological prosodic symbol string and starting from the type of the phrase-initial speech unit, the speech unit dictionary 16 is searched in turn and the speech segment data are retrieved (step S203).

[0019]

Then, for each phrase, the duration of each phoneme is determined by consulting the duration table 17 according to the phonological prosodic symbol string (step S204), and prosody parameters (parameters that specify phoneme duration, fundamental frequency pattern, power, and so on) are set (step S205).

[0020]

Once the synthesis parameters, consisting of the prosody parameters and the speech segment data, have been determined in this way, the speech signal is synthesized (step S206) and output (step S207). The signal may be output through the speaker 18 or transmitted to another device over a line.
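Steps S201 to S207 of FIG. 3 can be pictured as the loop below. This is only a sketch of the control flow: every stage is passed in as a callable, and `run_pipeline` and its parameter names are hypothetical, not part of the patent.

```python
from typing import Any, Callable, List


def run_pipeline(
    text: str,                                    # S201: input character information
    analyze: Callable[[str], List[str]],
    fetch_segments: Callable[[str], Any],
    decide_durations: Callable[[str], Any],
    set_prosody: Callable[[str, Any], Any],
    synthesize: Callable[[Any, Any], Any],
) -> List[Any]:
    """Run steps S202-S206 of FIG. 3 phrase by phrase; every stage is injected."""
    waveforms = []
    for phrase in analyze(text):                        # S202: one symbol string per phrase
        segments = fetch_segments(phrase)               # S203: unit dictionary + segment store
        durations = decide_durations(phrase)            # S204: duration table lookup
        params = set_prosody(phrase, durations)         # S205: duration, F0, power
        waveforms.append(synthesize(segments, params))  # S206: build the waveform
    return waveforms                                    # S207: caller plays or transmits


trace = run_pipeline(
    "P1 korewa|P2 onsee",
    analyze=lambda t: t.split("|"),
    fetch_segments=lambda p: f"seg[{p}]",
    decide_durations=lambda p: f"dur[{p}]",
    set_prosody=lambda p, d: f"pros[{d}]",
    synthesize=lambda s, q: f"{s}+{q}",
)
print(trace)  # ['seg[P1 korewa]+pros[dur[P1 korewa]]', 'seg[P2 onsee]+pros[dur[P2 onsee]]']
```

Injecting the stages keeps the ordering of the flowchart visible without committing to any particular implementation of each unit.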

[0021]

Next, the method of creating the duration table 17 is described in detail. First, using natural speech in which each phoneme has been labeled into its silent part, consonant part, and vowel part, durations are calculated for each preceding and following phonemic environment. FIG. 4 shows part of the classification of phonemic environments for "ka" (か). The label table is built with labels branching out in a tree, starting from the environment closest to "ka"; the same applies to the other phonemes. At the top of the classification tree are the phonemes whose consonant part belongs to the [K] group. The next level is the individual morae "ka, ki, ku, ke, ko". The level after that is the following environment; the figure shows the case where the following mora is "sa": first the [S] group, then the group "sa, shi, su, se, so". Next comes the immediately preceding environment: here, first the average over the vowel group, then each vowel individually ("a, i, u, e, o"). In this way the labels are laid out in a tree, alternating between the following and preceding environments. At the beginning or end of a word, where no phoneme precedes or follows, classification proceeds in one direction only: from a word-initial mora toward the following environment, and from a word-final mora toward the preceding environment. The resulting labels become the labels of the table.
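The alternating order of [0021] can be sketched as a function that turns a word's morae into the label path below the target mora's own node. Assumptions: morae are romanized strings, the coarse group of a mora is its consonant letter (or "V" for a bare vowel, matching the vowel-group level of FIG. 4), and the depth is capped at two morae per side; `group` and `context_path` are hypothetical names.

```python
VOWELS = "aiueo"


def group(mora: str) -> str:
    """Coarse class of a mora: its consonant letter, or 'V' for a bare vowel."""
    return "V" if mora[0] in VOWELS else mora[0].upper()


def context_path(morae: list, i: int, max_dist: int = 2) -> list:
    """Labels below the target mora's own node ([K] -> 'ka' in FIG. 4).

    Alternates following and preceding context, each as group then mora.
    When one side runs out (word start or end), only the other side is used.
    """
    path = []
    for d in range(1, max_dist + 1):
        for j in (i + d, i - d):          # following first, then preceding
            if 0 <= j < len(morae):
                path += [group(morae[j]), morae[j]]
    return path


# "akasunodeatta"; the geminate is written "t" here (an assumption).
morae = ["a", "ka", "su", "no", "de", "a", "t", "ta"]
print(context_path(morae, 1))  # ['S', 'su', 'V', 'a', 'N', 'no']
```

Skipping a missing side reproduces the one-directional classification at word edges: a word-initial mora gets only following context and a word-final mora only preceding context.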

[0022]

The speech data are divided into silent, consonant, and vowel parts according to the labels, and the duration of each part is calculated and written into the table. The amount of data decreases toward the bottom of the tree; classification is repeated until the variation in the data can no longer be statistically absorbed, for example until about 10 samples remain. As a result, data run out partway down some branches; in that case a terminal symbol is written on the label at the point where the data ran out (the * symbol is used in FIG. 4). A table created in this way describes phoneme durations in finer detail for phonemes with more data, that is, for frequently used phonemes.
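A minimal sketch of the subdivision in [0022], under stated assumptions: each sample is a tuple `(context_path, consonant_ms, vowel_ms)` whose path lists the labels in tree order, the silent part is left out for brevity, and a node with fewer than `MIN_SAMPLES` samples is marked terminal and not split further. `Node`, `build`, and `MIN_SAMPLES` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean

MIN_SAMPLES = 10  # the text says "about 10"; the exact cutoff is an assumption


@dataclass
class Node:
    consonant_ms: float   # mean consonant-part duration of the samples here
    vowel_ms: float       # mean vowel-part duration of the samples here
    count: int
    terminal: bool        # True when data ran out here (the "*" label in FIG. 4)
    children: dict = field(default_factory=dict)


def build(samples: list, depth: int = 0) -> Node:
    """samples: list of (context_path, consonant_ms, vowel_ms) tuples."""
    node = Node(
        consonant_ms=mean(s[1] for s in samples),
        vowel_ms=mean(s[2] for s in samples),
        count=len(samples),
        terminal=len(samples) < MIN_SAMPLES,
    )
    if node.terminal:
        return node                      # too few samples: stop subdividing
    groups = defaultdict(list)
    for s in samples:
        if depth < len(s[0]):            # paths may be shorter at word edges
            groups[s[0][depth]].append(s)
    for label, subset in groups.items():
        node.children[label] = build(subset, depth + 1)
    return node


# Made-up durations purely to exercise the code (not data from the patent).
samples = ([(("S", "su", "V", "a", "N", "na"), 60, 90)] * 10
           + [(("S", "su", "V", "a", "N", "no"), 40, 70)] * 2)
root = build(samples)
n_group = root.children["S"].children["su"].children["V"].children["a"].children["N"]
print(n_group.count, n_group.children["no"].terminal)  # 12 True
```

Because splitting stops wherever data are scarce, frequently seen contexts end up with deeper, finer branches, which is the frequency-dependent subdivision the paragraph describes.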

[0023]

A concrete example of determining phoneme durations with the duration table 17 (step S204 in FIG. 3) follows. The input sentence used here is "これは、音声合成装置です" ("This is a speech synthesizer"). The invention does not depend on any particular synthesis unit, but VCV units are used in this explanation.

[0024]

The text analysis unit 11 analyzes the input sentence into "Ｐ１コレワ、Ｐ２オンセーゴーセーソ’−チデスＰ０" (roughly "P1 korewa, P2 onseegooseeso'ochidesu P0"). In this way, the text analysis unit 11 converts the input sentence into a phonological prosodic symbol string by consulting the pronunciation dictionary 12, inserting phrase symbols (P0, P1, P2) and the like at the beginning, middle, or end of the sentence as needed. These phrase symbols indicate the rise and fall of a phrase at the beginning, middle, and end of the sentence.
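The phrase symbols in [0024] make it easy to cut the intermediate language into phrases. A minimal sketch, assuming a phrase symbol is always "P" followed by one digit and that a trailing "、" is a pause mark to strip; `split_phrases` is a hypothetical name.

```python
import re
import unicodedata

PHRASE_MARK = re.compile(r"(P[0-9])")  # assumed form of a phrase symbol


def split_phrases(symbols: str) -> list:
    """Split an intermediate-language string into (phrase_symbol, body) pairs."""
    text = unicodedata.normalize("NFKC", symbols)  # fullwidth "Ｐ１" -> "P1"
    parts = PHRASE_MARK.split(text)  # ['', 'P1', body1, 'P2', body2, 'P0', '']
    return [(parts[k], parts[k + 1].rstrip("、")) for k in range(1, len(parts), 2)]


print(split_phrases("Ｐ１コレワ、Ｐ２オンセーゴーセーソ’−チデスＰ０"))
# [('P1', 'コレワ'), ('P2', 'オンセーゴーセーソ’−チデス'), ('P0', '')]
```

Because the pattern has a capturing group, `re.split` keeps each phrase symbol in the output, so every symbol stays paired with the text that follows it.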

[0025]

First, for the first phrase, コレワ ("korewa"), the synthesis parameter generation unit 13 consults the speech unit dictionary 16 according to the phoneme symbol string and retrieves the speech segment data for each selected speech unit from the speech segment data storage unit 14. With VCV units, the speech segments corresponding to the three speech units /ko/, /ore/, and /ewa/ are retrieved.
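The step in [0025] can be sketched in two parts: splitting the phrase's morae into a phrase-initial CV unit followed by VCV units, then looking each label up in a unit dictionary that points into a flat segment store. Assumptions: every mora is romanized as consonant letters plus one final vowel (so the moraic nasal is not handled), and the dictionary entries and the byte store are stand-ins; `vcv_units`, `fetch_segments`, `UNIT_DICTIONARY`, and `SEGMENT_STORE` are hypothetical.

```python
def vcv_units(morae: list) -> list:
    """Turn a phrase's morae into a phrase-initial CV unit followed by VCV units."""
    units, prev_vowel = [], ""
    for mora in morae:
        consonant, vowel = mora[:-1], mora[-1]   # e.g. "ko" -> ("k", "o")
        units.append(prev_vowel + consonant + vowel)
        prev_vowel = vowel
    return units


# Hypothetical unit dictionary: label -> (offset, length) in the segment store.
UNIT_DICTIONARY = {"ko": (0, 4), "ore": (4, 5), "ewa": (9, 5)}
SEGMENT_STORE = b"<ko><ore><ewa>"  # stand-in bytes for the segment data ROM


def fetch_segments(units, dictionary=UNIT_DICTIONARY, store=SEGMENT_STORE):
    """Return the stored segment data for each unit label."""
    segments = []
    for label in units:
        offset, length = dictionary[label]
        segments.append(store[offset:offset + length])
    return segments


units = vcv_units(["ko", "re", "wa"])
print(units)                  # ['ko', 'ore', 'ewa']
print(fetch_segments(units))  # [b'<ko>', b'<ore>', b'<ewa>']
```

Storing only an offset and length per label keeps the dictionary small and lets the segment data live in read-only memory, in line with the storage positions the speech unit dictionary is said to hold.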

[0026]

Next, the duration of each phoneme is determined by consulting the duration table 17. The phonemic environment of each phoneme is shown in Table 1.

[0027]

[Table 1]

Here, the duration table 17 is consulted according to the phoneme symbol string, and the overall duration is determined by choosing the frame length and the number of frames so that the durations in the table are realized given the composition (consonant length, vowel length) of the speech segments in each speech unit. With such a duration table, durations are obtained separately for the consonant part and the vowel part, so the balance between consonant length and vowel length can be made close to that of natural speech.

[0028]

For example, suppose that for the phoneme /ko/ one frame is 8 ms at the standard speaking rate, that the table requires {ko} = {k} + {o} = 6 frames + 9 frames, and that the actual speech segment is composed of {ko} = {k} + {o} = 5 frames + 11 frames. Setting the frame length of the segment's consonant part to 9.6 ms (the 6 × 8 ms = 48 ms required by the table, spread over 5 frames) reproduces the consonant duration in the table, while for the vowel part the number of frames in the vowel steady-state portion is adjusted as before (here from 11 to 9). This gives an overall duration closer to that of natural speech.
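The arithmetic in [0028] can be checked with a short sketch. It assumes the consonant part is fitted only by changing its frame length and the vowel part only by adding or dropping steady-state frames, as the paragraph describes; `fit_to_table` is a hypothetical name, and which steady-state frames to drop is not specified by the source and not modeled here.

```python
FRAME_MS = 8.0  # frame period at the standard speaking rate, per the example


def fit_to_table(table_c, table_v, seg_c, seg_v, frame_ms=FRAME_MS):
    """Return (consonant frame length in ms, change in vowel steady-state frames).

    table_c, table_v: consonant/vowel frames requested by the duration table.
    seg_c, seg_v: consonant/vowel frames in the stored speech segment.
    """
    target_consonant_ms = table_c * frame_ms
    consonant_frame_ms = target_consonant_ms / seg_c  # stretch each consonant frame
    vowel_frame_change = table_v - seg_v               # >0 add, <0 drop steady frames
    return consonant_frame_ms, vowel_frame_change


c_frame, dv = fit_to_table(6, 9, 5, 11)
print(round(c_frame, 3), dv)  # 9.6 -2
total_ms = 5 * c_frame + (11 + dv) * FRAME_MS
print(round(total_ms, 3))     # 120.0
```

The total of 120 ms equals (6 + 9) × 8 ms, so both the consonant and the vowel durations requested by the table are met.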

[0029]

How deeply a phoneme is classified in the duration table depends on its phonemic environment; in every case, the data at the deepest node reached for that phoneme are used. As a result, more detailed and appropriate durations can be obtained for frequently used phonemes, and the speech sounds more natural. Moreover, because the table holds environments of two or more sounds before and after together with separate silent, consonant, and vowel lengths, devoiced sounds can be set automatically, and pause lengths between phrases and the lengths of silent intervals can all be determined from statistical data. As an example of using the data at the deepest node: to look up in FIG. 4 the duration of "ka" in the phoneme sequence 「明かすのであった。」 (akasunodeatta), first check the label of the following [su], then the immediately preceding [a], and then the second following [no]. Because [no] carries the terminal symbol (*), the average data of the group [n] just above [no] are used.
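The lookup rule in [0029] can be sketched as a walk down the tree that stops before any missing or terminal label. Assumptions: each node is a dict with a `terminal` flag and `children`, the path follows the alternating order of FIG. 4, and the tree below is a hand-built fragment shaped like the "akasunodeatta" example with no duration values attached; `deepest_usable` and `make_node` are hypothetical names.

```python
def deepest_usable(tree: dict, path: list) -> list:
    """Follow `path` down the tree; return the labels of the node whose average is used.

    Each node is {"terminal": bool, "children": {label: node}}. A child marked
    terminal ("*") or missing means data ran out, so the current node is used.
    """
    used, node = [], tree
    for label in path:
        child = node["children"].get(label)
        if child is None or child["terminal"]:
            break
        used.append(label)
        node = child
    return used


def make_node(terminal=False, **children):
    return {"terminal": terminal, "children": children}


# Hypothetical fragment of the "ka" subtree shaped like the FIG. 4 example.
ka_tree = make_node(S=make_node(su=make_node(V=make_node(a=make_node(
    N=make_node(no=make_node(terminal=True)))))))

print(deepest_usable(ka_tree, ["S", "su", "V", "a", "N", "no"]))
# ['S', 'su', 'V', 'a', 'N']
```

The walk stops at [N] because [no] is marked terminal, so the [n]-group average is the one read from the table, matching the example in the text.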

[0030]

[Effects of the Invention] As described in detail above, the present invention comprises: a text analysis unit that generates a phonological prosodic symbol string from input character information; a speech segment data storage unit that stores speech segment data; a speech unit dictionary that describes, for each speech unit stored in the speech segment data storage unit, its speech unit label, its storage position in the speech segment data storage unit, and the like; a phoneme duration table in which naturally uttered continuous speech is classified, according to the frequency of use of each phoneme, into phonemic environments of two or more sounds before and after, and in which the phoneme durations in each classified environment are recorded separately for the silent part, the consonant part, and the vowel part; and a synthesis parameter generation unit that, following the phonological prosodic symbol string, searches the phoneme duration table using the phonemic environment of each phoneme to determine its duration and sets natural durations separately for the consonant part and the vowel part. This configuration is particularly effective for the rise and fall of phrases and for accented phonemes, making the synthetic speech sound more like a human voice and giving a more natural impression.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech synthesizer of the present invention.

FIG. 2 is a block diagram showing the configuration of a conventional speech synthesizer.

FIG. 3 is a flowchart showing the speech synthesis procedure of the speech synthesizer of the embodiment.

FIG. 4 is a diagram showing a configuration example of the duration table.

[Explanation of symbols]

10 Character information input unit
11 Text analysis unit
12 Pronunciation dictionary
13 Synthesis parameter generation unit
14 Speech segment data storage unit
15 Speech synthesis unit
16 Speech unit dictionary
17 Duration table
18 Speaker

Claims

[Claims]

1. A speech synthesizer comprising: a text analysis unit that generates a phonological prosodic symbol string from input character information; a speech segment data storage unit that stores speech segment data; a speech unit dictionary that describes, for each speech unit stored in the speech segment data storage unit, its speech unit label and its storage position in the speech segment data storage unit; a phoneme duration table in which naturally uttered continuous speech is classified, according to the frequency of use of each phoneme, so as to have phonemic environments of two or more sounds before and after, and in which the phoneme durations in each classified phonemic environment are described separately for the silent part, the consonant part, and the vowel part; and a synthesis parameter generation unit that, in accordance with the phonological prosodic symbol string, searches the phoneme duration table using the phonemic environment of each phoneme to determine its phoneme duration and sets a natural duration separately for the consonant part and the vowel part.