JP3668583B2 - Speech synthesis apparatus and method

Info

Publication number: JP3668583B2
Application number: JP05760397A
Authority: JP
Inventors: 孝章新居
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1997-03-12
Filing date: 1997-03-12
Publication date: 2005-07-06
Anticipated expiration: 2017-03-12
Also published as: JPH10254676A

Abstract

PROBLEM TO BE SOLVED: To provide speech output whose content is easy to grasp by linking the output of a speech waveform with a display. SOLUTION: This speech synthesizer 10 comprises an input part 12, a display information extraction part 16, a language processing part 14, a phonetic processing part 18, a synthesis parameter generation part 20, a voice waveform generation part 22, an output control part 24, a voice output part 26, a display part 28 and a display condition extraction part 30. When the input text is a fixed sentence, a predetermined sentence such as an index, a title, a subject or a sender is extracted from it; when the text is judged not to be a fixed sentence, a character string that readily expresses the intention of the text is extracted instead. The output control part 24 links the voice waveform with this display information and outputs them to the voice output part 26 and the display part 28.

Description

【０００１】
【発明の属する技術分野】
本発明は、音声と同時に画面表示を行う音声合成装置及びその方法に関するものである。
【０００２】
【従来の技術】
近年においては、文字、音声、図形、映像などのマルチメディアを入力、出力及び加工することで、人間とコンピュータとの対話（Human-Computer Interaction）を様々な形態で行う研究が行なわれている。
【０００３】
一方、最近になってメモリ容量や計算機のパワーが飛躍的に向上したことで、マルチメディアを扱えるワークステーションやパーソナルコンピュータが開発され、高品質の音声や音響信号が特別なハードウエアなしに標準システムで入出力できるようになってきた。このような状況から計算機の入出力メディアとして音声入出力技術は重要なものとなっている。
【０００４】
従来から音声入出力技術の中でも、音声合成技術と呼ばれる任意の文章あるいは単語の文字を音声に変換する技術の開発が行なわれてきている。現在では、音声規則合成専用のハードウエアが開発され、テキストの読み合わせなどの応用が広がりつつある。このような技術により、メディア変換として言語的な情報である文字を音声信号に変換が可能となってきた。また、電話による音声サービスや情報携帯端末などにおいて、利用価値が高くなってきている。
【０００５】
【発明が解決しようとする課題】
しかし、電話などのように、音声のみを利用する場合、途中から聞き始めた場合、内容の把握が容易でなく、読み上げている文章が何のことを読み上げているのか把握しにくい。
【０００６】
情報携帯端末などにおける画面表示装置をもつ端末などにおいて、画面表示と音声を出力することが可能なものがあるが、携帯性を重視すると、画面表示部はどうしても小さくなり、音声を出力する場合において、画面表示においても連携して表示する場合の配慮が必要である。
【０００７】
定型文においては、予め設定した項目を表示することも可能であるが、非定型文の表示情報においては、画面表示部が小さい場合、文章の内容を全部表示することは容易でなく、次々と表示される文章を読んでいく、または音声を最後まで聞くなどしないと内容を把握しにくい場合がある。
【０００８】
そこで、本発明では、入力文章から画面表示する表示情報を適切に抽出して、音声と連携して出力し、内容の把握を向上させることを目的とする音声合成装置を提供する。
【０００９】
【課題を解決するための手段】
本発明は、テキスト文書を入力する入力手段と、前記入力したテキスト文書から音声処理に必要な言語情報を抽出する言語処理手段と、前記テキスト文書の内容や特徴を表す表示情報を、前記言語情報から抽出する表示情報抽出手段と、前記言語情報から音声波形を生成する音声波形生成手段と、前記音声波形を出力する音声出力手段と、前記表示情報を表示する表示手段と、前記音声出力手段における前記音声波形の出力状態に合わせて、前記表示情報を前記表示手段に表示するように制御する出力制御手段とを具備し、前記表示情報抽出ステップは、前記テキスト文書が定型文か、それ以外の非定型文かを判別する文体判別ステップと、前記テキスト文書が定型文であると判別した場合に、見出し、タイトル、項目、差出人などの予め設定した前記定型文における所定文章を前記テキスト文章から抽出し、この抽出した所定文章を前記表示情報とする定型文表示情報抽出ステップと、前記テキスト文書が非定型文であると判別した場合に、前記テキスト文章の意図を表現する文字列を前記テキスト文章から抽出し、この抽出した文字列を前記表示情報とする非定型文表示情報抽出ステップとを具備することを特徴とする音声合成装置である。
【００１０】
また、本発明は、入力したテキスト文書から音声処理に必要な言語情報を抽出する言語処理ステップと、前記テキスト文書の内容や特徴を表す表示情報を、前記言語情報から抽出する表示情報抽出ステップと、前記言語情報から音声波形を生成する音声波形生成ステップと、前記音声波形を出力する音声出力ステップと、前記表示情報を表示する表示ステップと、前記音声出力ステップにおける前記音声波形の出力状態に合わせて、前記表示情報を前記表示ステップにおいて表示するように制御する出力制御ステップとを具備し、前記表示情報抽出ステップは、前記テキスト文書が定型文か、それ以外の非定型文かを判別する文体判別ステップと、前記テキスト文書が定型文であると判別した場合に、見出し、タイトル、項目、差出人などの予め設定した前記定型文における所定文章を前記テキスト文章から抽出し、この抽出した所定文章を前記表示情報とする定型文表示情報抽出ステップと、前記テキスト文書が非定型文であると判別した場合に、前記テキスト文章の意図を表現する文字列を前記テキスト文章から抽出し、この抽出した文字列を前記表示情報とする非定型文表示情報抽出ステップとを具備することを特徴とする音声合成方法である。
【００１１】
上記発明であると、任意のテキスト文章が与えられたとき、その文章の構造より、文章中の見出し、タイトル、文章の意図を表現しやすい文字列などの情報と音声を連携して表示することにより、テキスト文章を音声で聞く場合に内容が把握しやすい。
【００１２】
【発明の実施の形態】
以下、本発明の一実施例を図面に従い説明する。
【００１３】
本実施例の音声合成装置１０は、図１に示すように、入力部１２、言語処理部１４、表示情報抽出部１６、音韻処理部１８、合成パラメータ生成部２０、音声波形生成部２２、出力制御部２４、音声出力部２６、表示部２８、表示状況抽出部３０からなる。
【００１４】
上述の各構成部分について説明する。
【００１５】
（言語処理部１４）
文字音声変換の場合には、入力部１２から入力された漢字仮名交じり文を解析し、音声処理に必要な情報である、語、文節の境界、漢字の読み、単語のアクセント、係り受け、及び品詞、活用形などを決定する。
【００１６】
例えば、言語処理部１４において形態素解析、構文解析、意味解析などを行うことにより、音声処理で必要な言語情報、すなわち語、文節の境界、漢字の読み、単語のアクセント、かかり受け、品詞、活用形などの情報を出力することができる。
【００１７】
形態素解析では、入力文を単語単位に分割し、それぞれの単語の品詞や活用情報などの文法情報と、語、文節の境界などを得ることができる。また、形態素解析における辞書検索結果から漢字の読みや、単語のアクセントなどを得ることができる。ここでは、辞書には各単語の読み、アクセントが付加されているものを用いる。
【００１８】
また、構文解析では、形態素解析で得られた語、文節に対して、構文木などを作成して解析し、各語、文節間の係り受けや文節間の結合度の情報などを求めることができる。例えば、例文「議員は選挙の準備を開始する。」に対して構文解析を行なうと、図２に示すような構文情報が得られる。
【００１９】
意味解析では、構文解析で得られる単語の意味属性などを含んだ解析結果から、１つ１つの文の意味構造を出力することができる。
【００２０】
（表示情報抽出部１６）
表示情報抽出部１６では、表示部２８に表示すべき内容情報を抽出し、出力制御部２４に送る。表示情報としては、入力文章に対する定型文のフォーマットや文章のレイアウトを予め保持しておき、入力文章中において、文章におけるタイトル、見出し、電子メール等のサブジェクト、差出人などのフォーマットに該当する部分を表示情報とする。非定型文は、上述した形態素解析で得られた品詞情報や構文解析・意味解析などで得られた情報から表示すべき情報を抽出する。
【００２１】
抽出された表示情報は、出力制御部２４へ送られる。また、言語処理部１４から得られた情報はそのまま音韻処理部１８へ送られる。
【００２２】
（音韻処理部１８）
音韻処理部１８では、漢字の読みや辞書項目などの情報から単語の音韻属性などを生成し、これとともにピッチパターンを生成するためのフレーズ立ち上げ位置やフレーズ指令の大きさ、ポーズ位置、ポーズ長などの韻律制御情報を求める。得られた音韻属性・韻律制御情報から、韻律記号列と単音記号列を生成し、合成パラメータ生成部２０に出力する。合成パラメータ生成部２０では、音韻処理部１８の結果を基にして、音声合成器を駆動するための合成パラメータ系列を生成する。この系列は、音声波形生成部２２に送られて、出力部２６を通じて出力される。
【００２３】
（合成パラメータ生成部２０・音声波形生成部２２）
上記の音韻記号列・韻律記号列を合成パラメータ生成部２０で音声合成器を駆動するための合成パラメータ系列を生成し、音声波形生成部２２に送り、音声波形が生成される。
【００２４】
（表示状況抽出部３０）
表示状況抽出部３０は、現在画面に表示されている表示情報を出力制御部２４に送り、出力制御部２４において制御する。
【００２５】
（出力制御部２４）
出力制御部２４は、音声波形生成部２２で生成された音声波形と、表示情報抽出部１６で抽出された表示情報とを同期して音声出力部２６、表示部２８に出力する。表示情報は、音声波形が出力開始と同時、または直前など、表示する状況を予め設定することができる。また、表示情報を表示する時間は、音声波形の出力終了時と同時、または終了する直前まで、または事前に設定されていた時間内など、表示する状況を予め設定することができる。
【００２６】
また、表示情報を時間毎に変更するなど、予め設定することができる。
【００２７】
また、表示状況抽出部３０から得られる画面表示情報により、他の情報が表示されていないかどうかを判断し、適宜表示情報を出力する。
【００２８】
（具体例）
以下、本発明の音声合成装置１０における具体例を説明する。
【００２９】
図３は、本実施例における表示部２８を有する情報携帯端末１００を示す摸式図である。
【００３０】
図３に示すように、情報携帯端末１００は、音声を出力するためにスピーカなどの音声出力部２６を有し、また液晶画面などのように文字情報を表示するための表示部２８を有しているものとする。
【００３１】
定型文の場合
まず、図４、図５に示すような電子メールやタイトル・見出しを含む文章などの定型文に対する場合の例を示す。図５において、「………」の部分は、任意の文章があるものとする。
【００３２】
図１の入力部１２から送られた例文は、言語処理部１４において、音韻処理以降で利用すべき情報が生成される。
【００３３】
次に、表示情報抽出部１６において、図６に示すフローチャートに従って処理される。
【００３４】
まず、入力文章に対して定型文であるかどうかの判定を行なう（ステップａ１）。
【００３５】
本装置１０における定型文は、表示情報抽出部１６において予め設定することができる。
【００３６】
例えば、図４の電子メールの場合は、入力文章において、“To: ”，“From: ”，“Subject:”，“Date: ”などといった文字列が含まれる文章を電子メールの定型文として判別することができる。または、入力文章中の先頭に“To: ”の文字列が存在するなどにより判別することができる。
【００３７】
同様に、図５に示すような例の場合、入力文章の先頭の“「」”で囲まれた文字列と行頭に連番で数字のついた文字列が含まれる文章であるかどうかで定型文であるかどうかを判別することができる。このように、定型的な文字列が含まれる、または入力文章中における定型的な場所に特定の文字列があるなどを取得することにより、定型文であるかどうかの判定を行なうことができる。
【００３８】
先の判定により、定型文でならば、次のステップへ進み、表示すべき文字列を抽出する（ステップａ２）。
【００３９】
予め抽出する文字列は設定することができ、図４の電子メールのサブジェクト部分を表示情報とすると設定した場合、定型的に得られた入力文章中のヘッダ部に書かれている“Subject:”を含む文字列１行のテキストを抽出することができる。この例の場合、「Subject:テスト」という文字が表示情報として抽出され、出力制御部２４へ送られる。
【００４０】
同様に、図５においても、見出しとなる部分を表示情報とすると設定した場合、連番で数字のついた文字列が含まれる１行を抽出することができる。この場合、表示情報１〜４で示される「１．目次」、「２．概要」、「３．本文」、「４．まとめ」といった文字列が表示情報として抽出され、出力制御部２４へ送られる（ステップａ４）。
【００４１】
次に、言語処理部１４により得られた言語情報は、表示情報抽出部１６から、音韻処理部１８、合成パラメータ生成部２０、音声波形手段により、音声波形として生成される。
【００４２】
次に出力制御部２４においては、図７に示すフローチャートに従って処理される。
【００４３】
出力制御部２４に音声波形が入力されると、まず出力する音声波形の有無を確認する（ステップｂ１）。
【００４４】
表示波形がない場合は終了する（ステップｂ２）。
【００４５】
次に、表示情報抽出部１６から得られた表示情報があるかどうかを確認する（ステップｂ３）。
【００４６】
表示情報がない場合は、次のステップに進む（ステップｂ５）。
【００４７】
表示情報が、ある場合は、予め設定した方法で表示情報である文字列を表示部２８に出力する（ステップｂ４）。
【００４８】
例えば、図４に示した電子メールの場合、予め出力制御部２４において、音声波形出力と同時に表示部２８で表示情報を表示し、音声出力終了と同時に表示を終了すると設定されているものとする。この設定では、表示情報である「Subject:テスト」という文字列が表示部２８に送られ、同時に音声波形を音声出力部２６に生成された音声波形が送られる。
【００４９】
次に、表示の終了は、予め設定内容に従って、音声波形の出力が終了したと同時に（ステップｂ６）、表示部２８への表示情報送信を停止する（ステップｂ７）。
【００５０】
また、図５に示した定型文の場合、予め出力制御部２４において、各章の音声出力と同時に見出しを表示するし、音声出力が終了すると同時に表示を終了するように設定されているものとする。出力制御部２４では、設定通りに、まず表示部２８に、音声波形と最初の見出しである「１．目次」を送る。次にその章の音声出力が終了したと同時に、次の見出し「２．概要」を表示部２８に出力し、同様の手順で表示部２８に、次の見出し「３．本文」、「４．まとめ」送る制御を行なう。次に、表示の終了は、予め設定内容に従って、音声波形の出力が終了したと同時に（ステップｂ６）、表示部２８への表示情報送信を停止する（ステップｂ７）。
【００５１】
表示状況抽出部３０においては、上述した出力制御部２４における、表示を開始するように表示部２８に表示情報を送る場合において、事前に現在の表示部２８における表示状況を監視し、表示状況抽出部３０から得られる情報により、現在表示されている情報を出力制御部２６に送る。例えば、出力制御部２４において、表示部２８に現在表示されている文字列の表示を停止し、表示情報を表示すると予め設定されていた場合、直前の文字列の表示を停止して、新たに表示情報を表示するように表示部２８を制御することができる。
【００５２】
定型文においては、電子メールに限らず、タイトルや見出しを抽出することが可能で、または、アンダーラインや、記号で囲まれている文字列など、予め定型文の構造を設定しておくことにより、同様に表示情報を表示し、音声波形と連携して出力することができる。
【００５３】
非定型文の場合
次に、先の定型文に該当しない文（以下、非定型文という）について、例えば、図８に示すような入力文章の場合の例を示す。
【００５４】
非定型文は、表示情報抽出部１６において、図６に示すフローチャートに従って処理され、言語解析情報から得られる情報を元に、予め設定した手法により、表示情報とすべき該当する文字列を抽出する（ステップａ３）。
【００５５】
例えば、文章中における体言語で名詞である属性の出現頻度の高い単語の文字列を表示情報とするとした場合、図８における先頭の１文において、言語解析で得られる情報は、図２に示したような情報が得られ、この文章においては、体言語で名詞の単語を抽出し、出現頻度を数える。この文の場合、「議員」、「選挙」、「準備」の単語が抽出される。同様に次の文における上記属性の単語が抽出され、それぞれ出現頻度が数えられる。この例文の場合、「選挙」の文字列が３回出現し、出現頻度が高いので、表示情報として選定され、出力制御部２４に送られる。
【００５６】
ここで、表示情報とすべき該当する文字列の抽出においては、出現頻度のみから選定してもよいし、構文構造から得られる主部における単語や目的語となる単語、意味情報として得られる重要度の高い単語など、言語情報における抽出対象を設定できる。また単語に限らず、文節や、１文などを表示情報とすることもできる。
【００５７】
表示情報抽出部１６以降の処理は、先の定型文における処理と同様に行なうことができる。
【００５８】
以上のように、表示情報抽出部１６において、表示すべき文章を抽出し、音声波形の出力と連携して、音声出力部２６、表示部２８に出力することにより、内容を把握しやすくすることが可能となる。
【００５９】
【発明の効果】
以上説明したように、本発明であると、任意のテキスト文章が与えられたとき、その文章の構造より、文章中の見出し、タイトル、文章の意図を表現しやすい文字列などの情報と音声を連携して表示することにより、テキスト文章を音声で聞く場合に内容を把握しやすい。
【図面の簡単な説明】
【図１】音声合成装置のブロック図である。
【図２】構文解析によって得られた構文情報の図面である。
【図３】表示部を有する情報携帯端末を示す摸式図である。
【図４】電子メールを示す図である。
【図５】定型文を示す図である。
【図６】表示情報抽出部における処理を示すフローチャートである。
【図７】出力制御部における処理を示すフローチャートである。
【図８】非定型文を示す図である。
[0001]
[Technical field to which the invention belongs]
The present invention relates to a speech synthesis apparatus that displays information on a screen simultaneously with speech output, and to a method therefor.
[0002]
[Prior art]
In recent years, research has been conducted on various forms of human-computer interaction based on the input, output, and processing of multimedia such as text, speech, graphics, and video.
[0003]
Meanwhile, recent dramatic improvements in memory capacity and computing power have led to workstations and personal computers that can handle multimedia, and high-quality speech and audio signals can now be input and output on standard systems without special hardware. In this situation, speech input/output technology has become important as an input/output medium for computers.
[0004]
Among speech input/output technologies, speech synthesis, which converts the characters of an arbitrary sentence or word into speech, has long been under development. Dedicated hardware for rule-based speech synthesis now exists, and applications such as reading text aloud are spreading. Such technology has made it possible to convert characters, which carry linguistic information, into speech signals as a form of media conversion. Its utility is also growing in telephone voice services and portable information terminals.
[0005]
[Problems to be solved by the invention]
However, when only speech is used, as on a telephone, a listener who starts listening partway through cannot easily grasp the content, and it is hard to tell what the text being read aloud is about.
[0006]
Some terminals with a screen display device, such as portable information terminals, can output both a screen display and speech. When portability is a priority, however, the screen display unit is inevitably small, so care is needed in how the screen display is coordinated with the speech output.
[0007]
For fixed sentences, preset items can be displayed. For atypical sentences, however, a small screen display unit cannot easily show the entire text, and the content may be hard to grasp unless the user reads the text as it is displayed piece by piece or listens to the speech to the end.
[0008]
The present invention therefore provides a speech synthesis apparatus that appropriately extracts, from the input text, the display information to be shown on the screen and outputs it in coordination with the speech, with the aim of making the content easier to grasp.
[0009]
[Means for Solving the Problems]
The present invention is a speech synthesis apparatus comprising: input means for inputting a text document; language processing means for extracting, from the input text document, the language information necessary for speech processing; display information extraction means for extracting, from the language information, display information representing the content and characteristics of the text document; speech waveform generation means for generating a speech waveform from the language information; audio output means for outputting the speech waveform; display means for displaying the display information; and output control means for controlling the display means so that the display information is displayed in accordance with the output state of the speech waveform at the audio output means. The display information extraction step comprises: a style determination step of determining whether the text document is a fixed sentence or an atypical sentence; a fixed sentence display information extraction step of, when the text document is determined to be a fixed sentence, extracting from the text a predetermined sentence preset for that fixed sentence, such as a headline, title, item, or sender, and using the extracted sentence as the display information; and an atypical sentence display information extraction step of, when the text document is determined to be an atypical sentence, extracting from the text a character string that expresses the intention of the text and using the extracted character string as the display information.
[0010]
The present invention is also a speech synthesis method comprising: a language processing step of extracting, from an input text document, the language information necessary for speech processing; a display information extraction step of extracting, from the language information, display information representing the content and characteristics of the text document; a speech waveform generation step of generating a speech waveform from the language information; an audio output step of outputting the speech waveform; a display step of displaying the display information; and an output control step of controlling the display step so that the display information is displayed in accordance with the output state of the speech waveform in the audio output step. The display information extraction step comprises: a style determination step of determining whether the text document is a fixed sentence or an atypical sentence; a fixed sentence display information extraction step of, when the text document is determined to be a fixed sentence, extracting from the text a predetermined sentence preset for that fixed sentence, such as a headline, title, item, or sender, and using the extracted sentence as the display information; and an atypical sentence display information extraction step of, when the text document is determined to be an atypical sentence, extracting from the text a character string that expresses the intention of the text and using the extracted character string as the display information.
[0011]
According to the above invention, when an arbitrary text is given, information such as headlines, titles, and character strings that readily express the intention of the text is identified from the structure of the text and displayed in coordination with the speech, so the content is easy to grasp when listening to the text read aloud.
[0012]
[Embodiments of the Invention]
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
[0013]
As shown in FIG. 1, the speech synthesizer 10 of the present embodiment consists of an input unit 12, a language processing unit 14, a display information extraction unit 16, a phonological processing unit 18, a synthesis parameter generation unit 20, a speech waveform generation unit 22, an output control unit 24, an audio output unit 26, a display unit 28, and a display status extraction unit 30.
[0014]
Each of the above components is described below.
[0015]
(Language processing unit 14)
In the case of text-to-speech conversion, the language processing unit 14 analyzes the mixed kanji-kana text input from the input unit 12 and determines the information necessary for speech processing: word and phrase boundaries, kanji readings, word accents, dependency relations, parts of speech, conjugated forms, and so on.
[0016]
For example, by performing morphological analysis, syntactic analysis, semantic analysis, and the like, the language processing unit 14 can output the language information needed for speech processing, namely word and phrase boundaries, kanji readings, word accents, dependency relations, parts of speech, and conjugated forms.
[0017]
Morphological analysis divides the input sentence into words and yields grammatical information such as the part of speech and conjugation of each word, along with word and phrase boundaries. Kanji readings and word accents can be obtained from the dictionary lookup results of the morphological analysis. The dictionary used here is one in which each entry is annotated with its reading and accent.
[0018]
Syntactic analysis builds a syntax tree or similar structure over the words and phrases obtained by morphological analysis and derives information such as the dependency relations between words and phrases and the degree of connection between phrases. For example, syntactic analysis of the example sentence “The legislator starts preparing for the election.” yields syntax information such as that shown in FIG. 2.
[0019]
Semantic analysis can output the semantic structure of each individual sentence from analysis results that include the semantic attributes of the words obtained by syntactic analysis.
[0020]
(Display information extraction unit 16)
The display information extraction unit 16 extracts the content information to be displayed on the display unit 28 and sends it to the output control unit 24. For fixed sentences, the formats and layouts of fixed sentences are stored in advance, and the portions of the input text that match such a format, for example a title, a headline, the subject of an e-mail, or the sender, are used as the display information. For atypical sentences, the information to be displayed is extracted from the part-of-speech information obtained by the morphological analysis described above and from information obtained by syntactic and semantic analysis.
[0021]
The extracted display information is sent to the output control unit 24. The information obtained from the language processing unit 14 is passed unchanged to the phonological processing unit 18.
[0022]
(Phonological processing unit 18)
The phonological processing unit 18 generates the phonological attributes of each word from information such as kanji readings and dictionary entries, and also determines prosodic control information for generating a pitch pattern, such as phrase onset positions, phrase command magnitudes, pause positions, and pause lengths. From the resulting phonological attributes and prosodic control information, it generates a prosodic symbol string and a phonetic symbol string and outputs them to the synthesis parameter generation unit 20. Based on the output of the phonological processing unit 18, the synthesis parameter generation unit 20 generates a synthesis parameter sequence for driving the speech synthesizer. This sequence is sent to the speech waveform generation unit 22, and the result is output through the audio output unit 26.
[0023]
(Synthesis parameter generation unit 20 / voice waveform generation unit 22)
From the above phoneme symbol string and prosodic symbol string, the synthesis parameter generation unit 20 generates a synthesis parameter sequence for driving the speech synthesizer and sends it to the speech waveform generation unit 22, which generates the speech waveform.
[0024]
(Display status extraction unit 30)
The display status extraction unit 30 sends the information currently displayed on the screen to the output control unit 24, which uses it to control the display.
[0025]
(Output control unit 24)
The output control unit 24 outputs the speech waveform generated by the speech waveform generation unit 22 and the display information extracted by the display information extraction unit 16 to the audio output unit 26 and the display unit 28 in synchronization. When the display information appears can be set in advance, for example at the same time as the start of speech waveform output or immediately before it. How long the display information remains can also be set in advance: until the speech waveform output ends, until just before it ends, or for a preset length of time.
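The timing options in this paragraph can be pictured as a small configuration object. The following is only a sketch under stated assumptions: the names `StartMode`, `EndMode`, `DisplayTiming`, `margin_s`, and `fixed_s` are hypothetical and do not appear in the patent, and the half-second margin used for the "just before" settings is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class StartMode(Enum):
    WITH_AUDIO = "with_audio"        # show when audio output starts
    BEFORE_AUDIO = "before_audio"    # show shortly before audio output starts


class EndMode(Enum):
    WITH_AUDIO_END = "with_audio_end"      # hide when audio output ends
    BEFORE_AUDIO_END = "before_audio_end"  # hide shortly before audio output ends
    FIXED_DURATION = "fixed_duration"      # hide after a preset time


@dataclass
class DisplayTiming:
    start_mode: StartMode = StartMode.WITH_AUDIO
    end_mode: EndMode = EndMode.WITH_AUDIO_END
    margin_s: float = 0.5  # hypothetical lead/lag for the "just before" modes
    fixed_s: float = 3.0   # hypothetical preset display time

    def window(self, audio_start: float, audio_len: float) -> Tuple[float, float]:
        """Return (show_time, hide_time) in seconds for one utterance."""
        if self.start_mode is StartMode.BEFORE_AUDIO:
            show = audio_start - self.margin_s
        else:
            show = audio_start
        if self.end_mode is EndMode.WITH_AUDIO_END:
            hide = audio_start + audio_len
        elif self.end_mode is EndMode.BEFORE_AUDIO_END:
            hide = audio_start + audio_len - self.margin_s
        else:
            hide = show + self.fixed_s
        return (show, hide)
```

With the default settings, an utterance that starts at 10.0 s and lasts 4.0 s would be accompanied by a display from 10.0 s to 14.0 s.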
[0026]
The display information can also be set in advance to change over time.
[0027]
In addition, the output control unit 24 uses the screen display information obtained from the display status extraction unit 30 to determine whether other information is currently being displayed, and outputs the display information accordingly.
[0028]
(Specific example)
A specific example of the speech synthesizer 10 of the present invention is described below.
[0029]
FIG. 3 is a schematic diagram of a portable information terminal 100 having the display unit 28 of the present embodiment.
[0030]
As shown in FIG. 3, the portable information terminal 100 is assumed to have an audio output unit 26, such as a speaker, for outputting speech, and a display unit 28, such as a liquid crystal screen, for displaying character information.
[0031]
In the case of a fixed sentence
First, an example is given for fixed sentences, such as an e-mail or a text containing a title and headlines, as shown in FIGS. 4 and 5. In FIG. 5, the portions shown as “………” are assumed to contain arbitrary text.
[0032]
For the example text sent from the input unit 12 in FIG. 1, the language processing unit 14 generates the information to be used in the phonological processing and subsequent stages.
[0033]
Next, the display information extraction unit 16 processes the text according to the flowchart shown in FIG. 6.
[0034]
First, it is determined whether or not the input sentence is a fixed sentence (step a1).
[0035]
What counts as a fixed sentence in the apparatus 10 can be set in advance in the display information extraction unit 16.
[0036]
For example, in the case of the e-mail shown in FIG. 4, an input text containing character strings such as “To:”, “From:”, “Subject:”, and “Date:” can be identified as an e-mail fixed sentence. Alternatively, it can be identified by the presence of the character string “To:” at the beginning of the input text.
[0037]
Similarly, in the case of an example such as FIG. 5, whether the input is a fixed sentence can be determined by whether it contains a character string enclosed in “「」” at the beginning of the input text and character strings prefixed with sequential numbers at the beginnings of lines. In this way, whether the input is a fixed sentence can be determined by detecting that it contains characteristic character strings, or that specific character strings appear at characteristic positions in the input text.
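The determination in step a1 might be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the marker tuple, and the regular expression are assumptions, and for brevity the sketch only checks that numbered lines exist, not that their numbers are sequential.

```python
import re

# Header markers named in the description; treating any one as sufficient is an assumption.
EMAIL_MARKERS = ("To:", "From:", "Subject:", "Date:")

# A line that starts with a number and a period, e.g. "1. Overview" or "１．目次".
NUMBERED_LINE = re.compile(r"^\s*[0-9０-９]+[.．]")


def looks_like_email(text: str) -> bool:
    """True if any line begins with one of the header markers."""
    return any(line.lstrip().startswith(EMAIL_MARKERS) for line in text.splitlines())


def looks_like_titled_outline(text: str) -> bool:
    """True if the first non-empty line is enclosed in 「」 and a later line is numbered."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    has_title = lines[0].startswith("「") and lines[0].endswith("」")
    has_numbered = any(NUMBERED_LINE.match(ln) for ln in lines[1:])
    return has_title and has_numbered


def is_fixed_form(text: str) -> bool:
    """Step a1 (sketch): decide whether the input is a fixed sentence."""
    return looks_like_email(text) or looks_like_titled_outline(text)
```

Keeping each format in its own predicate mirrors the idea that the set of recognized fixed sentences is configurable: supporting another layout would mean adding another predicate.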
[0038]
If the above determination finds that the text is a fixed sentence, the process proceeds to the next step and extracts the character string to be displayed (step a2).
[0039]
The character string to be extracted can be set in advance. If the subject of the e-mail in FIG. 4 is set as the display information, the one line of text containing the character string “Subject:” can be extracted from the header portion of the input text, whose position is known from the fixed format. In this example, the character string “Subject: Test” is extracted as the display information and sent to the output control unit 24.
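A sketch of the subject-line extraction in step a2 follows. The function name `extract_header_line` is hypothetical, and the rule that the header ends at the first blank line is an assumption borrowed from common e-mail layout, not something the description states.

```python
from typing import Optional


def extract_header_line(text: str, marker: str = "Subject:") -> Optional[str]:
    """Return the first header line that starts with `marker`, or None.

    Assumption: the header block ends at the first blank line, so a
    matching line in the message body is ignored.
    """
    for line in text.splitlines():
        if not line.strip():
            break  # blank line: end of the assumed header block
        if line.lstrip().startswith(marker):
            return line.strip()
    return None
```

Passing a different `marker`, such as `"From:"`, corresponds to choosing the sender rather than the subject as the preset display item.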
[0040]
Similarly, in FIG. 5, if the headline portions are set as the display information, each line containing a character string prefixed with a sequential number can be extracted. In this case, the character strings “1. Table of Contents”, “2. Overview”, “3. Body”, and “4. Summary”, indicated as display information 1 to 4, are extracted as the display information and sent to the output control unit 24 (step a4).
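The headline extraction could be sketched like this. The name `extract_numbered_headings` and the regular expression are assumptions, and requiring the numbers to run 1, 2, 3, ... is one possible reading of "sequential number".

```python
import re
from typing import List

# A number, a half-width or full-width period, then the heading text.
HEADING = re.compile(r"^\s*([0-9０-９]+)[.．]\s*(.+?)\s*$")


def extract_numbered_headings(text: str) -> List[str]:
    """Collect lines whose leading numbers run 1, 2, 3, ... in order."""
    headings = []
    expected = 1
    for line in text.splitlines():
        match = HEADING.match(line)
        # int() accepts full-width digits such as "１" as well as ASCII digits.
        if match and int(match.group(1)) == expected:
            headings.append(line.strip())
            expected += 1
    return headings
```

Checking the expected number filters out body lines that merely start with a number out of sequence.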
[0041]
Next, the language information obtained by the language processing unit 14 is passed on from the display information extraction unit 16 and converted into a speech waveform by the phonological processing unit 18, the synthesis parameter generation unit 20, and the speech waveform generation unit 22.
[0042]
Next, the output control unit 24 performs processing according to the flowchart shown in FIG. 7.
[0043]
When a voice waveform is input to the output control unit 24, first, the presence / absence of a voice waveform to be output is confirmed (step b1).
[0044]
If there is no speech waveform, the process ends (step b2).
[0045]
Next, it is confirmed whether there is display information obtained from the display information extraction unit 16 (step b3).
[0046]
If there is no display information, the process proceeds to the next step (step b5).
[0047]
If there is display information, a character string as display information is output to the display unit 28 by a preset method (step b4).
[0048]
For example, in the case of the e-mail shown in FIG. 4, assume that the output control unit 24 has been set in advance to show the display information on the display unit 28 at the same time as speech waveform output starts and to end the display at the same time as speech output ends. With this setting, the character string “Subject: Test”, which is the display information, is sent to the display unit 28, and at the same time the generated speech waveform is sent to the audio output unit 26.
[0049]
The display is then ended according to the preset settings: at the same time as output of the speech waveform finishes (step b6), transmission of the display information to the display unit 28 is stopped (step b7).
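The flow of steps b1 to b7 can be sketched as a single function. This is an illustration under assumptions: `play`, `show`, and `clear` are hypothetical callbacks standing in for the audio output unit 26 and the display unit 28, `play` is assumed to block until the audio finishes, and step b5 is taken to be the audio output itself, which the description does not state explicitly.

```python
from typing import Callable, Optional


def run_output_control(
    waveform: Optional[bytes],
    display_info: Optional[str],
    play: Callable[[bytes], None],
    show: Callable[[str], None],
    clear: Callable[[], None],
) -> bool:
    """Sketch of the output control flow; returns False if nothing was output."""
    # b1/b2: with no speech waveform, there is nothing to do.
    if not waveform:
        return False
    # b3/b4: show the display information, if any, when audio output starts.
    shown = False
    if display_info:
        show(display_info)
        shown = True
    # b5 (assumed): output the speech waveform; play() blocks until it ends.
    play(waveform)
    # b6/b7: once the audio has finished, stop the display.
    if shown:
        clear()
    return True
```

This corresponds to the "show with the audio, hide when the audio ends" setting; other timing settings would move the `show` and `clear` calls relative to `play`.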
[0050]
In the case of the fixed sentence shown in FIG. 5, assume that the output control unit 24 has been set in advance to display each headline at the same time as the speech output of its chapter starts and to end the display at the same time as the speech output ends. Following this setting, the output control unit 24 first outputs the speech waveform and sends the first headline, “1. Table of Contents”, to the display unit 28. When the speech output of that chapter finishes, it outputs the next headline, “2. Overview”, to the display unit 28, and by the same procedure sends the subsequent headlines, “3. Body” and “4. Summary”, to the display unit 28. The display is then ended according to the preset settings: at the same time as output of the speech waveform finishes (step b6), transmission of the display information to the display unit 28 is stopped (step b7).
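The per-chapter behaviour described here might look like the following sketch. The function `play_chapters` and its callbacks are hypothetical; the pairing of each headline with its own waveform is an assumption about how the chapters are segmented, and `play` is again assumed to block until the chapter's audio ends.

```python
from typing import Callable, List, Tuple


def play_chapters(
    chapters: List[Tuple[str, bytes]],
    play: Callable[[bytes], None],
    show: Callable[[str], None],
    clear: Callable[[], None],
) -> None:
    """Show each headline while its chapter is read aloud, then clear the display."""
    for heading, waveform in chapters:
        show(heading)   # replaces the previous headline on the display
        play(waveform)  # assumed to return when this chapter's audio ends
    if chapters:
        clear()         # steps b6/b7: end the display with the last chapter
```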
[0051]
Before the output control unit 24 described above sends display information to the display unit 28 to start a display, the display status extraction unit 30 monitors the current display status of the display unit 28 and sends the currently displayed information to the output control unit 24. For example, if the output control unit 24 has been set in advance to stop displaying the character string currently shown on the display unit 28 and show the new display information, it can control the display unit 28 so that the preceding character string is removed and the new display information is displayed.
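One way to picture this check is the sketch below. The `Display` class, its `current` attribute, and the `replace_existing` flag are all hypothetical stand-ins: `current` plays the role of the status reported by the display status extraction unit 30, and the flag plays the role of the preset policy in the output control unit 24.

```python
from typing import Optional


class Display:
    """Toy display that remembers what it is currently showing."""

    def __init__(self) -> None:
        self.current: Optional[str] = None

    def show(self, text: str) -> None:
        self.current = text

    def clear(self) -> None:
        self.current = None


def show_with_policy(display: Display, new_text: str, replace_existing: bool = True) -> bool:
    """Show new_text, first removing any existing text if the policy allows it."""
    if display.current is not None:
        if not replace_existing:
            return False  # leave the existing display untouched
        display.clear()   # stop displaying the preceding character string
    display.show(new_text)
    return True
```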
[0052]
Fixed sentences are not limited to e-mail: titles and headlines can also be extracted. Likewise, by defining the structure of a fixed sentence in advance, for example underlined text or character strings enclosed in symbols, display information can be extracted in the same way and output in coordination with the speech waveform.
[0053]
In the case of an atypical sentence
Next, an example is given for text that does not correspond to the fixed sentences above (hereinafter referred to as an atypical sentence), using an input text such as that shown in FIG. 8.
[0054]
An atypical sentence is processed by the display information extraction unit 16 according to the flowchart shown in FIG. 6: based on the information obtained from language analysis, the character string to be used as the display information is extracted by a preset method (step a3).
[0055]
For example, suppose the display information is set to be the most frequently occurring word among the nominals (taigen) in the text whose part of speech is noun. For the first sentence in FIG. 8, language analysis yields information such as that shown in FIG. 2; the nominal nouns in this sentence are extracted and their occurrences are counted. For this sentence, the words “legislator”, “election”, and “preparation” are extracted. Words with the same attributes are likewise extracted from the following sentences, and their occurrences are counted. In this example text, the character string “election” appears three times, which is a high frequency, so it is selected as the display information and sent to the output control unit 24.
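The frequency-based selection can be sketched with a counter over part-of-speech-tagged tokens. This sketch assumes the morphological analysis has already run and delivers `(surface, part_of_speech)` pairs; the tag name `"noun"` and the function name are hypothetical. Only the first sentence below comes from the description; the second and third are invented stand-ins, since the text of FIG. 8 is not reproduced here.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)


def most_frequent_noun(sentences: Iterable[List[Token]]) -> Optional[str]:
    """Return the noun that occurs most often across all sentences, or None."""
    counts = Counter(
        surface
        for sentence in sentences
        for surface, pos in sentence
        if pos == "noun"
    )
    if not counts:
        return None
    # Among equally frequent nouns, the one counted first is returned.
    return counts.most_common(1)[0][0]


tagged_text = [
    # "The legislator starts preparing for the election."
    [("legislator", "noun"), ("election", "noun"), ("preparation", "noun"), ("start", "verb")],
    # Hypothetical follow-on sentences, for illustration only.
    [("election", "noun"), ("date", "noun"), ("announce", "verb")],
    [("candidate", "noun"), ("election", "noun"), ("campaign", "noun")],
]
display_info = most_frequent_noun(tagged_text)
```

Here "election" is counted three times and every other noun once, so it is the word selected for display.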
[0056]
The character string to be used as the display information may be selected on the basis of frequency alone, or the extraction target may be set from other language information, such as the words in the subject or the object obtained from the syntactic structure, or words of high importance obtained as semantic information. The display information is also not limited to a word: a phrase, a whole sentence, or the like can be used instead.
[0057]
Processing downstream of the display information extraction unit 16 can be performed in the same way as for the fixed sentences described above.
[0058]
As described above, the display information extraction unit 16 extracts the text to be displayed, and this text is output to the display unit 28 in coordination with the output of the speech waveform to the audio output unit 26, which makes the content easier to grasp.
[0059]
[Effects of the Invention]
As described above, according to the present invention, when an arbitrary text is given, information such as headlines, titles, and character strings that readily express the intention of the text is identified from the structure of the text and displayed in coordination with the speech, so the content is easy to grasp when listening to the text read aloud.
[Brief description of the drawings]
FIG. 1 is a block diagram of a speech synthesizer.
FIG. 2 is a diagram of syntax information obtained by syntax analysis.
FIG. 3 is a schematic diagram showing an information portable terminal having a display unit.
FIG. 4 is a diagram showing an electronic mail.
FIG. 5 is a diagram showing a fixed sentence.
FIG. 6 is a flowchart showing processing in a display information extraction unit.
FIG. 7 is a flowchart showing processing in the output control unit.
FIG. 8 is a diagram showing an atypical sentence.

Claims

A speech synthesis apparatus comprising:
input means for inputting a text document;
language processing means for extracting language information necessary for speech processing from the input text document;
display information extraction means for extracting display information representing the content and characteristics of the text document from the language information;
speech waveform generation means for generating a speech waveform from the language information;
audio output means for outputting the speech waveform;
display means for displaying the display information; and
output control means for controlling the display information to be displayed on the display means in accordance with the output state of the speech waveform in the audio output means,
wherein the display information extraction means includes:
style determination means for determining whether the text document is a fixed sentence or an atypical sentence;
fixed sentence display information extraction means for, when the text document is determined to be a fixed sentence, extracting from the text a predetermined sentence preset for the fixed sentence, such as a headline, a title, an item, or a sender, and using the extracted sentence as the display information; and
atypical sentence display information extraction means for, when the text document is determined to be an atypical sentence, extracting from the text a character string expressing the intention of the text and using the extracted character string as the display information.

A speech synthesis method comprising:
a language processing step of extracting language information necessary for speech processing from an input text document;
a display information extraction step of extracting display information representing the content and characteristics of the text document from the language information;
a speech waveform generation step of generating a speech waveform from the language information;
an audio output step of outputting the speech waveform;
a display step of displaying the display information; and
an output control step of controlling the display information to be displayed in the display step in accordance with the output state of the speech waveform in the audio output step,
wherein the display information extraction step includes:
a style determination step of determining whether the text document is a fixed sentence or an atypical sentence;
a fixed sentence display information extraction step of, when the text document is determined to be a fixed sentence, extracting from the text a predetermined sentence preset for the fixed sentence, such as a headline, a title, an item, or a sender, and using the extracted sentence as the display information; and
an atypical sentence display information extraction step of, when the text document is determined to be an atypical sentence, extracting from the text a character string expressing the intention of the text and using the extracted character string as the display information.