JP3707872B2 - Audio output apparatus and method

Info

Publication number: JP3707872B2
Application number: JP23740396A
Authority: JP
Inventors: 重宣瀬戸; 宏之坪井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1996-03-18
Filing date: 1996-09-09
Publication date: 2005-10-19
Anticipated expiration: 2016-09-09
Also published as: JPH09311775A

Abstract

PROBLEM TO BE SOLVED: To output text information by voice in a manner that corresponds to the reference level, by switching the read-aloud style to suit a careful read or a skip read according to the reference level for document information, for example by varying the vocalizing speed of the synthesized voice during a skip read. SOLUTION: A read-aloud style management part 44 of a voice synthesis part 40 generates a control parameter series using the control rule set corresponding to each document reference level, reflecting the document reference level judged from user requests made through a user input part 20. For example, when the document reference level is judged to be a skip read, a control rule set that increases the vocalization speed, or one that reduces the width of pitch fluctuations, is used to generate the synthesized voice. Conversely, when a careful read is judged, a control rule set that increases the width of pitch and power fluctuations is used to generate the synthesized voice.

Description

【０００１】
【発明の属する技術分野】
本発明は、文書情報中のテキスト情報は少なくとも音声に変換して出力する音声出力装置及びその方法に関する。
【０００２】
【従来の技術】
入力テキストから合成音声を得るＴＴＳ（Ｔｅｘｔ−Ｔｏ−Ｓｐｅｅｃｈ）変換の研究は以前から行われており、例えば、文献：岩田他：“パソコン向けソフトウェア日本語テキスト音声合成，”日本音響学会講演論文集，２−８−１３，ｐｐ．２４５−２４６（平成５年１０月）にあるようなＴＴＳシステムは現在までに数多く実現されている。これらのＴＴＳシステムでは、様々な他のアプリケーションからテキストを受け取り、合成音声で読み上げる１つのアプリケーションやハードウェアとして実現されている。
【０００３】
一方、単調でない合成音声を生成のための知見を得るために、読み上げスタイル、即ち、アナウンス調や朗読調や対話調や、あるいは、早口やゆっくりとした口調や高い声や低い声などを様々に変え、その音声のピッチや発話速度などを変えた場合の特徴を分析した研究が報告されている。例えば、阿部他：“異なる初話様式の特徴分析，”電子情報通信学会技術研究報告，ＳＰ９３−７，ｐｐ．３７−４２（平成５年５月）や、阪田他：“対話音声の韻律的特徴と合成，”電子情報通信学会研究報告，ＳＰ９５−１７，ｐｐ．５５−６２（平成７年５月）や、新居他：“発生速度を考慮したピッチパターン生成規則の利用，”日本音響学会講演論文集，２−０−１０，ｐｐ．２３８−２８４（平成６年３月）などがその分析に関する研究報告である。
【０００４】
読み上げスタイルは、ピッチや発話速度、パワーなどの韻律パラメータやその時間変化や、音声合成のためのデータベース中の素片そのものやその選択規則などに深く関係しており、読み上げスタイルを変えた合成音声を生成することは、すなわち、これらの制御パラメータを、それぞれのスタイルに対応するようにその値やその時間変化を制御することに対応する。
【０００５】
ところで、音声合成技術は、視覚障害者用の画面読み上げソフトとしての応用システムがいくつか開発されている。例えば、文献、斎藤：“視覚障害者支援ソフトウェアの製作，”情報処理，Ｖｏｌ．３６，Ｎｏ．１２，ｐｐ．１１１６−−１１２１（Ｄｅｃ．１９９５）にあるように、操作時に視覚的なフィードバックの困難な視覚障害者が、キーの操作だけで文書を読み上げ可能となるための工夫が行われている。具体的には、読み上げスキップ繰り返し再生などの読み上げ方を変えたり、画面表示メモリを監視して、画面に表示しつつあるテキストの読み上げやユーザのカーソル操作等によってすでに表示された指定箇所のテキスト読み上げ、キー操作による読み上げ中断、文字種や音声読み上げ操作キー種に応じた文章読み、音読み、詳細（例えば「田んぼの田」）読みを自動変更する機能が開発されている。このスキップ、繰り返し再生などの読み上げ方に関する指示は、視覚障害者が指示を即座に出せるように、あらかじめこれらの指示に直接対応づけられたキーをユーザが操作することによって実現している。また、基本的に読み上げ機能は、視覚障害者が視認できない画面上のテキストの表示箇所を把握するためのものであり、読み上げ箇所は、画面に表示しようとしているか、あるいは今、画面中に表示されているテキストである。
【０００６】
【発明が解決しようとする課題】
一般に、じっくり参照したい文書を読む際には、声に出して音読するのと同じ程度の速度で読んでもさほどストレスを感じにくいが、斜め読みないしは目視検索をする場合は、音読のスピードよりもずっと早く目を走らせるのが普通である。したがって、文書中のテキストを合成音声で読み上げる際にも、斜め読みをしている文書の場合はじっくり参照したい文書の場合よりもてきぱきとした読み上げスタイルで読み上げる方がより自然である。
【０００７】
また、じっくり読む場合と斜め読みする場合とでは、文書中のテキストのうち、どこに注意を払うか、あるいは、どこに注目して読むかが変わってくる。例えば、じっくり読みの場合には文書を隅々まで読む傾向があるのが、長い文書を斜め読みする場合には文書の大まかな構造に注意を払ったり、多くの文書を目視検索する場合には文書のスタイルなど特定の箇所に注目する。
【０００８】
本発明では、呈示している文書のテキストのうちのどこを読み上げるかを、文書の参照状況に応じて切替える。例えば、じっくり読む場合には文書中のテキストを順に全て読み上げたり、斜め読みの場合にはまず文書構造上のキーとなるテキストを読み上げるなど、文書の参照状況に応じて読み上げ対象となる箇所を切り替える。
【０００９】
ところで、このような斜め読みやじっくり読みといった文書の参照状況に応じて、合成音声の読み上げスタイルや読み上げ箇所を決めるという音声読み上げ機能の利用は、これまで行なわれてこなかった。
【００１０】
従来のＴＴＳシステムにおいて、合成音声の読み上げスタイルは、あらかじめ行っておいた発話速度やピッチの高さなどの設定にしたがって、読み上げ対象としてユーザが指定したテキスト全体に対して一様に付与される。また、ユーザが読み上げスタイルの設定を明示的に変更し、その後ユーザが設定を変更するまでのあいだ同じ読み上げスタイルで読み上げを続ける。これは、単に音声合成をテキストから音声へのメディア変換機能としてとらえ、読み上げる合成音声を生成する際に可変なパラメータの設定をユーザに委ねているだけに過ぎない。
【００１１】
視覚障害者用の画面読み上げの場合も、操作キーにあらかじめ対応づけられた読み上げスタイルが対応づけられていて、ユーザはカーソルによって読み上げ箇所を指定し、読み上げ操作キーによって読み上げスタイルを選択しているに過ぎない。
【００１２】
【課題を解決するための手段】
文書情報に対するユーザの呈示要求の時間間隔か呈示要求の入力頻度のうち少なくとも一方を検出して、前記呈示要求の時間間隔か呈示要求の入力頻度のうち少なくとも一方が短いほど、前記ユーザの前記文書情報に対する参照レベルが低くなるように前記参照レベルを求める参照レベル判断手段と、この参照レベル判断手段で求めた参照レベルから読上げスタイルを決定し、この決定された読上げスタイルに対応した制御規則セットを用いて前記文書情報中のテキスト情報は少なくとも音声に変換して出力する音声合成手段を有することを特徴とするものである。
【００１３】
次に、参照する文書情報を表示する表示手段を有し、この表示手段に同時に表示可能な文字数もしくは行数もしくは表示面積と参照する文書情報中に含まれるテキストの文字数もしくは行数もしくは文書情報の表示面積の比、ないしは、表示文字の大きさ、が大きいほど、前記ユーザの前記文書情報に対する参照レベルが低くなるように参照レベルを求める参照レベル判断手段と、この参照レベル判断手段で求めた参照レベルから読上げスタイルを決定し、この決定された読上げスタイルに対応した制御規則セットを用いて前記文書情報中のテキスト情報は少なくとも音声に変換して出力する音声合成手段を有することを特徴とするものである。
【００１４】
参照する文書情報を表示する表示手段を有し、表示手段に同時に表示可能な文字数もしくは行数もしくは表示面積と参照する文書情報中に含まれるテキスト情報の文字数もしくは行数もしくは文書情報の表示面積の比、ないしは表示文字の大きさが小さいほど、前記ユーザの前記文書情報に対する参照レベルが低くなるように参照レベルを求める参照レベル判断手段と、この参照レベル判断手段で求めた参照レベルから読上げスタイルを決定し、この決定された読上げスタイルに対応した制御規則セットを用いて前記文書情報中のテキスト情報は少なくとも音声に変換して出力する音声合成手段を有することを特徴とするものである。
【００１８】
次に、ユーザからの文字列入力を受け付けるユーザ入力手段と、ユーザ入力手段から入力された文字列を検索対象文字列として参照する文書情報中を検索する文字列検索手段と、この文字列検索手段が前記検索対象文字列を前記参照する文章情報中に検出した箇所の数が多いほど参照レベルを高くなるように参照レベルを求める参照レベル判断手段と、この参照レベル判断手段で求めた参照レベルから読上げスタイルを決定し、この決定された読上げスタイルに対応した制御規則セットを用いて前記文書情報中のテキスト情報は少なくとも音声に変換して出力する音声合成手段を有することを特徴とするものである。
【００１９】
更に、ユーザからの単語入力を受け付けるユーザ入力手段と、ユーザ入力手段から入力された単語の同義語を得る同義語辞書と、前記同義語を検索対象文字列として参照する文書情報中を検索する文字列検索手段と、この文字列検索手段が前記検索対象文字列を前記参照する文章情報中に検出した箇所の数が多いほど参照レベルを高くなるように参照レベルを求める参照レベル判断手段と、この参照レベル判断手段で求めた参照レベルから読上げスタイルを決定し、この決定された読上げスタイルに対応した制御規則セットを用いて前記文書情報中のテキスト情報は少なくとも音声に変換して出力する音声合成手段を有することを特徴とするものである。
【００２０】
更には、文書情報に対するユーザの参照時間か参照速度のうち少なくとも一方を検出するステップと、前記参照時間か参照速度のうち少なくとも一方が短いほど、前記ユーザの前記文書情報に対する参照レベルが低くなるように前記参照レベルを求めるステップと、求められた前記参照レベルから読上げスタイルを決定し、この決定された読上げスタイルに対応した制御規則セットを用いて前記文書情報中のテキスト情報は少なくとも音声に変換して出力するステップを有することを特徴とするものである。
【００２１】
【発明の実施の形態】
図１は、本実施例の音声出力装置１０の基本構成を示している。
音声出力装置１０は、ユーザ入力部２０、文書入力部３０、音声合成部４０、文書参照レベル管理部５０より構成される。
【００２２】
図２に、より詳細な構成図を示す。以下、各装置を説明して行く。
文書入力部における処理
文書入力部３０は、文書引出し部３２と文書蓄積部３４とを有する。
文書入力部３０の扱う文書はテキストが含まれており、文書中のテキストを音声合成部４０が読上げる。
【００２３】
なお、「テキスト」とは、文字あるいは記号をいい、数字も含む。また、「文書」とは、テキストが含まれていることを前提としているが、テキストだけでなく画像やオーディオデータ等を含むマルチメディアドキュメントであっても構わない。また、ＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａｎｇｕａｇｅ）文書やオンラインドキュメント（オンラインヘルプ）のようなハイパードキュメントであってもよい。
【００２４】
文書引出し部３２は、文書蓄積部３４から文書を引き出す。
文書蓄積部３４は、ディスクやメモリ等の記憶デバイスであって、内部に文書を蓄積しており、文書入力部３０の内部に設けられている。また、ネットワークで接続された外部に置かれ、ネットワークアドレスや文書探索パスによって文書が管理されている。
【００２５】
文書入力部３０は、新しい文字の呈示を要求するユーザ入力がユーザ入力部２０から伝えられると、文書引き出し部３２が対応する文書を文書蓄積部３４から引き出す。
【００２６】
文書にアクセスするための方法は、アドレスやファイル探索パス等の形でユーザからの指示内容として指定されているとする。このような文書管理方法は、ネットワークファイルシステムあるいはＷＷＷ（ＷｏｒｌｄＷｉｄｅＷｅｂ）等で既に実現されている方法が利用できる。
【００２７】
音声合成部における通常の処理
音声合成部４０は、テキスト解析部４２、読上げスタイル管理部４４、音声合成器４６とを有する。
【００２８】
音声合成部４０は、文書引出し部３２が新しい文書を引き出すと、その文書あるいは文書に含まれるテキストの内容はテキスト解析部４２に送られ、その解析結果は読上げスタイル管理部４４へ、さらに音声合成器４６に順次送られて、テキストが合成音声で読上げられる。なお、「読上げ箇所」とは、音声合成器４６で文書中のテキストのうち、合成音声で読上げる部分をいう。
【００２９】
テキスト解析部４２から音声合成器４６までの一連の処理は、既存のＴＴＳ（Ｔｅｘｔ−Ｔｏ−Ｓｐｅｅｃｈテキストからの音声合成）の技術が利用できる。
【００３０】
例えば、佐藤他「日本語テキストからの音声合成」研究実用化報告，Ｖｏｌ．３２，Ｎｏ．１１，ｐｐ．２２４３−２２５２（１９８３年）はその例であり、具体的には、下記のようにする。
【００３１】
テキスト解析を行ない、ピッチやパワーや発話速度等の韻律的な制御や音声素片の選択及び接続、さらには音質制御等音韻的な制御を行ない、波形の生成を行なう音声合成の処理のうち、韻律的な制御や音韻的な制御が読上げスタイル管理部４４の処理に相当し、波形生成処理が音声合成器４６で行なう処理に相当する。
【００３２】
さらに、詳しく説明する。
（テキスト解析部４２）
テキスト解析部４２において文書中のテキスト解析を行い、読みとアクセントやイントネーションを与える韻律制御の単位を決定する。読みやアクセントやイントネーションを与える韻律制御の単位は、言語解析に利用する辞書の中にある各語彙毎の読みやアクセントに関する情報や、品詞や活用に関する文法情報等から規則的に決定される。
【００３３】
（読上げスタイル管理部４４）
読上げスタイル管理部４４は、アナウンス調や朗読調や対話調や、あるいは、早口やゆっくりとした口調や高い声や低い声といった声の調子になるように、合成音声のピッチや発話速度やパワーといった韻律的特徴パラメータや、音質を左右する音声素片の選択やスペクトル変形等の音韻的特徴パラメータの制御を行なう。
【００３４】
なお、「読上げスタイル」とは、アナウンス調や朗読調や対話調、あるいは、早口やゆっくりとした口調、高い声や低い声といった「声の調子」を指し、その違いは、具体的には、声のピッチや発話速度やパワー等の韻律的な特徴パラメータや声質のスペクトル的な特徴パラメータの制御を変えることにより実現する。
【００３５】
読上げスタイル管理部４４は、指定された読上げスタイルに対応する制御規則セットを用いて、韻律的あるいは音韻的特徴パラメータを制御する。これらの制御規則は、所望の読上げスタイルの自然音声の発声データを多く集め、韻律的あるいは音韻的特徴パラメータの分析を行ない、パラメータの制御規則を導出する。
【００３６】
このような規則導出の例として、阪田他「対話音声の韻律的特徴と合成」電子情報通信学会技術研究報告，ＳＰ９５−１７，ｐｐ．５５−６２（平成７年５月）が挙げられる。
【００３７】
具体的には、対話音声の発声データを多く集め、声のピッチの制御パラメータを分析し、合成時にピッチ制御モデルに与えるパラメータとして妥当な値を決定することにより、制御規則を導出する。これらのパラメータ値は、同一の読み上げスタイルに対して文中の位置やアクセント型の違い等に応じてそれぞれ妥当な値を決定しておく。この対応をピッチ制御規則と呼ぶ。
【００３８】
これらの読上げスタイル毎に導出された規則は、図３に示すように、読上げスタイルと組み合わせた表の形で読上げスタイル管理部４４の中に格納され、読上げスタイルを指定することにより対応する規則が合成時に適用される。合成時には、入力テキストのテキスト解析により文中の位置やアクセント型を得て、上記規則からそれぞれに対応するパラメータを得て合成音声の生成が可能になる。
【００３９】
ここではピッチ制御について説明したが、他のパラメータについても同様の方法が適用できる。
文書参照レベル管理における処理
文書参照レベル管理部５０は、対応表５２とレベル判断部５４とを有する。
【００４０】
文書参照レベル管理部５０では、図２に示すように、レベル判断部５４において、新しい文書の呈示の要求の入力状況を監視し、ユーザの文書の参照レベルを判断する。
【００４１】
なお、「参照レベル」とは、文書をユーザがどのように読んでいるかを示す状況であり、例えば、斜め読み、じっくり読んでいるかの状態を、段階的に表したものである。そして、参照レベルは、今の瞬間だけで決定されるとは限らず、むしろ、時間的に若干の幅を持ってとらえた文書を読んでいる状態であり、例えば、同じ文書を呈示状態を変えず（頁送りやスクロールや拡大をせず）にいるとか、頻繁に次々と新しい文書を呈示するように要求しているといった状況を想定している。
【００４２】
参照レベルを判断する方法としては、下記のようなものがある。
（１）第１の方法
図４の（例１）に示すように、新しい文書の呈示要求が入力される毎に要求が入力された時刻Ｔ０を対応表５２に記憶しておく。
【００４３】
レベル判断部５４は、新しい文書の呈示要求が入力されると、直前まで呈示されていた文書の呈示を要求した時刻Ｔ１との差を調べ、その時間差Ｔ０−Ｔ１がある閾値Ｔｔｈよりも長ければじっくりと参照していると判断し、逆に短ければ斜め読みしていると判断する。
【００４４】
この処理のフローを図５に示す。図５において、ステップ１１からステップ１４の流れはテキストからの音声合成を行なう処理であり、ステップ２１からステップ２４の流れは文書参照レベルを決定する処理である。この場合、ステップ２３１あるいはステップ２３２における文書参照レベルの決定処理は、ステップ１３における文書の参照レベルに応じた読上げスタイルの変更処理よりも先に行なう必要があるが、それ以外は、ステップ１１からステップ１４までの処理の流れとステップ２１からステップ２４の処理の流れは並行に処理してよい。
【００４５】
（２）第２の方法
音声合成は比較的処理時間がかかることが多いため、対照となるテキスト全体を一度に処理せずに、句読点や構文上の境界等で切れ目を設定し、切れ目までのテキストを一回の処理単位として音声合成することがある。図６にその場合の処理を示す。
【００４６】
図６におけるテキスト解析処理（ステップ１２）では、解析の最後の段階で切れ目を見つけるための処理（ステップ１２２）を行なう。ステップ１２２から参照レベルに対応した読上げスタイルによる制御パラメータ生成処理（ステップ１３）及び制御パラメータにしたがった音声合成処理（ステップ１４）までを、テキスト全体の処理が終わるまで繰り返し処理する。
【００４７】
（３）第３の方法
また、これらのような直前の文書呈示要求からの時間差に注目する方法だけでなく、文書の呈示要求の単位時間あたりの入力回数等で得た入力頻度に対して同様にある閾値を設定して斜め読みをしているかじっくり読んでいるかを判断してもよい。
【００４８】
（４）第４の方法
図４の（例２）に示すように、時間経過に重みづけをして足し合わせたパラメータｐをある閾値Ｐｔｈと比較して判断してよい。
【００４９】
この場合、これらのユーザ入力部２０のインタフェースに対するユーザの操作開始時刻と操作終了時刻を検知しその時間差からユーザの操作持続時間を求め、その値が閾値を越えた場合に、その文書に対する参照レベルを低くするように判断してもよい。これはユーザ入力部２０のインタフェースに対する継続的な操作を文書の拾い読みないしは早読みを要求するユーザ入力と判断して、参照レベルを低くすることに相当する。
【００５０】
また逆に、ユーザの操作持続時間を求め、その値が閾値を越えた場合に、その文書に対する参照レベルを高くするように判断してもよい。これは、ユーザ入力部２０のインタフェースに対する継続的な操作を文書の精読を要求するユーザ入力と判断して、参照レベルを高くすることに相当する。
【００５１】
さらに、ユーザの操作持続時間のほか、ユーザ入力部２０のインタフェースに対するユーザの操作の単位時間当たりの頻度に対する閾値を設定し、ユーザ操作の頻度が閾値を越えるか否かに応じて参照レベルを設定してもよい。
【００５２】
また、ユーザ入力部２０がインタフェースを複数個持つ場合には、それぞれユーザの「精読」「拾い読み」「早読み」の要求として判断して、対応する参照レベルを設定してもよい。
【００５３】
（５）第５の方法
文書参照レベルは、上記のように自動的に判断する以外に、ユーザが直接に入力することによって変更してもよい。例えば、斜め読みやじっくり読みといった切替えを行なうボタンやスイッチをユーザ入力部２０のインタフェースとして利用して、ユーザの操作によって文書参照レベルを切り替えさせる。
【００５４】
また、図７に示すインタフェースのようにじっくり読んでいる状況から斜め読み状況までの間に中間的なレベルの状況を設けて、それをスライダーでユーザにレベル調整させてもよい。図７の（ａ）は、スライダにより文書参照レベルを設定させるインタフェースでの説明図であり、図７の（ｂ）は、ボタンにより文書参照レベルを設定させるインタフェースの説明図である。
【００５５】
また、ユーザ入力部２０のインタフェースとして音声認識部を有し、例えば「詳細」「詳しく」「え？」などの音声認識結果が得られた場合に、その文書に対する参照レベルを高くするように判断し、「スキップ」「飛ばす」などの音声認識結果が得られた場合に、その文書に対する参照レベルを低くするように判断してもよい。
【００５６】
音声合成部における文書参照レベルによる処理
このような文書参照レベルを反映して、音声合成部４０における読上げスタイル管理部４４ではそれぞれの文書参照レベルに対応する制御規則セットを用いて制御パラメータ系列を生成する。
【００５７】
例えば、文書参照レベルが斜め読みと判断されている場合は、発話速度を速める制御規則セットを用いたり、ピッチの上げ下げの幅を小さくする制御規則セットを用いて、合成音声を生成する。逆に、じっくりと読んでいると判断される場合は、ピッチやパワーの上げ下げの幅を大きくする制御規則セットを用いて合成音声を生成する。もちろん、この例の通りだけではなく、斜め読みと判断される場合にはピッチやパワーの上げ下げの幅を大きく、逆にじっくり読んでいると判断される場合にはピッチやパワーの上げ下げ幅を小さくする制御規則セットを用いて合成音声を生成してもよい。
【００５８】
図３に、このような読上げスタイル管理部４４における処理を、文書参照レベルと読上げスタイルの対応表及び読上げスタイルと制御規則セットの対応表によって実現する例を示す。
【００５９】
図８のフローチャートに示すように、音声合成部４０におけるテキスト解析部４２では、言語解析時に利用する辞書の品詞や活用等の文法情報等を用いて、情報量の少ない語かどうかを判定することができる。この例では、自立語であるか付属語であるかを分類して、自立語であれば情報量が多い語、付属語であれば情報量が少ない語であると判定する。また、副詞的に用いられる名詞や、代名詞等、自立語であるもののその語自体の持つ意味が弱い場合は、情報量が少ない語であると判定してもよい。
【００６０】
情報量が少ない語であると判定された場合は、図９のフローチャートに示すように、読上げスタイル管理部４４において、発話速度を速めるように読上げスタイルを変更し、それにしたがった合成音声を生成する。あるいは、テキスト解析部４２において、重要でない語を除いた句の連鎖に変えたものに対して、読みとアクセントやイントネーションを与える韻律制御の単位を決定し、この合成音声を生成してもよい。
【００６１】
音声合成部４０のテキスト解析部４２において、ＨＴＭＬやＳＧＭＬ、ＴｅＸのような構造化タグの埋められた文書の構造解析を行なうことにより、文書中のテキストの構造が得ることができる。これは構造化のためのタグを検索することにより容易に構造が得られる。この場合、文書参照レベル管理部５０で管理している文書参照レベルに応じて、テキストの読上げる箇所を変えることが可能になる。
【００６２】
例えば、図１０のフローチャートに示すように、文書参照レベルが斜め読みの場合には、構造上の上位のテキストから順次読上げ、文の構造を音声で呈示することが可能になる。
【００６３】
また、このような構造化されたテキストを読上げる場合、例えば、章タイトルを読上げる場合、スクロール方向の先にある章タイトルを順次読上げ、章タイトルを読上げている際にユーザ入力部２０から指示入力があると、その章のテキストが見えるように表示画面上でジャンプしたり、あるいはその章の下位構造にあたる節のテキストを読上げることができる。
【００６４】
なお、上記処理をＦＤやＣＤ−ＲＯＭ等の記録媒体に記憶させておき、音声合成装置を有するコンピュータにインストールして、上記処理をこのコンピュータに実行させてもよい。
【００６５】
第２の実施例
第２の実施例の音声出力装置１００を図１１に示す。
第１の実施例の音声出力装置１０と異なる構成は、ユーザに呈示する文書呈示部６０を加えることにある。この文書呈示部６０は文書引出し部３２から受け取った文書を呈示する。
【００６６】
「呈示」とは、ディスプレイを用いた視覚的表示を基本とするが、視覚的なディスプレイに限定するものではない。例えば、オーディオデータを含む場合はその再生出力も呈示の一種とみなす。
【００６７】
「呈示状態」とは、文書内容をユーザに出力している状態。視覚的にどのくらいの大きさに表示しているとか、文書全体のうちのどの部分を表示していて、どの部分が表示されずに隠れているといった（今の瞬間における）表示状態。オーディオ出力の音量や全体でどれだけの長さのオーディオデータのうちどこまで再生しているといった（今の瞬間における）再生状態をいう。
【００６８】
文書呈示部６０は、ディスプレイのような視覚的な出力手段を持つ場合、ユーザの指示に応じて拡大や縮小等の呈示状態を変えることができる。また、文書の大きさがディスプレイの大きさに収まらない状態はページめくりやスクロール等を行なうことができる。これらの呈示状態の変更はユーザ入力部２０を介して行なわれる。
【００６９】
文書呈示部６０が、マルチメディア文書のオーディオデータを再生する場合は、このような呈示状態の変更は、音量の調整や巻戻しや早送りや中断等に相当する。
【００７０】
ユーザ入力部２０は、例えば、キーあるいはマウスやペン、タッチパネル等のポインティング機能を持つデバイスをディスプレイと組み合わせたタブレットのようなデバイスとしてもよい。あるいはスイッチ、ボタンあるいはダイヤル等のレベル設定可能なデバイスとして実装してもよい。
【００７１】
本発明であると、ユーザが文書を斜め読み、または、精読しているか等のユーザの関心度に応じて合成音声の読上げスタイルを変化させて出力することができる。
【００７２】
このように文書呈示部６０を持つ場合、拡大や縮小、スクロール等の文書の呈示状態の変更を要求するユーザからの入力の頻度を調べて、入力頻度が高ければ同じ文書を丁寧に参照しているとみなし、じっくりと読んでいると判断することもできる。あるいは、スクロールや頁めくり等の文書の呈示状態のへ変更を要求するユーザからの入力の頻度を同様に調べて、入力頻度が高ければ、斜め読みしていると判断できる。このような入力の頻度や時刻のチェックは、図２に示した方法と同様に、ユーザ入力部２０へのユーザからの入力された時刻を記録する対応表を利用して、図１１のように行なうことができる。
【００７３】
なお、上記処理をＦＤやＣＤ−ＲＯＭ等の記憶媒体に記憶させておき、音声合成装置を有するコンピュータにインストールして、上記処理をこのコンピュータに実行させてもよい。
第３の実施例
第３の実施例の音声出力装置２００を図１２に示す。
【００７４】
この音声出力装置２００は、文書引出し部３２が新たに呈示するために文書の引出しをしている際の文書データの流量について文書参照レベル管理部５０がチェックし、単位時間あたりの流量が多ければ飛ばし読みしていると判断することもできる。
【００７５】
なお、上記処理をＦＤやＣＤ−ＲＯＭ等の記憶媒体に記憶させておき、音声合成装置を有するコンピュータにインストールして、上記処理をこのコンピュータに実行させてもよい。
【００７６】
【発明の効果】
本発明により、文書ブラウザへのユーザの操作から、じっくり読んだり斜め読みしているという状況に適した読み上げスタイルを自動的に切り替える。また、斜め読み時の合成音声の発話速度を変えることにより、斜め読みをスピーディに行なえる。また、斜め読み時に、ピッチ制御の話調成分を大きくしたりベースラインを高くとることにより声の高い読み上げを行なうことにより、文書を視覚的に確認しなくても、ブラウジングモードにあるかどうかを耳で判断できる。
【００７７】
さらに、文書の構造解析を行って文書中のテキストの構造を調べ、斜め読みと判断された時には、画面に表示しきれなかった文書中のテキストの章タイトルとか節タイトルなどの階層上の上位のテキストを順次読み上げ、画面を確認しなくても音声で文構造を示すことができる。
【図面の簡単な説明】
【図１】第１の実施例の音声出力装置のブロック図である。
【図２】第１の実施例の音声出力装置の詳細なブロック図である。
【図３】読上げスタイル管理部での処理図である。
【図４】レベル判断部における文書参照レベルの判断を示す図である。
【図５】ユーザによる新しい文書の呈示要求の入力時刻の間隔からの文書参照レベルの決定のフローチャートである。
【図６】ユーザによる新しい文書の呈示要求の入力時刻の間隔からの文書参照レベルの決定のフローチャートである。
【図７】（ａ）は、スライダにより文書参照レベルを設定させるインタフェースでの説明図であり、
（ｂ）は、ボタンにより文書参照レベルを設定させるインタフェースの説明図であり、
【図８】品詞と情報量の対応テーブルを用いた読上げテキスト中の語の情報量判定を示す図である。
【図９】読上げテキスト中の語の情報量の判定結果を用いた読上げのフローチャートである。
【図１０】文書参照レベルに応じた構造化文書に対する読上げ箇所のフローチャートである。
【図１１】第２の実施例の音声出力装置の詳細なブロック図である。
【図１２】第３の実施例の音声出力装置の詳細なブロック図である。
【符号の説明】
２０…ユーザ入力部
３０…文書入力部
３２…文書引出し部
３４…文書蓄積部
４２…テキスト解析部
４４…読上げスタイル管理部
４６…音声合成器
５０…文書参照レベル管理部
５２…対応表
５４…レベル判断部

[0001]
[Technical field of the invention]
The present invention relates to an audio output apparatus and method for converting at least text information in document information into audio and outputting the audio.
[0002]
[Prior art]
Research on TTS (Text-To-Speech) conversion, which obtains synthesized speech from input text, has been conducted for a long time, and many TTS systems have been realized to date, such as the one described in Iwata et al., "Software Japanese text-to-speech synthesis for personal computers," Proceedings of the Acoustical Society of Japan, 2-8-13, pp. 245-246 (October 1993). These TTS systems are implemented as a single application or piece of hardware that receives text from various other applications and reads it aloud with synthesized speech.
[0003]
Meanwhile, to obtain knowledge for generating synthesized speech that is not monotonous, studies have been reported that vary the reading style (announcement, recitation, or conversational tone, or fast or slow speech, or a high or low voice) and analyze the resulting characteristics, such as how the pitch and speech rate of the voice change. Examples of such analyses include Abe et al., "Characteristic analysis of different speaking styles," IEICE Technical Report, SP93-7, pp. 37-42 (May 1993); Sakata et al., "Prosodic features and synthesis of conversational speech," IEICE Technical Report, SP95-17, pp. 55-62 (May 1995); and Arai et al., "Use of pitch pattern generation rules that take speech rate into account," Proceedings of the Acoustical Society of Japan, 2-0-10, pp. 238-284 (March 1994).
[0004]
The reading style is closely related to prosodic parameters such as pitch, speech rate, and power and their changes over time, and to the speech units in the synthesis database and the rules for selecting them. Generating synthesized speech in a different reading style therefore amounts to controlling the values of these control parameters, and their changes over time, so that they correspond to the respective style.
[0005]
Meanwhile, several systems that apply speech synthesis technology as screen reading software for visually impaired users have been developed. For example, as described in Saito, "Production of support software for the visually impaired," Information Processing, Vol. 36, No. 12, pp. 1116-1121 (Dec. 1995), measures have been devised so that visually impaired users, for whom visual feedback during operation is difficult, can have documents read aloud using key operations alone. Specifically, functions have been developed that change how text is read, such as skipping and repeated playback; that monitor the screen display memory and read aloud text as it is being displayed, or text at a designated location already displayed, in response to the user's cursor operations and the like; that interrupt reading in response to a key operation; and that automatically switch among sentence reading, on-yomi reading, and detailed reading (for example, "the ta of tanbo") according to the character type or the type of reading operation key. Instructions on how to read, such as skipping and repeated playback, are issued by the user operating keys that are directly associated with those instructions in advance, so that a visually impaired user can give them immediately. The reading function basically exists so that the user can grasp where text is displayed on a screen that he or she cannot see, and the portion read aloud is text that is about to be displayed on the screen or is currently displayed on it.
[0006]
[Problems to be solved by the invention]
In general, when reading a document that one wants to consult carefully, reading at about the same speed as reading aloud causes little stress, but when skimming or visually searching, the eyes normally move much faster than the speed of reading aloud. Accordingly, when text in a document is read aloud with synthesized speech, it is more natural to use a brisker reading style for a document that is being skimmed than for a document that is being consulted carefully.
[0007]
In addition, what the reader pays attention to, or focuses on, in the text of a document differs between careful reading and skimming. For example, a reader who is reading carefully tends to read every part of the document, whereas a reader skimming a long document pays attention to its rough structure, and a reader visually searching many documents focuses on specific features such as the style of each document.
[0008]
In the present invention, the part of the presented document's text that is read aloud is switched according to how the document is being referred to. For example, when the document is read carefully, all of its text is read aloud in order, and when it is skimmed, the text that is key to the document structure is read aloud first.
[0009]
Until now, a speech reading function has not been used in such a way that the reading style and the portion read aloud by the synthesized speech are determined by how the document is being referred to, such as skimming or careful reading.
[0010]
In a conventional TTS system, the reading style of the synthesized speech is applied uniformly to the entire text that the user has designated for reading, in accordance with settings such as speech rate and pitch height made in advance. Once the user explicitly changes the reading style setting, reading continues in that same style until the user changes the setting again. This treats speech synthesis merely as a media conversion function from text to speech and simply leaves to the user the setting of the parameters that can be varied when generating the synthesized speech.
[0011]
The same is true of screen reading for the visually impaired: a reading style is associated in advance with each operation key, and the user merely specifies the reading position with the cursor and selects the reading style with a reading operation key.
[0012]
[Means for Solving the Problems]
The invention is characterized by comprising: reference level determination means that detects at least one of the time interval between a user's presentation requests for document information and the input frequency of those presentation requests, and obtains a reference level such that the shorter at least one of the time interval and the input frequency of the presentation requests is, the lower the user's reference level for the document information; and speech synthesis means that determines a reading style from the reference level obtained by the reference level determination means and, using a control rule set corresponding to the determined reading style, converts at least the text information in the document information into speech and outputs it.
[0013]
Next, the invention is characterized by comprising: display means that displays the document information being referred to; reference level determination means that obtains a reference level such that the larger the ratio between the number of characters, number of lines, or display area that the display means can show at one time and the number of characters or lines of the text contained in the document information being referred to or the display area of that document information, or the larger the size of the displayed characters, the lower the user's reference level for the document information; and speech synthesis means that determines a reading style from the reference level obtained by the reference level determination means and, using a control rule set corresponding to the determined reading style, converts at least the text information in the document information into speech and outputs it.
[0014]
The invention is also characterized by comprising: display means that displays the document information being referred to; reference level determination means that obtains a reference level such that the smaller the ratio between the number of characters, number of lines, or display area that the display means can show at one time and the number of characters or lines of the text information contained in the document information being referred to or the display area of that document information, or the smaller the size of the displayed characters, the lower the user's reference level for the document information; and speech synthesis means that determines a reading style from the reference level obtained by the reference level determination means and, using a control rule set corresponding to the determined reading style, converts at least the text information in the document information into speech and outputs it.
[0018]
Next, the invention is characterized by comprising: user input means that accepts a character string entered by the user; character string search means that searches the document information being referred to, using the character string entered through the user input means as the search string; reference level determination means that obtains a reference level such that the more locations at which the character string search means finds the search string in the document information being referred to, the higher the reference level; and speech synthesis means that determines a reading style from the reference level obtained by the reference level determination means and, using a control rule set corresponding to the determined reading style, converts at least the text information in the document information into speech and outputs it.
[0019]
Further, the invention is characterized by comprising: user input means that accepts a word entered by the user; a synonym dictionary that provides synonyms of the word entered through the user input means; character string search means that searches the document information being referred to, using the synonyms as search strings; reference level determination means that obtains a reference level such that the more locations at which the character string search means finds the search strings in the document information being referred to, the higher the reference level; and speech synthesis means that determines a reading style from the reference level obtained by the reference level determination means and, using a control rule set corresponding to the determined reading style, converts at least the text information in the document information into speech and outputs it.
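The search-based determination in the two preceding paragraphs can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names, the `SYNONYMS` dictionary contents, the clamping to `max_level`, and the choice to search for the entered word together with its synonyms are all hypothetical and are not taken from the specification, which says only that more detected locations should give a higher reference level.

```python
# Hypothetical synonym dictionary: entered word -> list of synonyms.
SYNONYMS = {"speech": ["voice"]}


def count_hits(document: str, terms: list[str]) -> int:
    """Count non-overlapping occurrences of every search term in the document."""
    return sum(document.count(term) for term in terms)


def reference_level(document: str, word: str, max_level: int = 3) -> int:
    """Map the number of hits to a reference level (more hits -> higher level).

    Assumption: the level is simply the hit count clamped to max_level.
    """
    terms = [word, *SYNONYMS.get(word, [])]
    return min(count_hits(document, terms), max_level)


if __name__ == "__main__":
    doc = "speech synthesis turns text into speech; voice output follows"
    print(reference_level(doc, "speech"))
```

Any monotonically non-decreasing mapping from hit count to level would satisfy the description equally well; clamping is used here only to keep the number of levels finite.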
[0020]
Furthermore, the invention is characterized by comprising: a step of detecting at least one of the user's reference time and reference speed for document information; a step of obtaining a reference level such that the shorter at least one of the reference time and the reference speed is, the lower the user's reference level for the document information; and a step of determining a reading style from the obtained reference level and, using a control rule set corresponding to the determined reading style, converting at least the text information in the document information into speech and outputting it.
[0021]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 shows a basic configuration of an audio output device 10 of the present embodiment.
The audio output device 10 comprises a user input unit 20, a document input unit 30, a speech synthesis unit 40, and a document reference level management unit 50.
[0022]
FIG. 2 shows a more detailed configuration diagram. Each device will be described below.
Processing in the document input part
The document input unit 30 includes a document extraction unit 32 and a document storage unit 34.
The document handled by the document input unit 30 includes text, and the speech synthesis unit 40 reads the text in the document.
[0023]
Note that "text" means characters or symbols, including numerals. A "document" is assumed to contain text, but it may also be a multimedia document that includes images, audio data, and the like in addition to text. It may also be a hyperdocument such as an HTML (HyperText Markup Language) document or an online document (online help).
[0024]
The document extraction unit 32 extracts a document from the document storage unit 34.
The document storage unit 34 is a storage device such as a disk or a memory that holds documents, and is provided inside the document input unit 30. It may also be located externally and connected via a network, with the documents managed by network addresses or document search paths.
[0025]
In the document input unit 30, when a user input requesting presentation of a new document is conveyed from the user input unit 20, the document extraction unit 32 extracts the corresponding document from the document storage unit 34.
[0026]
It is assumed that a method for accessing a document is specified as an instruction content from the user in the form of an address, a file search path, or the like. As such a document management method, a method already realized by a network file system or WWW (World Wide Web) can be used.
[0027]
Normal processing in the speech synthesizer
The speech synthesis unit 40 includes a text analysis unit 42, a reading style management unit 44, and a speech synthesizer 46.
[0028]
In the speech synthesis unit 40, when the document extraction unit 32 extracts a new document, the document, or the text it contains, is sent to the text analysis unit 42; the analysis result is passed to the reading style management unit 44 and then to the speech synthesizer 46 in turn, and the text is read aloud as synthesized speech. The "reading portion" is the part of the text in the document that the speech synthesizer 46 reads aloud as synthesized speech.
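The three-stage flow through units 42, 44, and 46 can be sketched as a simple pipeline. The sketch below is not the implementation described in the specification: the function names, the `ControlParams` fields, the style names "skim" and "careful", and the numeric values are all hypothetical, and the final stage returns descriptive strings instead of audio.

```python
from dataclasses import dataclass


@dataclass
class ControlParams:
    text: str
    rate: float         # relative speech rate (hypothetical unit)
    pitch_range: float  # relative width of pitch movement (hypothetical unit)


def analyze_text(text: str) -> list[str]:
    """Stand-in for text analysis unit 42: split into whitespace-delimited units."""
    return text.split()


def apply_style(units: list[str], style: str) -> list[ControlParams]:
    """Stand-in for reading style management unit 44: attach control parameters."""
    # Assumed values: faster and flatter when skimming, slower and livelier otherwise.
    rate, pitch_range = (1.5, 0.7) if style == "skim" else (1.0, 1.2)
    return [ControlParams(unit, rate, pitch_range) for unit in units]


def synthesize(params: list[ControlParams]) -> list[str]:
    """Stand-in for speech synthesizer 46: describe the output instead of producing audio."""
    return [f"{p.text}@rate={p.rate},pitch_range={p.pitch_range}" for p in params]


def read_aloud(text: str, style: str) -> list[str]:
    """Run the text through the three stages in order."""
    return synthesize(apply_style(analyze_text(text), style))


if __name__ == "__main__":
    print(read_aloud("hello world", "skim"))
```

Keeping the style decision in the middle stage means the analysis and waveform stages do not need to know why a particular style was chosen.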
[0029]
Existing TTS (Text-To-Speech, that is, speech synthesis from text) technology can be used for the series of processing from the text analysis unit 42 to the speech synthesizer 46.
[0030]
One example is Sato et al., "Speech synthesis from Japanese text," Research Practical Application Report, Vol. 32, No. 11, pp. 2243-2252 (1983). Specifically, the processing is as follows.
[0031]
Speech synthesis consists of text analysis; prosodic control of pitch, power, speech rate, and the like; phonological control such as the selection and concatenation of speech units and sound quality control; and waveform generation. Of these, the prosodic and phonological control corresponds to the processing of the reading style management unit 44, and waveform generation corresponds to the processing performed by the speech synthesizer 46.
[0032]
This is described in more detail below.
(Text analysis unit 42)
The text analysis unit 42 analyzes the text in the document and determines the readings and the units of prosodic control to which accent and intonation are assigned. These are determined by rule from the reading and accent information held for each vocabulary item in the dictionary used for language analysis, and from grammatical information such as part of speech and conjugation.
[0033]
(Reading style management unit 44)
The reading style management unit 44 controls the prosodic feature parameters of the synthesized speech, such as pitch, speech rate, and power, and the phonological feature parameters that determine sound quality, such as the selection of speech units and spectral deformation, so as to produce a tone of voice such as an announcement, recitation, or conversational tone, fast or slow speech, or a high or low voice.
[0034]
"Reading style" refers to a "tone of voice" such as an announcement, recitation, or conversational tone, fast or slow speech, or a high or low voice. Concretely, the differences are realized by changing the control of prosodic feature parameters such as voice pitch, speech rate, and power, and of the spectral feature parameters of voice quality.
[0035]
The reading style management unit 44 controls the prosodic or phonological feature parameters using the control rule set that corresponds to the designated reading style. These control rules are derived by collecting a large amount of natural speech uttered in the desired reading style and analyzing its prosodic or phonological feature parameters.
[0036]
An example of such rule derivation is Sakata et al., "Prosodic features and synthesis of conversational speech," IEICE Technical Report, SP95-17, pp. 55-62 (May 1995).
[0037]
Specifically, a lot of dialogue voice utterance data is collected, a control parameter of the voice pitch is analyzed, and a control rule is derived by determining an appropriate value as a parameter to be given to the pitch control model at the time of synthesis. These parameter values are determined appropriately for the same reading style depending on the position in the sentence and the difference in accent type. This correspondence is called a pitch control rule.
[0038]
As shown in FIG. 3, the rules derived for each reading style are stored in the reading style management unit 44 in the form of a table that pairs them with the reading styles, and designating a reading style causes the corresponding rules to be applied at synthesis time. At synthesis time, text analysis of the input text yields the position in the sentence and the accent type, the parameters corresponding to each are obtained from the above rules, and synthesized speech can then be generated.
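A table of this kind can be sketched as a nested lookup: the reading style selects a rule set, and the position in the sentence and the accent type select the parameters within it. The sketch below is purely illustrative; the style names, position labels, accent type numbers, parameter names, and all numeric values are hypothetical and do not come from FIG. 3 or from the cited analyses.

```python
# Hypothetical pitch control rules:
# reading style -> (position in sentence, accent type) -> pitch model parameters.
PITCH_RULES = {
    "conversational": {
        ("initial", 0): {"baseline_hz": 120.0, "accent_gain": 0.5},
        ("initial", 1): {"baseline_hz": 120.0, "accent_gain": 0.8},
        ("final", 0): {"baseline_hz": 100.0, "accent_gain": 0.3},
    },
    "announcement": {
        ("initial", 0): {"baseline_hz": 130.0, "accent_gain": 0.4},
        ("initial", 1): {"baseline_hz": 130.0, "accent_gain": 0.6},
        ("final", 0): {"baseline_hz": 110.0, "accent_gain": 0.2},
    },
}


def pitch_params(style: str, position: str, accent_type: int) -> dict[str, float]:
    """Look up the pitch parameters for one prosodic unit under the given style."""
    return PITCH_RULES[style][(position, accent_type)]


if __name__ == "__main__":
    print(pitch_params("conversational", "initial", 1))
```

Switching the reading style then only changes which inner table is consulted; the analysis results used as keys stay the same.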
[0039]
Although the pitch control has been described here, the same method can be applied to other parameters.
Processing in document reference level management
The document reference level management unit 50 includes a correspondence table 52 and a level determination unit 54.
[0040]
In the document reference level management unit 50, as shown in FIG. 2, the level determination unit 54 monitors the input of requests to present new documents and determines the user's reference level for the document.
[0041]
Note that the "reference level" indicates how the user is reading a document; it expresses in graded steps states such as skimming or reading carefully. The reference level is not necessarily determined by the present instant alone; rather, it is the state of reading observed over a short span of time. For example, it covers situations in which the user keeps the same document without changing its presentation state (without turning pages, scrolling, or enlarging), or in which the user frequently requests that new documents be presented one after another.
[0042]
There are the following methods for determining the reference level.
(1) First method
As shown in FIG. 4 (example 1), the time T0 at which the request is input is stored in the correspondence table 52 every time a request for presenting a new document is input.
[0043]
When a request for presentation of a new document is input, the level determination unit 54 checks the difference from the time T1 at which presentation of the immediately preceding document was requested. If the time difference T0−T1 is longer than a certain threshold Tth, it judges that the user is reading carefully; conversely, if it is shorter, it judges that the user is reading obliquely.
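The first method can be sketched roughly as below. The class name, the use of a plain list in place of the correspondence table 52, and the handling of the very first request and of an interval exactly equal to Tth are assumptions of this sketch; the specification only states the longer-than and shorter-than cases.

```python
class LevelDeterminer:
    """Judges the reference level from the interval between presentation requests."""

    def __init__(self, threshold_s):
        self.threshold_s = threshold_s  # Tth, in seconds
        self.request_times = []         # stands in for correspondence table 52

    def on_request(self, t0):
        """Record request time T0 and compare it with the previous time T1."""
        t1 = self.request_times[-1] if self.request_times else None
        self.request_times.append(t0)
        if t1 is None:
            return None  # no earlier request to compare against (assumption)
        # T0 - T1 longer than Tth: careful reading; otherwise oblique reading.
        return "careful" if (t0 - t1) > self.threshold_s else "oblique"
```

For example, with a threshold of 10 seconds, requests at 0 s, 3 s, and 30 s yield no judgment, "oblique", and "careful" in turn.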
[0044]
The flow of this process is shown in FIG. In FIG. 5, the flow from step 11 to step 14 is the process of performing speech synthesis from text, and the flow from step 21 to step 24 is the process of determining the document reference level. The document reference level determination process in step 231 or step 232 must be performed before the reading style change process corresponding to the document reference level in step 13; apart from that, the process flow from step 11 to step 14 and the process flow from step 21 to step 24 may be performed in parallel.
[0045]
(2) Second method
Since speech synthesis often takes a relatively long processing time, the entire text to be controlled need not be processed at once; instead, breaks may be set at punctuation marks or syntactic boundaries, and speech synthesis may be performed one unit at a time on the text up to each break. FIG. 6 shows the processing in that case.
[0046]
In the text analysis process (step 12) in FIG. 6, a process (step 122) for finding a break at the final stage of the analysis is performed. The process from step 122 to the control parameter generation process (step 13) according to the reading style corresponding to the reference level and the speech synthesis process according to the control parameter (step 14) are repeated until the entire text process is completed.
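The repeated break-by-break processing can be sketched as follows. Splitting on a fixed set of punctuation marks is a simplification chosen for this sketch (the specification also allows syntactic boundaries), and `current_level` and `synthesize` are hypothetical callbacks standing in for the level determination and the synthesis stages.

```python
import re

# Punctuation treated as a break; the exact set is an assumption of this sketch.
_BREAK_CHARS = ".!?;,。、"
_CHUNK_PATTERN = re.compile(r"[^%s]+[%s]?" % (_BREAK_CHARS, _BREAK_CHARS))


def split_at_breaks(text):
    """Split text into units that each end at a break (or at the end of the text)."""
    chunks = (match.strip() for match in _CHUNK_PATTERN.findall(text))
    return [chunk for chunk in chunks if chunk]


def read_aloud(text, current_level, synthesize):
    """Synthesize unit by unit, re-reading the reference level before each unit."""
    for chunk in split_at_breaks(text):
        level = current_level()   # picks up a level change made mid-text
        synthesize(chunk, level)  # control parameters would follow from the level
```

Because the level is queried once per unit, a change in the reference level takes effect at the next break rather than only after the whole text has been read.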
[0047]
(3) Third method
In addition to the method that focuses on the time difference from the previous document presentation request, a threshold may similarly be set for the input frequency, obtained as the number of document presentation requests input per unit time, to judge whether the user is reading obliquely or carefully.
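One possible reading of the third method is a count of requests inside a sliding time window. The window length, the count threshold, and the class name below are assumptions of this sketch.

```python
from collections import deque


class RequestFrequencyMonitor:
    """Judges the reference level from the number of requests per unit time."""

    def __init__(self, window_s, max_requests):
        self.window_s = window_s          # length of the "unit time" window
        self.max_requests = max_requests  # frequency threshold within the window
        self.times = deque()

    def on_request(self, now):
        """Record a presentation request at time `now` and return the judgment."""
        self.times.append(now)
        # Drop requests that have fallen out of the window.
        while self.times and now - self.times[0] > self.window_s:
            self.times.popleft()
        return "oblique" if len(self.times) > self.max_requests else "careful"
```

With a 60-second window and a threshold of three requests, a fourth request inside the window switches the judgment to "oblique", and it returns to "careful" once the earlier requests have aged out.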
[0048]
(4) Fourth method
As shown in (Example 2) of FIG. 4, the parameter p obtained by weighting and adding to the passage of time may be compared with a certain threshold value Pth.
[0049]
In this case, the start time and end time of the user's operation on the interface of the user input unit 20 are detected, and the operation duration is obtained from their difference. When the value exceeds a threshold, the reference level for the document may be determined to be low. This corresponds to lowering the reference level by treating a continuous operation on the interface of the user input unit 20 as a user input requesting browsing or quick reading of the document.
[0050]
Conversely, the operation duration time of the user may be obtained, and if the value exceeds a threshold value, it may be determined to increase the reference level for the document. This is equivalent to increasing the reference level by determining that a continuous operation on the interface of the user input unit 20 is a user input requesting a detailed reading of the document.
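Since the two preceding paragraphs allow a long operation to be interpreted in either direction, a sketch can make that interpretation a parameter. The function name, the "lower" / "raise" / "keep" return values, and the parameter names are assumptions of this sketch.

```python
def level_change_from_operation(start_s, end_s, threshold_s, long_operation_means="lower"):
    """Decide how to change the reference level from one operation's duration.

    `long_operation_means` selects the interpretation: "lower" treats a long
    continuous operation as a request for browsing or quick reading, while
    "raise" treats it as a request for detailed reading.
    """
    if long_operation_means not in ("lower", "raise"):
        raise ValueError("long_operation_means must be 'lower' or 'raise'")
    duration_s = end_s - start_s
    if duration_s > threshold_s:
        return long_operation_means
    return "keep"  # duration did not exceed the threshold (assumption: no change)
```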
[0051]
Furthermore, in addition to the user operation duration, a threshold may be set for the frequency per unit time of user operations on the interface of the user input unit 20, and the reference level may be set according to whether the frequency of user operations exceeds the threshold.
[0052]
Further, when the user input unit 20 has a plurality of interfaces, each interface may be treated as a user request for “fine reading”, “browsing”, or “fast reading”, and the corresponding reference level may be set.
[0053]
(5) Fifth method
The document reference level may be changed by direct input from the user in addition to the automatic determination described above. For example, a button or switch for switching between oblique reading and careful reading may be provided as an interface of the user input unit 20, and the document reference level switched by the user operating it.
[0054]
Alternatively, as in the interface shown in FIG. 7, intermediate levels may be provided between the situation where the user is reading carefully and the situation where the user is reading obliquely, and the user may adjust the level with a slider. FIG. 7A is an explanatory diagram of an interface for setting a document reference level by a slider, and FIG. 7B is an explanatory diagram of an interface for setting a document reference level by a button.
[0055]
In addition, a speech recognition unit may be provided as an interface of the user input unit 20. For example, when a speech recognition result such as “details” or “e?” is obtained, it may be determined to raise the reference level for the document, and when a speech recognition result such as “skip” is obtained, it may be determined to lower the reference level for the document.
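The voice-driven adjustment can be sketched as a small keyword mapping. Representing the reference level as an integer with a fixed range and a step of one is an assumption of this sketch; only the example words are taken from the text above.

```python
# Recognition results that raise or lower the reference level (from the examples above).
RAISE_WORDS = {"details", "e?"}
LOWER_WORDS = {"skip"}


def adjust_level(level, recognized, lowest=0, highest=4):
    """Return the new reference level after one speech recognition result.

    Levels are integers in [lowest, highest]; a higher value means more
    careful reading. Unrecognized words leave the level unchanged.
    """
    if recognized in RAISE_WORDS:
        return min(level + 1, highest)
    if recognized in LOWER_WORDS:
        return max(level - 1, lowest)
    return level
```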
[0056]
Processing by the document reference level in the speech synthesizer
Reflecting such a document reference level, the reading style management unit 44 in the speech synthesizer 40 generates a control parameter series using a control rule set corresponding to each document reference level.
[0057]
For example, when the document reference level is determined to be oblique reading, synthesized speech is generated using a control rule set that increases the speech rate or a control rule set that reduces the range of pitch variation. On the other hand, if it is determined that the user is reading carefully, synthesized speech is generated using a control rule set that increases the range of pitch and power variation. Of course, the reverse of this example is also possible: synthesized speech may be generated using a control rule set that increases the range of pitch and power variation when the reading is determined to be oblique and decreases it when the reading is determined to be careful.
[0058]
FIG. 3 shows an example in which the processing in the reading style management unit 44 is realized by a correspondence table between the document reference level and the reading style and a correspondence table between the reading style and the control rule set.
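The two-stage correspondence described above can be sketched with two dictionaries. The style names, the parameter names, and the scale factors are assumptions made for this sketch and are not the contents of FIG. 3.

```python
# Table 1: document reference level -> reading style (style names are illustrative).
LEVEL_TO_STYLE = {
    "oblique": "fast_flat",
    "careful": "slow_expressive",
}

# Table 2: reading style -> control rule set, expressed as scale factors
# relative to a neutral voice (values are illustrative).
STYLE_TO_RULES = {
    "fast_flat": {"speech_rate": 1.5, "pitch_range": 0.7, "power_range": 1.0},
    "slow_expressive": {"speech_rate": 1.0, "pitch_range": 1.3, "power_range": 1.3},
}


def control_rule_set(reference_level):
    """Look up the control rule set for a reference level via the reading style."""
    style = LEVEL_TO_STYLE[reference_level]
    return STYLE_TO_RULES[style]
```

Keeping the two tables separate means that the reverse assignment mentioned above only requires swapping the entries of the first table.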
[0059]
As shown in the flowchart of FIG. 8, the text analysis unit 42 in the speech synthesizer 40 can determine whether a word carries a small amount of information by using grammatical information, such as part of speech and conjugation, from the dictionary used at the time of language analysis. In this example, each word is classified as an independent word or an attached (dependent) word; an independent word is determined to carry a large amount of information, and an attached word a small amount. Further, even an independent word may be determined to carry a small amount of information if its own meaning is weak, such as a noun or pronoun used adverbially.
[0060]
If it is determined that the word has a small amount of information, as shown in the flowchart of FIG. 9, the reading style management unit 44 changes the reading style so as to increase the utterance speed, and synthesized speech is generated accordingly. Alternatively, the text analysis unit 42 may determine the prosodic control units, which assign reading, accent, and intonation, for the phrase sequence with unimportant words excluded, and synthesized speech may be generated from that sequence.
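The information-amount judgment and the resulting speed change can be sketched as follows. The part-of-speech labels, the set of parts of speech treated as attached words, and the speed-up factor are assumptions of this sketch; a real system would take the tags from its analysis dictionary.

```python
# Parts of speech treated as attached (dependent) words in this sketch.
ATTACHED_POS = {"particle", "auxiliary_verb"}


def has_small_information(pos, weak_meaning=False):
    """Attached words, and independent words flagged as weak, carry little information."""
    return pos in ATTACHED_POS or weak_meaning


def speech_rates(tagged_words, base_rate=1.0, speedup=1.5):
    """Assign a speech rate to each (word, part_of_speech) pair.

    Words judged to carry little information are spoken faster.
    """
    return [
        (word, base_rate * speedup if has_small_information(pos) else base_rate)
        for word, pos in tagged_words
    ]
```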
[0061]
The structure of the text in a document can be obtained by having the text analysis unit 42 of the speech synthesis unit 40 analyze the structure of a document in which structuring tags, such as those of HTML, SGML, or TeX, are embedded. In this case, the structure is easily obtained by searching for the structuring tags. The portion of the text to be read aloud can then be changed according to the document reference level managed by the document reference level management unit 50.
[0062]
For example, as shown in the flowchart of FIG. 10, when the document reference level is oblique reading, the text at the upper levels of the structure can be read aloud in order, so that the structure of the document is presented by voice.
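For an HTML document, the upper-level text can be collected with the standard library parser, as in the sketch below. Treating only the h1 to h3 elements as structuring tags is an assumption of this sketch; SGML and TeX sources would need their own tag search.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects (depth, title) pairs from heading tags in document order."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._buffer = None  # text of the heading currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._buffer is not None:
            self.headings.append((int(tag[1]), "".join(self._buffer).strip()))
            self._buffer = None


def headings_to_read(html):
    """Return the headings that would be read aloud during oblique reading."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings
```

Reading only these titles in order gives the listener the outline of the document; the depth value could be used to limit reading to chapters or to include sections.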
[0063]
Further, when reading such structured text, for example chapter titles, the chapter titles may be read aloud sequentially in the scroll direction. When an instruction is input from the user input unit 20 while a chapter title is being read, the display may jump so that the body text of that chapter is visible, or the text of the sections forming the substructure of that chapter may be read aloud.
[0064]
The above process may be stored in a recording medium such as an FD or a CD-ROM, installed in a computer having a speech synthesizer, and the computer may execute the above process.
[0065]
Second embodiment
An audio output device 100 of the second embodiment is shown in FIG.
It differs from the audio output device 10 of the first embodiment in the addition of a document presentation unit 60 that presents documents to the user. The document presentation unit 60 presents the document received from the document withdrawal unit 32.
[0066]
“Presentation” is based on visual display using a display, but is not limited to visual display. For example, when audio data is included, the reproduction output is also regarded as a kind of presentation.
[0067]
“Presentation state” is the state in which the document content is being output to the user. For visual display, it is the display state at that moment: at what size the document is displayed, which part of the entire document is displayed, and which part is hidden. For audio data, it is the playback state at that moment: at what volume the audio data is being played and which part of the audio data is being played.
[0068]
When the document presentation unit 60 has a visual output unit such as a display, the document presentation unit 60 can change the presentation state such as enlargement or reduction in accordance with a user instruction. Further, when the document size does not fit in the display size, page turning or scrolling can be performed. These presentation states are changed via the user input unit 20.
[0069]
When the document presentation unit 60 reproduces audio data of a multimedia document, such a change in the presentation state corresponds to volume adjustment, rewinding, fast-forwarding, interruption, or the like.
[0070]
For example, the user input unit 20 may be keys, a device having a pointing function such as a mouse or a pen, or a device such as a tablet in which a touch panel is combined with a display. Alternatively, it may be implemented as a device on which a level can be set, such as a switch board or a dial.
[0071]
According to the present invention, it is possible to output by changing the reading style of the synthesized speech in accordance with the degree of interest of the user such as whether the user is reading the document obliquely or reading it carefully.
[0072]
When the document presentation unit 60 is provided as described above, the frequency of user inputs requesting a change in the presentation state of the document, such as enlargement, reduction, or scrolling, is examined, and if the input frequency is high, it can be determined that the user is referring to the same document carefully. Alternatively, the frequency of user inputs requesting a change in the presentation state such as scrolling or page turning is similarly examined, and if the input frequency is high, it can be determined that the user is reading obliquely. As in the method shown in FIG. 2, the input frequency and time can be examined, as shown in FIG. 11, using a correspondence table that records the times of inputs from the user to the user input unit 20.
[0073]
The above process may be stored in a storage medium such as an FD or a CD-ROM, installed in a computer having a speech synthesizer, and the computer may execute the above process.
Third embodiment
An audio output device 200 of the third embodiment is shown in FIG.
[0074]
In this audio output device 200, the document reference level management unit 50 examines the flow rate of document data while the document drawing unit 32 is drawing a document for new presentation, and if the flow rate per unit time is large, it can determine that the user is skipping through the document.
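A minimal sketch of this flow-rate check is given below. Measuring the flow in bytes per second and the function and parameter names are assumptions of this sketch; the specification does not state the unit.

```python
def is_skip_reading(bytes_drawn, elapsed_s, threshold_bytes_per_s):
    """Return True when the document data flow rate exceeds the threshold.

    `bytes_drawn` is the amount of document data drawn for presentation during
    `elapsed_s` seconds.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return bytes_drawn / elapsed_s > threshold_bytes_per_s
```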
[0075]
The above process may be stored in a storage medium such as an FD or a CD-ROM, installed in a computer having a speech synthesizer, and the computer may execute the above process.
[0076]
[Effect of the invention]
According to the present invention, a reading style suited to whether the user is reading carefully or obliquely is switched automatically on the basis of the user's operations on the document browser. In addition, oblique reading can be carried out quickly by changing the utterance speed of the synthesized speech during oblique reading. Furthermore, during oblique reading, the text can be read aloud with a larger tone component in the pitch control or with a higher baseline, so that the user can judge by ear, without visually checking, whether the document is in the browsing mode.
[0077]
Furthermore, by analyzing the structure of the document and examining the structure of its text, when it is determined that the reading is oblique, higher-level text such as chapter titles and section titles, including text that could not be displayed on the screen, is read aloud sequentially, so that the structure of the document can be conveyed by voice without checking the screen.
[Brief description of the drawings]
FIG. 1 is a block diagram of an audio output device according to a first embodiment.
FIG. 2 is a detailed block diagram of the audio output device according to the first embodiment.
FIG. 3 is a processing diagram in a reading style management unit.
FIG. 4 is a diagram illustrating determination of a document reference level in a level determination unit.
FIG. 5 is a flowchart of determination of a document reference level from an input time interval of a new document presentation request by a user.
FIG. 6 is a flowchart of determination of a document reference level from an input time interval of a new document presentation request by a user.
FIG. 7A is an explanatory diagram of an interface for setting a document reference level with a slider.
FIG. 7B is an explanatory diagram of an interface for setting a document reference level with a button.
FIG. 8 is a diagram showing determination of the information amount of words in the text to be read aloud, using a correspondence table between part of speech and information amount.
FIG. 9 is a flowchart of reading aloud using the determination result of the information amount of words in the text to be read aloud.
FIG. 10 is a flowchart of a reading portion for a structured document according to a document reference level.
FIG. 11 is a detailed block diagram of an audio output device according to a second embodiment.
FIG. 12 is a detailed block diagram of an audio output device according to a third embodiment.
[Explanation of symbols]
20 ... User input unit
30 ... Document input unit
32 ... Document drawing unit
34 ... Document storage unit
42 ... Text analysis unit
44 ... Reading style management unit
46 ... Synthesizer
50 ... Document reference level management unit
52 ... Correspondence table
54 ... Level determination unit

Claims

A speech synthesizing apparatus comprising: reference level determining means for detecting at least one of the time interval between the user's presentation requests for document information and the input frequency of the presentation requests, and for obtaining a reference level such that the user's reference level for the document information is lower as the at least one of the time interval of the presentation requests or the input frequency of the presentation requests is shorter; and speech synthesizing means for determining a reading style from the reference level obtained by the reference level determining means, and for converting at least text information in the document information into speech and outputting the speech using a control rule set corresponding to the determined reading style.

It has a display means for displaying the document information to be referred to, and the number of characters or lines or display area that can be displayed simultaneously on the display means and the number of characters or lines of text included in the document information to be referred to or the display area of the document information. The reference level determination means for obtaining the reference level so that the reference level for the document information of the user is lower as the ratio or the size of the display character is larger, and reading from the reference level obtained by the reference level determination means A speech synthesizer comprising speech synthesis means for determining a style and converting at least text information in the document information into speech using a control rule set corresponding to the determined reading style.

It has a display means for displaying the document information to be referenced, and the number of characters or lines or display area that can be displayed simultaneously on the display means and the number of characters or lines of text information contained in the document information to be referred to or the display area of the document information. The reference level determination means for obtaining the reference level so that the reference level for the document information of the user is lower as the ratio or the size of the displayed character is smaller, and the reading style is determined from the reference level obtained by the reference level determination means. A speech synthesizer comprising speech synthesis means for determining and converting at least text information in the document information into speech using a control rule set corresponding to the determined reading style.

User input means for receiving a character string input from a user, character string search means for searching in document information referring to a character string input from the user input means as a search target character string, and the character string search means Reference level determination means for obtaining a reference level so that the reference level becomes higher as the number of locations where the target character string is detected in the text information to be referred to is larger, and a reading style is determined from the reference level obtained by the reference level determination means. A speech synthesizer comprising speech synthesis means for determining and converting at least text information in the document information into speech using a control rule set corresponding to the determined reading style.

A speech synthesizer comprising: user input means for receiving a word input from a user; a synonym dictionary for obtaining synonyms of the word input from the user input means; character string search means for searching the document information being referred to, using the synonyms as search target character strings; reference level determination means for obtaining a reference level such that the reference level is higher as the number of locations at which the character string search means detects the search target character strings in the text information being referred to is larger; and speech synthesis means for determining a reading style from the reference level obtained by the reference level determination means, and for converting at least text information in the document information into speech using a control rule set corresponding to the determined reading style.

A speech synthesis method comprising: a step of detecting at least one of a reference time or a reference speed of the user with respect to document information; a step of obtaining a reference level such that the user's reference level for the document information is lower as the at least one of the reference time or the reference speed is shorter; and a step of determining a reading style from the obtained reference level, and converting at least text information in the document information into speech and outputting the speech using a control rule set corresponding to the determined reading style.