JP3864197B2

JP3864197B2 - Voice client terminal

Info

Publication number: JP3864197B2
Application number: JP04818098A
Authority: JP
Inventors: 育夫並木; 弘道林; 哲哉金丸; 常治木目田; 正美氏家
Original assignee: NTT Electronics Corp; Nippon Telegraph and Telephone Corp
Current assignee: NTT Electronics Corp; Nippon Telegraph and Telephone Corp
Priority date: 1998-02-27
Filing date: 1998-02-27
Publication date: 2006-12-27
Anticipated expiration: 2018-02-27
Also published as: JPH11249867A

Description

[0001]
[Technical field of the invention]
The present invention relates to a voice client terminal, and more particularly to a voice client terminal that, in a client/server system composed of computers and a network, in particular the World Wide Web (hereinafter simply WWW) system on the Internet, accepts voice input from a microphone of the client terminal and outputs the information stored on a server by voice.
[0002]
[Prior art]
As is well known, when the server and client hardware and software of a WWW system are properly configured on a network, the text and image information stored on the server can be displayed and browsed on the client screen by using a browser, such as Netscape Navigator, installed on the client terminal.
[0003]
In this system, when specific information on the screen is selected with a mouse or the like, the information associated with it is accessed and can be displayed on the screen and browsed (hereinafter, this association is referred to as a link, and the specific information that was selected is referred to as a link item).
These services presuppose that information is received visually, so they have the drawback that they cannot be used without looking at the screen and cannot be used at all by visually impaired persons. One way to address this is to use recent speech recognition and speech synthesis technology, which makes it possible to input by voice from a microphone and to output by synthesized speech. For example, if [Prime Minister's Official Residence] is input by voice, the information for [Prime Minister's Official Residence] is accessed, and its text portion can be output as synthesized speech from the speaker of the client terminal.
[0004]
[Problems to be solved by the invention]
However, with the conventional method described above, typical WWW information may contain long passages and as many as 10 or 20 links scattered throughout. Color image information is of course mixed in with the text, links to video are provided, and in practice visually oriented information is used abundantly. The problem is how to output such information to visually impaired persons.
[0005]
The present invention has been made in view of the above points, and its object is to provide a voice client terminal with which even visually impaired persons can obtain WWW information.
[0006]
[Means for Solving the Problems]
FIG. 1 is a diagram of the principle configuration of the present invention.
The present invention (claim 1) is a voice client terminal 200 that acquires, via the Internet, an HTML file stored on a server and outputs it by voice, comprising:
voice input means 201 for inputting a request spoken by a user;
request issuing means 202 for transmitting the input user's voice to a speech recognition server, receiving a speech recognition result from the speech recognition server, extracting a URL from the received speech recognition result, and requesting an HTML file from a proxy server based on the URL;
means for analyzing the HTML file obtained from the proxy server and classifying it into a title, link items, and a body;
means for transmitting the classified link items to the speech recognition server;
means for transmitting the displayed text information, including the title, link items, and body, to a speech synthesis server and receiving synthesized speech data from the speech synthesis server; and
voice output means 203 for using the received speech data to read aloud the title, the link items, and the body in that order.
[0009]
As described above, the present invention is a system that converts files in HTML (Hyper Text Markup Language) format published on the Internet from visual information into audio information through a commercially available Web browser and provides them to the user. In addition, because voice is used when information is acquired on the client side, visually impaired persons can also operate it.
[0010]
[Embodiments of the invention]
FIG. 2 shows the configuration of a system to which the present invention is applied.
In the system shown in the figure, the processing engines are placed on a high-speed network and the load is distributed among them so that the client terminal 10 can obtain fast responses. The system in the figure is broadly divided into two parts.
[0011]
The first is the front-end processing section composed of the workstations 20, 30, and 40 in the figure. The workstations 20 and 30 are systems that provide a translation service. The workstation 40 provides a function commonly used on the Internet, and in this system it is used mainly for data caching and kanji code conversion.
[0012]
The second is the workstations 50 and 60, which are used as the back end. The workstation 50 compares the voice data that the user input at the client terminal 10 against a candidate list (the list of link items) and selects the appropriate item. To remove the need for the user's spoken input to match a link item exactly, morphological analysis is applied to the link items. As a result, the appropriate link item can be inferred and selected even when the user speaks only a fragment of it. The workstation 60 is what is called the speech synthesis engine: it receives the text information extracted by the client terminal 10 (together with a language-type parameter) and generates speech data.
[0013]
Next, an HTML file request transmitted from the client terminal 10 passes through the proxy server running on the workstation 20 and is forwarded to the external Internet 70 by the proxy server of the workstation 40. The response data (HTML file) returned from the Internet 70 passes through the proxy server of the workstation 40; the proxy server of the workstation 20 then asks each translation engine to process the data and forwards the data to the client terminal 10 (only the request is made at this point; the processing result is sent to the client terminal 10 whenever the user asks for it). The HTML file that arrives at the client terminal 10 is analyzed, and the text information to be displayed is sent to the speech synthesis engine of the workstation 60, converted into speech data, and output at the client terminal 10. The list of link items needed for speech recognition is subjected to morphological analysis by the morphological analysis engine of the workstation 50, passed to the speech recognition engine, and held there awaiting voice data from the client terminal 10.
[0014]
Next, the voice data input (spoken as an instruction) at the client terminal 10 is sent to the speech recognition engine 50, which compares it against the previously registered link items (text information) and obtains the appropriate result. The result is sent to the client terminal 10. The client terminal 10 then extracts the URL from the link item and acquires the next information from the Internet 70 via the proxy servers 20 and 40.
[0015]
[Examples]
Embodiments of the present invention will be described below with reference to the drawings.
FIG. 3 shows the system configuration of an embodiment of the present invention. In the system shown in the figure, components identical to those in FIG. 2 are given the same reference numerals.
The system shown in the figure comprises a client terminal 10 and workstations 20, 30, 40, and 50.
[0016]
The client terminal 10 comprises a Web browser 11, a voice input button monitoring program 12, a voice browser client 13, a voice input start button 15 connected to the voice input button monitoring program 12, and a speaker 16 and a microphone 17 connected to the voice browser client 13.
The workstation 20 has a function of the proxy server 21 and an English-Japanese translation engine 22 that translates text from English to Japanese.
[0017]
The workstation 30 has a Japanese-English translation engine 31 that translates text from Japanese to English.
The workstation 40 has a function of a proxy server 41.
The workstation 50 includes a speech recognition I/F program 51, a speech recognition engine 52, and a morphological analysis engine 53.
[0018]
The workstation 60 has a speech synthesis engine 61.
The Web browser 11 is a commonly used one, for example Netscape Navigator, and this embodiment is described using that browser. The Web browser 11 serves as the window to the Internet 70, acquires the necessary information, and passes it to the voice browser client 13. It also displays the information in the Web browser. The voice browser client 13 analyzes the information obtained from the Web browser 11 and sends the text information to be read aloud to the speech synthesis engine 61 of the workstation 60, and the link item information to be recognized to the speech recognition I/F program 51 of the workstation 50.
[0019]
The client terminal 10 plays the speech data received from the speech synthesis engine 61 through the speaker 16 while recording it to the local disk. User input begins when the user presses the voice input start button 15, whereupon the voice input button monitoring program 12 notifies the voice browser client 13. On receiving the notification, the voice browser client 13 starts recording from the microphone 17. When the user releases the voice input start button 15, the voice browser client 13 stops recording and transmits the recorded voice data to the speech recognition I/F program 51 of the workstation 50.
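The press-to-record, release-to-send behavior described here can be sketched as a small state holder. This is a minimal illustration only: the class name `PushToTalkRecorder`, its callback methods, and the `send` callable are hypothetical and do not come from the patent, and audio capture is reduced to byte strings handed in by the caller.

```python
class PushToTalkRecorder:
    """Collect audio only while the button is held, then hand it off on release.

    Hypothetical sketch: `send` stands in for whatever transmits the recording.
    """

    def __init__(self, send):
        self._send = send
        self._recording = False
        self._frames = []

    def on_button_down(self):
        # Button pressed: start a fresh recording.
        self._recording = True
        self._frames = []

    def on_audio_frame(self, frame):
        # Frames that arrive while the button is not held are ignored.
        if self._recording:
            self._frames.append(frame)

    def on_button_up(self):
        # Button released: stop recording and transmit what was captured.
        if not self._recording:
            return
        self._recording = False
        self._send(b"".join(self._frames))


sent = []
recorder = PushToTalkRecorder(sent.append)
recorder.on_audio_frame(b"ignored")   # button not yet pressed
recorder.on_button_down()
recorder.on_audio_frame(b"ab")
recorder.on_audio_frame(b"cd")
recorder.on_button_up()
```

Separating the button events from the audio callback mirrors the split in the text between the button monitoring program and the voice browser client.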
[0020]
The proxy server 21 of the workstation 20, which is connected to the client terminal 10, sends the information to be forwarded from the Internet 70 (HTML file) to each translation engine (English-Japanese translation engine 22, Japanese-English translation engine 31) and has it translated. The translation results are stored in the storage device of each translation engine 22, 31, and when the user requests a translation, the result is sent to the client terminal 10.
[0021]
The proxy server 41 of the workstation 40 has functions such as partially converting information from the Internet 70 (kanji codes and the like) and temporarily caching information.
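As a rough illustration of a proxy that converts kanji codes and caches responses, the sketch below wraps a fetch function. The names `ConvertingCache` and `fetch` are hypothetical, and the choice of UTF-8 as the target encoding is an assumption made for the example; the patent does not state which encodings are converted.

```python
class ConvertingCache:
    """Cache fetched pages after converting them to one character encoding.

    Hypothetical sketch: `fetch` is any callable that returns
    (raw_bytes, source_encoding_name) for a URL.
    """

    def __init__(self, fetch, target_encoding="utf-8"):
        self._fetch = fetch
        self._target = target_encoding
        self._cache = {}
        self.fetch_count = 0

    def get(self, url):
        if url not in self._cache:
            raw, encoding = self._fetch(url)
            self.fetch_count += 1
            # Decode from the source kanji code, re-encode in the target one.
            self._cache[url] = raw.decode(encoding).encode(self._target)
        return self._cache[url]


def fake_fetch(url):
    # Stand-in for a real network request; returns Shift_JIS bytes.
    return "首相官邸".encode("shift_jis"), "shift_jis"


proxy = ConvertingCache(fake_fetch)
first = proxy.get("http://example.com/")
second = proxy.get("http://example.com/")
```

The second call is served from the cache, so the origin is contacted only once.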
The speech recognition I/F program 51 of the workstation 50 inputs the link items sent from the voice browser client 13 to the morphological analysis engine 53 and, based on the parts of speech of the decomposed character strings that are output, performs appropriate recombination. It registers the result in the speech recognition engine 52, passes the recorded voice data sent from the voice browser client 13 to the speech recognition engine 52, and returns the result matched by the speech recognition engine 52 to the voice browser client 13.
[0022]
The speech synthesis engine 61 of the workstation 60 receives the text information to be read aloud that was extracted by the voice browser client 13, generates speech data, and returns it to the voice browser client 13.
FIG. 4 is a sequence chart of the operation of one embodiment of the present invention.
First, when the user presses the voice input start button 15 (step 101), the voice input button monitoring program 12 notifies the voice browser client 13. The voice browser client 13 starts recording from the microphone 17 and waits for the end of input. When the user releases the button (step 102), the voice input button monitoring program 12 again notifies the voice browser client 13, this time of the end of input. On receiving this notification, the voice browser client 13 stops recording and transmits the recorded voice data to the speech recognition I/F program 51 of the workstation 50 (step 103).
[0023]
The speech recognition I/F program 51 of the workstation 50 receives the data, transfers it to the speech recognition engine 52 (step 104), and requests recognition processing. When the speech recognition I/F program 51 obtains the result of recognition by the speech recognition engine 52 (step 105), the recognition result is transferred to the voice browser client 13 (step 106). The voice browser client 13 obtains the URL from the link item in the result and instructs the Web browser 11 to send a request (step 107).
[0024]
The Web browser 11 sends a data request message for the specified URL to the Internet 70 via the proxy servers 21 and 41 (step 108).
When the proxy server 21 obtains the response to the request from the Internet 70 via the proxy server 41 (step 109), it sends the response to the English-Japanese translation engine 22 or the Japanese-English translation engine 31 (step 110).
[0025]
Either the English-Japanese translation engine 22 or the Japanese-English translation engine 31 performs the processing instructed by the proxy server 21 and returns the result to the proxy server 21. The proxy server 21 then transmits the response data to the Web browser 11 of the client terminal 10 (step 111).
When the Web browser 11 of the client terminal 10 receives the response data, the data is passed to the voice browser client 13 (step 112). If the page is multi-frame, the Web browser 11 repeats the above processing for each constituent view. The voice browser client 13 analyzes the acquired response data, obtains the link items, the text information to be displayed, and so on, and sends the text to the speech synthesis engine 61 of the workstation 60 and the link items to the speech recognition I/F program 51 of the workstation 50 (step 113). During this HTML analysis, if text information such as a description is attached to image (picture) information, that information is also processed appropriately and sent to the speech synthesis engine 61, so that the details of the image are read aloud to the user through the speaker 16. Text sent to the speech synthesis engine 61 is sent one sentence at a time, by language, which makes it possible to offer the user suitable services (such as rewinding or fast-forwarding one sentence at a time). Dynamically changing recognition candidate list information, such as link items, is sent to the speech recognition I/F program 51 (step 114). Fixed commands are not sent each time.
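Sending text one sentence at a time is what makes sentence-level rewind and fast-forward possible. The sketch below shows one way this could look; the function `split_sentences`, the class `SentenceCursor`, and the set of sentence-ending punctuation marks are assumptions for illustration and are not specified in the patent.

```python
import re


def split_sentences(text):
    """Split text after Japanese or Western sentence-ending punctuation.

    Naive by design: it also splits at periods inside numbers or abbreviations.
    """
    parts = re.split(r"(?<=[。．.!?！？])\s*", text)
    return [part for part in parts if part]


class SentenceCursor:
    """Track the playback position in units of one sentence."""

    def __init__(self, sentences):
        self.sentences = sentences
        self.index = 0

    def current(self):
        return self.sentences[self.index]

    def fast_forward(self):
        # Move one sentence ahead, stopping at the last sentence.
        self.index = min(self.index + 1, len(self.sentences) - 1)
        return self.current()

    def rewind(self):
        # Move one sentence back, stopping at the first sentence.
        self.index = max(self.index - 1, 0)
        return self.current()


sentences = split_sentences("今日は晴れ。明日は雨。Hello world. Bye!")
cursor = SentenceCursor(sentences)
```

Because each sentence would be synthesized as its own unit of speech data, moving the cursor only changes which unit is played next.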
[0026]
The speech recognition I/F program 51 first subjects the received link items to morphological analysis by the morphological analysis engine 53 (steps 115 and 116), registers them together with the fixed commands in the speech recognition engine 52 (step 117), and waits for voice data from the user. This registration is needed so that the items can be compared against the voice data sent from the user.
[0027]
The speech synthesis engine 61 synthesizes speech from the text it has received and transmits the synthesized speech data to the voice browser client 13 (step 118).
Next, a specific example will be described.
FIG. 5 is a diagram illustrating a user interface according to an embodiment of the present invention, and FIG. 6 is a diagram illustrating an example of reading aloud by a voice browser according to an embodiment of the present invention.
[0028]
FIG. 5 shows the user interface of the Web browser 11. On the page, a title 110, link items 130, and a body 120 are displayed as text information. This information is obtained from a file written according to HTML grammar; the voice browser client 13 analyzes the file and classifies its content into items such as title, link items, and body. The output information is then read aloud as shown in FIG. 6. Appropriate guidance is added so that the user receives detailed information.
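The classification into title, link items, and body can be sketched with the Python standard library's `html.parser`. This is a minimal illustration under stated assumptions: the class `PageClassifier`, the function `reading_order`, and the guidance prefixes "Title:" and "Link:" are hypothetical, since the wording of the guidance in FIG. 6 is not reproduced in the text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageClassifier(HTMLParser):
    """Split an HTML page into a title, link items, and body text."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []        # (link text, absolute URL) pairs
        self.body = []         # body text fragments in document order
        self._in_title = False
        self._href = None      # URL of the <a> element currently open
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._link_text = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            text = "".join(self._link_text).strip()
            if text:
                self.links.append((text, self._href))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif self._href is not None:
            self._link_text.append(data)
        elif data.strip():
            self.body.append(data.strip())


def reading_order(html, base_url=""):
    """Return the parser and the lines to read: title, then links, then body."""
    parser = PageClassifier(base_url)
    parser.feed(html)
    parser.close()
    lines = ["Title: " + parser.title]
    lines += ["Link: " + text for text, _ in parser.links]
    lines += parser.body
    return parser, lines


page = ('<html><head><title>News</title></head><body>'
        '<p>Hello world.</p><a href="/pm">Prime Minister</a></body></html>')
parser, lines = reading_order(page, "http://example.com/index.html")
```

Keeping the URL alongside each link text means that a recognized link item can be turned directly into the next request.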
[0029]
When specifying a link item by voice, the user does not need to speak the full text of the link item to the voice browser client 13; access is possible by speaking only a word that stuck in the user's mind. The mechanism is as follows: the full text of the link item is first input to the morphological analysis engine 53, and the resulting parts-of-speech decomposition is recombined. By rebuilding compound words and the like from single words, the smallest elements, the system can handle user input ranging from single words to compound words.
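The recombination step can be approximated by registering every contiguous run of morphemes as a recognition candidate. The sketch below assumes the link items have already been segmented by some morphological analyzer and, unlike the description above, ignores part-of-speech information; `build_candidates` and `select_link` are hypothetical names.

```python
def build_candidates(link_items):
    """Map every contiguous run of morphemes to the link items containing it.

    `link_items` maps the full link text to its list of morphemes.
    """
    table = {}
    for item, morphemes in link_items.items():
        for start in range(len(morphemes)):
            for end in range(start + 1, len(morphemes) + 1):
                phrase = "".join(morphemes[start:end])
                table.setdefault(phrase, set()).add(item)
    return table


def select_link(recognized, table):
    """Return the one link item matching the phrase, or None if absent or ambiguous."""
    matches = table.get(recognized, set())
    return next(iter(matches)) if len(matches) == 1 else None


link_items = {
    "首相官邸ホームページ": ["首相", "官邸", "ホームページ"],
    "国会会議録": ["国会", "会議", "録"],
}
table = build_candidates(link_items)
```

With this table, a single word such as 官邸 and a compound such as 首相官邸 both resolve to the same link item.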
[0030]
FIG. 7 shows the form of a home page displayed in a multi-frame configuration in the Web browser of an embodiment of the present invention, and FIG. 8 shows an example of how it is read aloud by the voice browser of the embodiment. In this case, the voice browser client 13 learns by analyzing the HTML file that the page has multiple views, among other things, and tells the user by voice. Reading aloud is performed view by view.
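Detecting that a page has several views could be done by counting `frame` elements in the frameset document. The following is a sketch only: `FrameLister`, `frame_guidance`, and the wording of the spoken message are assumptions, not taken from FIG. 8.

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Collect the name and source of each <frame> in a frameset page."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            attributes = dict(attrs)
            self.frames.append((attributes.get("name", ""), attributes.get("src", "")))


def frame_guidance(html):
    """Return a guidance sentence for multi-frame pages, or None otherwise."""
    lister = FrameLister()
    lister.feed(html)
    lister.close()
    if len(lister.frames) < 2:
        return None
    return "This page has %d views." % len(lister.frames)


frameset = ('<frameset cols="30%,70%">'
            '<frame name="menu" src="menu.html">'
            '<frame name="main" src="main.html">'
            '</frameset>')
message = frame_guidance(frameset)
```

Each `src` collected here would then be fetched and read aloud as its own view.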
[0031]
FIG. 9 is an example in which an image is placed on a page in the Web browser of an embodiment of the present invention, and FIG. 10 is an example of how it is read aloud by the voice browser of the embodiment. Because the image shown in FIG. 9 is visual information, it is difficult to convey by voice. However, if a description is attached to the image, the voice browser client 13 can extract the description corresponding to the image by analyzing the HTML tag information and read it aloud. This requires that the HTML author has added a description of the image as text information.
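One common place for an author-supplied image description is the `alt` attribute of the `img` tag. The sketch below extracts it; treating `alt` as the source of the description is an assumption, since the text only says that the description is found by analyzing HTML tag information, and `ImageDescriptions` is a hypothetical name.

```python
from html.parser import HTMLParser


class ImageDescriptions(HTMLParser):
    """Collect author-supplied descriptions of images (assumed to be in `alt`)."""

    def __init__(self):
        super().__init__()
        self.descriptions = []
        self.undescribed = 0   # images with no usable description

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        alt = (dict(attrs).get("alt") or "").strip()
        if alt:
            self.descriptions.append(alt)
        else:
            self.undescribed += 1


images = ImageDescriptions()
images.feed('<p>Map</p><img src="a.gif" alt="Map of Tokyo"><img src="b.gif">')
images.close()
```

Counting images without a description lets the reader at least announce that an image is present even when nothing can be read for it.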
[0032]
Next, the playback control function is described. This function allows the reading speed, volume, speaker gender, and so on to be changed in real time by voice instruction. For reading speed and speaker gender, the speech data is regenerated by changing the parameters of the speech synthesis engine 61, and regeneration is performed preferentially starting from the current playback point, which makes real-time changes possible. Volume is handled by changing a system parameter.
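Regenerating speech data "preferentially from the playback point" amounts to choosing an order in which sentences are re-synthesized. A minimal sketch follows; the function name `regeneration_order` is hypothetical, and regenerating the already-read sentences afterwards is an assumption.

```python
def regeneration_order(sentence_count, playback_index):
    """Return sentence indices in the order they should be re-synthesized.

    Sentences from the current playback point onward come first, so the
    new speed or voice is audible immediately; earlier ones follow.
    """
    ahead = list(range(playback_index, sentence_count))
    behind = list(range(0, playback_index))
    return ahead + behind


order = regeneration_order(5, 2)
```

With five sentences and playback at index 2, the sentence being read is regenerated first and the two sentences already read are regenerated last.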
[0033]
As an additional feature of the playback control function, interrupts for specific services (such as announcing the current time) can also be added. Specifically, when the user asks for the current time during reading, the system is queried for the time, the speech synthesis engine 61 is asked to generate the speech data, and once generation completes, reading is paused and the current time is announced. The interrupted reading is then resumed.
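The pause, announce, resume sequence can be sketched with a reader that keeps its position across an interruption. The class `Reader`, its methods, and the wording of the announcement are hypothetical; audio output is stood in for by a list of strings.

```python
from datetime import datetime


class Reader:
    """Read sentences in order and allow a time announcement to cut in."""

    def __init__(self, sentences):
        self.sentences = list(sentences)
        self.position = 0
        self.spoken = []   # stands in for audio output

    def step(self):
        """Read the next sentence; return False when nothing is left."""
        if self.position >= len(self.sentences):
            return False
        self.spoken.append(self.sentences[self.position])
        self.position += 1
        return True

    def announce_time(self, now=None):
        """Announce the time without disturbing the reading position."""
        now = now or datetime.now()
        self.spoken.append("The time is %02d:%02d." % (now.hour, now.minute))


reader = Reader(["First sentence.", "Second sentence."])
reader.step()
reader.announce_time(datetime(1998, 2, 27, 9, 5))
reader.step()
```

Because the position is untouched by the announcement, reading resumes with the sentence that would have come next.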
[0034]
Playback of audio data embedded in an HTML file can also be controlled. FIG. 11 is a configuration diagram for playing music or recitation content in an embodiment of the present invention. This audio data contains music, recitations, and the like, and when it is to be played, the Web browser 11 automatically launches playback software 14 capable of playing it. Because the voice browser client 13 controls the playback software 14, the existing functions of that software, such as pause and play, can be controlled by voice.
[0035]
FIG. 12 is a sequence chart of audio data playback control in an embodiment of the present invention. First, when the voice browser client 13 issues a URL instruction to the Web browser 11 (step 201), the Web browser 11 sends the instruction to the Internet 70. The Web browser 11 then obtains the HTML from the Internet 70 and transfers it to the voice browser client 13, which analyzes the HTML. The Web browser 11 also requests the audio data embedded in the HTML from the Internet 70 (step 203), obtains the response to that request from the Internet 70, launches the playback software 14, and transfers the data to it (step 204). The voice browser client 13 issues a speech recognition request to the speech recognition engine 52 based on the analysis result (step 205), and when the speech recognition result is obtained, it is transferred to the voice browser client 13 (step 206). The voice browser client 13 then controls the playback software 14 to play the audio.
[0036]
The present invention is not limited to the above-described embodiments, and various modifications and applications are possible within the scope of the claims.
[0037]
[Effects of the invention]
As described above, according to the present invention, it is possible to access WWW information on the Internet by voice input and to output it by voice.
In addition, when the accessed information is in English, it can be output in Japanese by the translation function, so even a user with little knowledge of English can grasp the content of the information.
[0038]
Furthermore, because not only text but also music and recitation content can be accessed, the terminal can also be used for entertainment.
In addition, when music or recitation content is played, playback controls such as pause, play, and stop are available just as on an ordinary radio cassette player (a radio/cassette unit with output, playback, and recording functions), so users feel little resistance to using it.
[0039]
Furthermore, because the user can interrupt text reading to hear the time, the time can be learned by voice by visually impaired persons, or by anyone who does not have a clock at hand.
With the above functions, it is possible to support visually impaired people using the Internet.
[Brief description of the drawings]
FIG. 1 is a principle configuration diagram of the present invention.
FIG. 2 is a system configuration diagram to which the present invention is applied.
FIG. 3 is a system configuration diagram of an embodiment of the present invention.
FIG. 4 is a sequence chart of an operation according to an embodiment of the present invention.
FIG. 5 is a diagram showing a normal user interface of a Web browser according to an embodiment of the present invention.
FIG. 6 is an example of reading aloud by a voice browser according to an embodiment of the present invention.
FIG. 7 is an example of a homepage type displayed by a multi-frame configuration of a Web browser according to an embodiment of the present invention.
FIG. 8 is an example of reading aloud by a voice browser according to an embodiment of the present invention.
FIG. 9 is an example in which an image is posted on a Web browser according to an embodiment of the present invention.
FIG. 10 is an example of reading aloud by a voice browser according to an embodiment of the present invention.
FIG. 11 is a configuration diagram in the case of reproducing music / reading content according to an embodiment of the present invention.
FIG. 12 is a sequence chart of audio data reproduction control according to an embodiment of the present invention.
[Explanation of symbols]
10 Client terminal
11 Web browser
12 Voice input button monitoring program
13 Voice browser client
14 Playback software
20, 30, 40, 50, 60 Workstation
21 Proxy server
22 English-Japanese translation engine
31 Japanese-English translation engine
41 Proxy server
51 Speech recognition I/F program
52 Speech recognition engine
53 Morphological analysis engine
61 Speech synthesis engine
70 Internet
100 Server
110 Title
120 Body
130 Link item
200 Client terminal
201 Voice input means
202 Request issuing means
203 Voice output means
210 First frame
220 Second frame
310 Image

Claims

A voice client terminal that acquires, via the Internet, an HTML file stored on a server and outputs it by voice, characterized by comprising:
voice input means for inputting a request spoken by a user;
request issuing means for transmitting the input user's voice to a speech recognition server, receiving a speech recognition result from the speech recognition server, extracting a URL from the received speech recognition result, and requesting an HTML file from a proxy server based on the URL;
means for analyzing the HTML file acquired from the proxy server and classifying it into a title, link items, and a body;
means for transmitting the classified link items to the speech recognition server;
means for transmitting the displayed text information, including the title, link items, and body, to a speech synthesis server and receiving synthesized speech data from the speech synthesis server; and
voice output means for using the received speech data to read aloud the title, the link items, and the body in that order.