JP2002244841A - Voice indication system and voice indication program

Info

Publication number: JP2002244841A
Application number: JP2001044686A
Authority: JP
Inventors: Tatsu Ifukube; 達伊福部
Original assignee: BUG Inc; Japan Science and Technology Corp
Current assignee: BUG Inc; Japan Science and Technology Agency
Priority date: 2001-02-21
Filing date: 2001-02-21
Publication date: 2002-08-30

Abstract

PROBLEM TO BE SOLVED: To support communication between a speaker and a user by simultaneously providing the user with linguistic information and non-linguistic information from the speaker. SOLUTION: The voice indication system 100 involves, for example, a speaker 10, a computing device 1 that is a portable computer, a transmissive display device 2 that is a transmissive glasses display, and a user 40. The voice indication system 100 supports communication between the speaker 10 and the user 40 (for example, a hearing-impaired person). The computing device 1 recognizes speech input by the speaker 10 through a microphone or the like, converts it into linguistic information (verbal information; here, a character string), and outputs the character string resulting from the speech recognition to the transmissive display device 2. The transmissive display device 2 displays the character string of linguistic information input from the computing device 1 and has a transmissive part (transmissive display) for obtaining non-linguistic information from the speaker 10 (non-verbal information, for example one or more of the motion of parts of the speaker's face such as the lips and eyes, lip reading, gestures, sign language, and facial expressions).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a voice display system and a voice display program, and more particularly to a voice display system and a voice display program that can assist communication between a speaker and a user by using not only character-string information obtained as speech recognition result data (linguistic information: verbal information) but also information other than linguistic information, such as the speaker's facial expressions, lips, and gestures (non-linguistic information: non-verbal information).

[0002]

[Description of the Related Art] In recent years, in a highly information-oriented and super-aged society, there has been a strong demand for various assistive devices (for example, hearing aids) that compensate for the sensory organs of people whose ability to receive information through those organs (for example, sight and hearing) is reduced (for example, elderly people and hearing-impaired people). In human-to-human communication in particular, speech plays a very important role, and various assistive methods for hearing-impaired people are being studied. For example, in university lectures, there is a method in which the lecture content is quickly transcribed and displayed as subtitles near the speaker or the lecture blackboard (reference: Masayuki Kobayashi, Yasushi Ishihara, Shun Nishikawa, Hidetomo Takahashi, "Prototype of a real-time subtitle presentation system with ruby"; Tsukuba College of Technology Techno Report, 1996).

[0003]

[Problems to be Solved by the Invention] However, the above method requires the help of a third party, and it is expected to be difficult to use when, for example, a hearing-impaired person goes out alone. Thus, there is still no assistive device that hearing-impaired people can use on a daily basis. Assistive devices for hearing-impaired people would therefore also be useful for elderly people with hearing loss, whose numbers will continue to grow, and for completely deaf people, and the need for such devices is expected to keep increasing.

[0004] On the other hand, in recent years, so-called speech recognition methods have become widespread. These methods recognize input speech, convert it into a character string, and display the character string on, for example, a computer monitor or as television subtitles. When such a method is applied to unspecified speakers in a noisy environment (that is, an everyday living space), it can generally secure a recognition rate of only about 50 to 60%, so at present its use is limited to specific applications.

[0005] However, this situation results from regarding the speech recognition system as a man-machine interface, as has been done conventionally. When the speech recognition system is instead regarded as a man-man interface intended for humans, a human can infer the missing information from the surrounding context of the communication even if the input speech is not recognized perfectly (see: Miki Saito, "Research on a man-man interface using speech recognition technology for people with hearing loss"; Master's thesis, Graduate School of Engineering, Hokkaido University, 1999).

[0006] Here, technology related to the present invention will be described. The present inventors noted that, in communication, non-linguistic information (non-verbal information) such as the movement of facial parts (lips, eyes, and so on), lip reading (reading the movement of the lips), gestures, sign language, and facial expressions is important in addition to the speaker's voice. They further noted that, although using such non-verbal information in a man-machine interface requires very advanced technology, in a man-man interface it can be obtained easily through human vision.

[0007] In addition, hearing-impaired people (users) have often learned the lip reading mentioned above or sign language conveyed by gestures, and they are expected to understand the speaker's words to some extent through lip reading, sign language, and the like. It is therefore necessary to present the character string of the speech recognition result in such a way that the hearing-impaired person can receive the character string (verbal information) and the non-verbal information obtained from lip reading and sign language at the same time.

[0008] Two points must be considered here. First, the ability to understand the meaning of a sentence through lip reading or sign language depends greatly on when the hearing-impaired person lost hearing, the person's residual hearing, and so on. Second, even with a high-performance speech recognition device, the recognition rate is never guaranteed to be 100%, and in most cases the character string contains errors, so presenting every character of the recognition result may cause the meaning of the sentence to be misunderstood. For this reason, the hearing-impaired person should be able to set, for example, the proportion of the character string that is displayed according to his or her ability to understand sentences through lip reading or sign language (that is, proficiency in lip reading or sign language).

[0009] In view of the above, an object of the present invention is to assist communication between a speaker and a user by presenting the speaker's linguistic information and non-linguistic information to the user simultaneously. Another object of the present invention is to simultaneously present to the user non-linguistic information from lip reading and sign language and linguistic information that is the speech recognition result, according to the characteristics of the user and the performance of the speech recognition device. A further object of the present invention is to enable smooth communication even for people who lost their hearing later in life and have a low ability to understand sentences through lip reading or sign language.

[0010]

[Means for Solving the Problems] According to a first solution of the present invention, there is provided a voice display system for assisting communication, comprising: a computing device that recognizes input speech, converts it into linguistic information, and outputs the linguistic information; and a transmissive display device that displays the linguistic information and includes a transmissive part for obtaining non-linguistic information from a speaker, wherein the computing device has a voice input unit for inputting the speech, a speech recognition unit for recognizing the speech input from the voice input unit, a layout setting unit for setting the display state on the transmissive display device, and an output unit that outputs the result of speech recognition by the speech recognition unit to the transmissive display device in accordance with the settings of the layout setting unit.

[0011] According to a second solution of the present invention, there is provided a voice display program used in a voice display system for assisting communication, the system comprising a computing device that recognizes input speech, converts it into linguistic information, and outputs the linguistic information, and a transmissive display device that displays the linguistic information and includes a transmissive part for obtaining non-linguistic information from a speaker. The program causes a computer to execute: a voice input procedure for inputting the speech; a speech recognition procedure for recognizing the speech input in the voice input procedure, in which, according to a preset threshold for the likelihood of the recognized character string, the character string to be displayed is rendered as non-characters when the likelihood is lower than the threshold; and an output procedure for outputting the result of speech recognition by the speech recognition procedure to the transmissive display device.

[0012] Further, according to the present invention, linguistic information (a character string), which is speech recognition result data containing errors, is displayed on a transmissive display device (glasses display). A user of the transmissive glasses display can therefore see, at the same time, not only the character string displayed in front of the speaker but also non-linguistic information including the movement of facial parts such as the speaker's lips and eyes, lip reading, gestures, sign language, and facial expressions. As a result, even if the user is hearing-impaired, the speaker's meaning becomes easier to understand, and smooth communication between the user and the speaker can be achieved.

[0013]

[Embodiments of the Invention] Embodiments of the present invention will be described in detail below with reference to the drawings. FIG. 1 is a schematic configuration diagram of a voice display system 100 according to the present invention. The voice display system 100 includes, for example, a computing device 1, which is a portable computer, and a transmissive display device 2, which is a transmissive glasses display. The voice display system 100 is a system for assisting communication between a speaker 10 and a user 40 (for example, a hearing-impaired person). The computing device 1 recognizes speech input by the speaker 10 via a microphone or the like (not shown), converts it into linguistic information (verbal information; here, a character string), and outputs the character string resulting from the speech recognition to the transmissive display device 2. The transmissive display device 2 displays the character string, which is the linguistic information input from the computing device 1, and has a transmissive part (transmissive display) for obtaining non-linguistic information from the speaker 10 (non-verbal information including, for example, one or more of the movement of facial parts such as the lips and eyes of the speaker 10, lip reading, gestures, sign language, and facial expressions; dotted arrow in the figure).

[0014] The computing device 1 includes, for example, a voice input unit 50, a processing unit (CPU) 55, a speech recognition unit 60, a layout setting unit 90, and an output unit 95. The voice input unit 50 receives the speech of the speaker 10. The speech recognition unit 60 performs, for example, speech recognition on the speech input from the voice input unit 50, and includes a database selection unit 70 and a likelihood threshold setting unit 80.

[0015] The database selection unit 70 selects, for example, the difficulty level of the kanji stored in advance in one or more language databases (kanji DBs) provided as appropriate in the speech recognition unit 60 or in the voice display system 100 (for example, a second-grade elementary school level or a junior high school level, following the JIS levels). The selection is made according to the vocabulary of the user 40, which determines how many kanji the generated character string should contain. As a result, when the speech of the speaker 10 is recognized, a character string containing kanji suited to the vocabulary of the user 40 can be created. If no kanji corresponding to the hiragana is recognized in the language database, the entire character string is displayed on the transmissive display of the transmissive display device 2 as hiragana and/or non-characters (for example, symbols).
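The level-dependent choice between kanji and hiragana described in [0015] might be sketched as follows. This is only an illustration under stated assumptions: the `Word` record, the integer `level` scale, and the per-word fallback to the hiragana reading are hypothetical and do not come from the patent, which does not specify the data layout of the kanji DB.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Word:
    kanji: str    # surface form that may contain kanji
    reading: str  # hiragana reading of the same word
    level: int    # hypothetical difficulty level; larger means harder


def render_word(word: Word, user_level: int) -> str:
    """Use the kanji form only when it is within the user's vocabulary level."""
    return word.kanji if word.level <= user_level else word.reading


def render_sentence(words: Sequence[Word], user_level: int) -> str:
    """Join the per-word renderings into the character string to display."""
    return "".join(render_word(w, user_level) for w in words)


# Illustrative data only; the levels are made up for this sketch.
SAMPLE = [
    Word("音声", "おんせい", 3),
    Word("を", "を", 0),
    Word("認識", "にんしき", 5),
]
print(render_sentence(SAMPLE, 3))  # 音声をにんしき
```

Raising `user_level` to 5 would show both words in kanji, and lowering it to 0 would show the whole string in hiragana.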

[0016] When speech recognition is performed, using the language database, on continuous speech from the speaker 10 stored in a speech database (not shown), the speech recognition unit 60 obtains as the recognition result a word sequence and the likelihood of each word. The likelihood is, for example, the plausibility of the connection between words, and it may also be regarded as a distance between words; a high likelihood is thus roughly synonymous with a small distance between words.

[0017] The likelihood threshold setting unit 80 sets, for example, a threshold on the likelihood so that only the character strings that were recognized correctly among the recognition results of the speech recognition unit 60 are presented. Specifically, when the likelihood of a word is high, the word is likely to have been recognized correctly, and when the likelihood is low, the word is unlikely to have been recognized correctly. Therefore, by setting a threshold in the likelihood threshold setting unit 80, each word in the recognized word sequence can be presented if its log likelihood is greater than the threshold and withheld if its log likelihood is smaller than the threshold (specific threshold values are described later). The speech recognition unit 60 can also be set, as appropriate, not merely to withhold a word whose log likelihood is smaller than the threshold (a word for which the connection between words is less plausible and the distance between words is large) but to display it on the transmissive display of the transmissive display device 2 as a non-character such as a symbol.
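The thresholding in [0017] can be illustrated with a short sketch. It assumes the recognizer returns `(word, log_likelihood)` pairs; the function name, the sample words, and the sample log-likelihood values are hypothetical. The patent does not say what happens when a log likelihood equals the threshold, so this sketch treats that case as withheld.

```python
from typing import List, Optional, Sequence, Tuple


def present_words(
    recognized: Sequence[Tuple[str, float]],
    threshold: float,
    placeholder: Optional[str] = "?",
) -> List[str]:
    """Keep words whose log likelihood exceeds the threshold.

    Other words are replaced by `placeholder` (a non-character such as a
    symbol), or dropped entirely when `placeholder` is None.
    """
    shown = []
    for word, log_likelihood in recognized:
        if log_likelihood > threshold:
            shown.append(word)
        elif placeholder is not None:
            shown.append(placeholder)
    return shown


# Made-up recognition result for illustration.
RESULT = [("today", -1200.0), ("whether", -3100.0), ("is", -900.0), ("fine", -2600.0)]
print(present_words(RESULT, -2500.0))        # ['today', '?', 'is', '?']
print(present_words(RESULT, -2500.0, None))  # ['today', 'is']
```

Lowering the threshold (for example to -3500.0 with this data) presents every word, which matches the trend described for FIG. 5(a).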

[0018] The layout setting unit 90 sets the display state on the transmissive display device 2 and includes, for example, a presented-character-count setting unit 91, a color adjustment setting unit 92, a size adjustment setting unit 93, and a display position adjustment setting unit 94. The presented-character-count setting unit 91 can, for example, adjust as appropriate the number of presented characters of the character string that is output from the computing device 1 and displayed on the transmissive display device 2. The presented-character-count setting unit 91 also includes, for example, a proficiency folder indicating the correspondence between the lip-reading and/or sign-language proficiency of the user 40 and the number of presented characters (see FIG. 2).

[0019] The color adjustment setting unit 92 can likewise adjust the color of the character string as appropriate. The size adjustment setting unit 93 can likewise adjust the size of the character string as appropriate. The display position adjustment setting unit 94 can likewise adjust the display position of the character string and the focal distance of the display as appropriate (for example, the focal distance at which the character string is displayed can be adjusted according to the distance to the speaker 10). The settings of the various setting units included in the layout setting unit 90 are made as appropriate by the user 40 (solid arrow in the figure). The layout of the character string displayed on the transmissive display device 2 can thereby be changed to the state desired by the user 40. The output unit 95 outputs the character string (verbal information) resulting from speech recognition by the speech recognition unit 60 to the transmissive display device 2 in accordance with the settings made by the user 40 in the layout setting unit 90.

[0020] FIG. 2 is an explanatory diagram of the proficiency folder 20. As described above, the proficiency folder 20 is included in the presented-character-count setting unit 91 in the layout setting unit 90 and indicates, for example, the correspondence between the lip-reading and/or sign-language proficiency 21 of the user 40 and the (proportion of) presented characters 22. Here, the correspondences "low, 80%", "normal, 60%", and "high, 40%" are stored in advance. The correspondence between the proficiency 21 and the (proportion of) presented characters 22 in the proficiency folder 20 can be set as appropriate.
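The correspondence held in the proficiency folder 20 amounts to a small lookup table. The sketch below stores the three pairs given in [0020] and shows one possible way to apply the proportion. The patent does not state how the proportion decides which characters are shown, so ranking words by log likelihood and keeping the best ones is purely an assumption made for illustration, as are all names and sample values.

```python
from typing import Dict, List, Sequence, Tuple

# Proficiency 21 -> proportion of presented characters 22, in percent ([0020]).
PROFICIENCY_FOLDER: Dict[str, int] = {"low": 80, "normal": 60, "high": 40}


def select_for_proficiency(
    recognized: Sequence[Tuple[str, float]],
    proficiency: str,
    folder: Dict[str, int] = PROFICIENCY_FOLDER,
    placeholder: str = "?",
) -> List[str]:
    """Show the most likely words up to the proportion set for the user.

    Assumption: the proportion is applied to the word count, and the words
    with the highest log likelihood are the ones kept.
    """
    keep = len(recognized) * folder[proficiency] // 100
    ranked = sorted(
        range(len(recognized)), key=lambda i: recognized[i][1], reverse=True
    )
    kept = set(ranked[:keep])
    return [w if i in kept else placeholder for i, (w, _) in enumerate(recognized)]


# Made-up recognition result for illustration.
WORDS = [("a", -100.0), ("b", -500.0), ("c", -200.0), ("d", -900.0), ("e", -300.0)]
print(select_for_proficiency(WORDS, "high"))  # ['a', '?', 'c', '?', '?']
print(select_for_proficiency(WORDS, "low"))   # ['a', 'b', 'c', '?', 'e']
```

Because the table is an ordinary dictionary passed as a parameter, a different correspondence can be supplied, in line with the statement that the correspondence can be set as appropriate.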

[0021] Through the transmissive display of the transmissive display device 2, the user 40 (for example, a hearing-impaired person) can obtain not only the character string (verbal information) resulting from speech recognition of the speech of the speaker 10 but also the non-verbal information of the speaker 10 seen through the transmissive display. The user 40 inputs his or her own proficiency 21 to the computing device 1 using, for example, a proficiency setting button (not shown). The proficiency 21 of the user 40 may be entered into the computing device 1 not only by the user 40 but also by any appropriate person such as a maintenance person, a family member, or a doctor.

[0022] When the proficiency 21 of the user 40 in lip reading and sign language is high (this proficiency varies greatly between individuals), the user 40 is expected to communicate smoothly with the speaker 10 by picking up the non-verbal information. In this case, the character string serves the user 40 as an aid (or confirmation) (here, "high, 40%").

[0023] On the other hand, when the proficiency 21 of the user 40 in lip reading and sign language is low, for example, it is expected to be difficult to communicate with the speaker 10 using non-verbal information alone. In this case, the user 40 relies on the verbal information to communicate with the speaker 10 (here, "low, 80%"). The voice display system 100 according to the present invention can also be applied as a kind of training system for improving a hearing-impaired person's proficiency 21 in lip reading and sign language. Specifically, the (proportion of) presented characters 22 may be reduced as the hearing-impaired person's proficiency 21 in lip reading and sign language improves, or increased in order to check the proficiency 21.

[0024] FIG. 3 is a schematic explanatory diagram showing the voice display system 100 according to the present invention in use. The user 40 of the voice display system 100 (not shown) wears the transmissive glasses display 2. The user 40 communicates with the speaker 10 using not only the verbal information displayed on the transmissive glasses display 2 (here, "a character string, or ..., etc.") but also the non-verbal information of the speaker 10 obtained through the transmissive glasses display 2 (here, facial expressions, mouth movements, and gestures). In the figure, the verbal information is drawn overlapping the speaker 10 because, from the viewpoint of the user 40, the text "a character string, or ..., etc." displayed on the transmissive glasses display 2 appears to be displayed in front of the speaker 10.

[0025] FIG. 4 is a flowchart of the voice display system 100 according to the present invention. First, the presented-character-count setting unit 91 sets the (proportion of) presented characters 22 based on, for example, information about the proficiency 21 of the user 40 input via a proficiency setting button (not shown) (S201). At this point, the likelihood threshold described above may be set and/or the kanji DB may be selected as needed. Next, the various settings of the color adjustment setting unit 92, the size adjustment setting unit 93, and the display position adjustment setting unit 94 included in the layout setting unit 90 are made (S203). Speech from the speaker 10 is input to the voice input unit 50 via a microphone (not shown) (S205).

[0026] The speech input in step S205 is recognized by the speech recognition unit 60 (S207). The verbal information, which is the speech recognition result data of step S207, is output to the transmissive display device 2 via the output unit 95 (S209). It is then determined whether a change concerning the verbal information output to the transmissive display device 2 has been input (S211). Here, such a change means the setting of the (proportion of) presented characters 22, the setting of the likelihood threshold and/or the selection of the kanji DB as needed, and the various settings of step S203. If a change has been input in step S211, the process returns to step S201 and/or step S203, and the (proportion of) presented characters 22, the likelihood threshold and/or the kanji DB selection as needed, and the various settings of step S203 are set again. If no change has been input in step S211, the series of processes ends.
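The control flow of steps S201 to S211 can be summarized in a short loop. This is a minimal sketch: every callable passed in is a hypothetical stand-in for the corresponding unit of the computing device 1, and the code does not reproduce any actual implementation from the patent.

```python
from typing import Any, Callable, Dict

Settings = Dict[str, Any]


def run_voice_display(
    read_settings: Callable[[], Settings],
    capture_audio: Callable[[], bytes],
    recognize: Callable[[bytes, Settings], str],
    show: Callable[[str, Settings], None],
    settings_changed: Callable[[], bool],
) -> int:
    """Run the flow of FIG. 4 and return the number of passes made."""
    passes = 0
    while True:
        settings = read_settings()         # S201 and S203: proportion, layout
        audio = capture_audio()            # S205: speech from the speaker
        text = recognize(audio, settings)  # S207: speech recognition
        show(text, settings)               # S209: output to the display
        passes += 1
        if not settings_changed():         # S211: was a change input?
            return passes                  # no change: end of processing
```

A dry run with stub callables, where one change is reported before the flow ends, looks like this:

```python
shown = []
changes = iter([True, False])
count = run_voice_display(
    read_settings=lambda: {"ratio": 60},
    capture_audio=lambda: b"pcm",
    recognize=lambda audio, s: "text@{}".format(s["ratio"]),
    show=lambda text, s: shown.append(text),
    settings_changed=lambda: next(changes),
)
```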

[0027] FIG. 5 shows experimental results that objectively indicate how well subjects understood the meaning of sentences. Because this is the experiment reported in the prior work (see: Miki Saito, "Research on a man-man interface using speech recognition technology for people with hearing loss"; Master's thesis, Graduate School of Engineering, Hokkaido University, 1999), it is explained here only briefly. The explanation is limited to what is needed to show the premise of the present invention, "if a recognition rate of about 60% is secured for the character string of the speech recognition result (incomplete verbal information), a human can infer the character string from the surrounding context and/or the recognition result and can consequently communicate", and to support the claim that "in addition to this premise, the present inventors' idea of simultaneously presenting non-verbal information makes communication even smoother".

[0028] FIG. 5(a) shows the log-likelihood threshold and the change in the presented sentence. This log-likelihood threshold and change in presented sentence 30 includes, for example, a likelihood 30 and a presented sentence 32. The speech recognition unit 60 calculates the recognized words and their likelihoods. Then, only the recognized words whose likelihood is greater than the preset threshold are presented. Words whose likelihood is smaller than the threshold were displayed as a non-character (here, "?"). As the correspondence between the likelihood 30 and the presented sentence 32 in the figure shows, lowering the log-likelihood threshold increases the number of words presented.

[0029] FIG. 5(b) shows the change in sentence comprehension accuracy with the log-likelihood threshold. As shown, for all subjects (A to F), sentence comprehension accuracy rises sharply once the log-likelihood threshold reaches "-2500", that is, once word recognition accuracy exceeds about 40%. Individual differences are large among people with hearing loss, and differences in acquired vocabulary size are thought to be one cause.

[0030] The above objectively suggests the premise of the present invention: "if a recognition rate of about 60% is secured for the character string of the speech recognition result (incomplete verbal information), a human can infer the character string from the surrounding context and/or the recognition result and can consequently communicate".

[0031] FIG. 6 shows experimental results for the voice display system 100 according to the present invention. The experiment examined how sentence comprehension changes when the incomplete verbal information obtained from the arithmetic unit 1 and the non-verbal information obtained from the speaker 10 are presented simultaneously to the user 40 wearing the transmissive display device 2.

[0032] In this experiment, before asking hearing-impaired persons to cooperate, three Japanese men aged 23 to 30 with no hearing impairment served as subjects. None of the three had received any particular training in lip-reading. The sentences presented were the same as the presented sentences 32 shown in FIG. 5(a).

[0033] As the non-verbal information presented together with the presented sentence 32, video of a face shot with a digital video camera (Victor GR-DV1, 570,000 pixels) was used. For this video, a 23-year-old Japanese man read aloud the correct sentence, as it was before speech processing, and was filmed with the frame centered on his face. The video was digitally processed on a PC to superimpose the presented sentence as a subtitle. The subtitle was displayed over the speaker's mouth after the speaker had finished the sentence and his mouth had stopped moving. When the original sentence was the same, the same face video was used regardless of the four likelihood levels (corresponding to the plot positions shown in FIG. 6).

[0034] Two experiments were conducted: a first experiment (verbal information) examining how comprehension changes with verbal information alone, and a second experiment examining how comprehension changes when video material, that is, non-verbal information, is added to the verbal information. The first experiment is the same as that of FIG. 5, so its description is omitted. Here, the subjects were randomly divided into two groups, A and B, according to the original sentences. The subjects in groups A and B were each presented with 100 sentences, consisting of 25 sentences at each of four likelihood levels. The subjects in each group read the presented sentences, for example printed on paper, in order and, if they understood the meaning, answered with what they had understood. They typed their answers on a keyboard into a text editor on a PC they normally used. The subjects were instructed in advance to proceed in the given order, without skipping a presented sentence or returning to an earlier one.
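The stimulus set described above (25 sentences at four likelihood levels, 100 items in total, presented once each in a fixed order) can be sketched as below. The four threshold values in `LEVELS` and the function names are hypothetical; the text does not state the actual levels or how the items were ordered.

```python
from itertools import product
from typing import Iterator, List, Tuple

Stimulus = Tuple[int, float]  # (sentence index, log-likelihood threshold)


def build_stimuli(n_sentences: int, thresholds: List[float]) -> List[Stimulus]:
    """Pair every source sentence with every likelihood level."""
    return list(product(range(n_sentences), thresholds))


def present_in_order(stimuli: List[Stimulus]) -> Iterator[Tuple[int, Stimulus]]:
    """Yield each item exactly once, in list order, with its 1-based position."""
    for position, item in enumerate(stimuli, start=1):
        yield position, item


# Hypothetical likelihood levels; the source only says there are four.
LEVELS = [-2000.0, -2500.0, -3000.0, -3500.0]

stimuli = build_stimuli(25, LEVELS)
print(len(stimuli))  # 25 sentences x 4 levels
```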

[0035] In the second experiment (verbal information plus non-verbal information), a transmissive HMD (OLYMPUS Mediamask) was used to present the video to the subject. The subject played the digital video, attempted to lip-read from the face video, paused the video when the subtitle appeared and, if he understood the sentence, typed it on a keyboard into the text editor on the PC as before.

[0036] As for the procedure, group A of the first experiment was run first, then the second experiment, and finally group B of the first experiment. Breaks were taken between experiments at the subject's discretion. This order was adopted to reduce, as far as possible, the familiarity with the sentences that arises from repeating the experiment with the same original sentences, and to lessen somewhat the influence of the subject's concentration and motivation, to which the results are sensitive.

[0037] The graph shown gives the experimental results for subject A. The horizontal axis is the log-likelihood threshold (word recognition accuracy) and the vertical axis is sentence comprehension accuracy (%). The solid line with square markers is the sentence comprehension accuracy in the second experiment. The solid line with triangular markers is the average of the group A and group B results of the first experiment.
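The two curves in the graph can be computed along the following lines. This is a sketch under stated assumptions: the text does not say how comprehension was scored, so each trial is assumed to be judged simply as understood or not, and the trial data and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Trial = Tuple[float, bool]  # (log-likelihood threshold, sentence understood?)


def accuracy_by_threshold(trials: Iterable[Trial]) -> Dict[float, float]:
    """Sentence comprehension accuracy (%) for each threshold."""
    buckets = defaultdict(list)
    for threshold, understood in trials:
        buckets[threshold].append(1.0 if understood else 0.0)
    return {t: 100.0 * mean(scores) for t, scores in buckets.items()}


def average_curves(a: Dict[float, float], b: Dict[float, float]) -> Dict[float, float]:
    """Point-wise average of two curves over the thresholds they share."""
    return {t: (a[t] + b[t]) / 2.0 for t in sorted(a.keys() & b.keys())}


# Hypothetical first-experiment trials for groups A and B.
group_a = [(-2000.0, False), (-2000.0, False), (-3000.0, True), (-3000.0, False)]
group_b = [(-2000.0, True), (-2000.0, False), (-3000.0, True), (-3000.0, True)]

first_experiment = average_curves(
    accuracy_by_threshold(group_a), accuracy_by_threshold(group_b)
)
print(first_experiment)
```

The second-experiment curve would come from `accuracy_by_threshold` applied to that experiment's trials alone, since only the first experiment was run in two groups.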

[0038] For subject A, sentence comprehension accuracy improves sharply as the log-likelihood threshold goes from -2000 to -3000 (a recognition rate of about 40%). Also for subject A, sentence comprehension accuracy generally rises as the log-likelihood threshold falls. That is, for subject A, the improvement in sentence comprehension accuracy from adding non-verbal information is clearly shown.

[0039] As described above, the voice display system 100 of this embodiment assists communication between the speaker and the user by presenting the speaker's linguistic and non-linguistic information to the user simultaneously. Depending on the user's characteristics and the performance of the speech recognition device, non-verbal information from lip-reading and sign language and linguistic information from the speech recognition result can be presented to the user simultaneously. In particular, the voice display system 100 enables smooth communication even for late-deafened persons whose ability to understand sentences by lip-reading or sign language is low.

[0040] The voice display system of the present invention can be provided as a voice display method including functions for realizing each unit, a voice display program for causing a computer to execute each of its procedures, a computer-readable recording medium storing the voice display program, a program product that includes the voice display program and can be loaded into the internal memory of a computer, a computer such as a server containing the program, a voice display device, and the like.

[0041]

[Effects of the Invention] According to the present invention, as described above, communication between the speaker and the user can be assisted by presenting the speaker's linguistic and non-linguistic information to the user simultaneously. Further, according to the present invention, non-verbal information from lip-reading and sign language and linguistic information from the speech recognition result can be presented to the user simultaneously, depending on the user's characteristics and the performance of the speech recognition device. The present invention also enables smooth communication even for, for example, late-deafened persons whose ability to understand sentences by lip-reading or sign language is low.

[Brief description of the drawings]

FIG. 1 is a schematic configuration diagram of the voice display system 100 according to the present invention.

FIG. 2 is an explanatory diagram of the proficiency folder 20.

FIG. 3 is a schematic explanatory diagram showing the voice display system 100 according to the present invention in use.

FIG. 4 is a flowchart of the voice display system 100 according to the present invention.

FIG. 5 shows experimental results that objectively indicate the subjects' comprehension of sentence meaning.

FIG. 6 shows experimental results for the voice display system 100 according to the present invention.

[Description of Reference Numerals]
1 Arithmetic unit
2 Transmissive display device
10 Speaker
40 User
50 Voice input unit
55 Processing unit (CPU)
60 Voice recognition unit
90 Layout setting unit
95 Output unit
100 Voice display system

Continuation of front page
(72) Inventor: Tatsu Ifukube, 1-43 Minami 13-jo Nishi 13-chome, Chuo-ku, Sapporo, Hokkaido
F-terms (reference): 5D015 KK02 LL03 LL05 LL07; 5E501 AC37 CA02 CB15 CC11 EA01 EA21 FA45

Claims


1. A voice display system for assisting communication, comprising: an arithmetic unit that recognizes input voice, converts it into linguistic information, and outputs the linguistic information; and a transmissive display device including a transmissive part for obtaining non-verbal information, wherein the arithmetic unit comprises: a voice input unit for inputting the voice; a voice recognition unit that recognizes the voice input from the voice input unit; a layout setting unit that sets a display state on the transmissive display device; and an output unit that outputs the result of speech recognition by the voice recognition unit to the transmissive display device according to the setting of the layout setting unit.

2. The voice display system according to claim 1, wherein the layout setting unit sets one or more of the number of characters, color, size, display position, and display focal length of the character string displayed on the transmissive display device.

3. The voice display system according to claim 1, wherein the non-verbal information includes at least one of movements of parts of the speaker's face such as the lips and eyes, gestures, sign language, lip-reading, and facial expressions.

4. The voice display system according to claim 2 or 3, wherein the layout setting unit sets the number of displayed characters, or their ratio, fewer or smaller when proficiency in lip-reading or sign language is high, and sets the number of displayed characters, or their ratio, more or larger when the proficiency is low.

5. The voice display system according to claim 1, wherein the arithmetic unit is a portable computer and the transmissive display device is a glasses-type display.

6. The voice display system according to any one of claims 1 to 5, wherein the voice recognition unit converts the displayed character string into kanji based on a language database selected according to a set kanji difficulty level.

7. The voice display system according to claim 1, wherein the voice recognition unit includes a likelihood threshold setting unit that sets a threshold for the likelihood of a speech-recognized character string, and, when the likelihood is lower than the threshold set in advance by the likelihood threshold setting unit, the displayed character string is displayed as non-characters.

8. A voice display program used in a voice display system for assisting communication, the system comprising an arithmetic unit that recognizes input voice, converts it into linguistic information, and outputs the linguistic information, and a transmissive display device including a transmissive part for displaying the linguistic information and obtaining non-verbal information from a speaker, the program causing a computer to execute: a voice input procedure for inputting the voice; a voice recognition procedure for recognizing the voice input in the voice input procedure such that, according to a preset threshold for the likelihood of a speech-recognized character string, the displayed character string is rendered as non-characters when the likelihood is lower than the threshold; and an output procedure for outputting the result of speech recognition by the voice recognition procedure to the transmissive display device.

9. The voice display program according to claim 8, wherein the voice recognition procedure converts the displayed character string into kanji based on a language database selected according to a set kanji difficulty level.