JP6298806B2

JP6298806B2 - Speech translation system, control method therefor, and speech translation program

Info

Publication number: JP6298806B2
Application number: JP2015241459A
Authority: JP
Inventors: 知高大越
Original assignee: RECRUIT LIFESTYLE CO., LTD.
Current assignee: RECRUIT LIFESTYLE CO., LTD.
Priority date: 2015-12-10
Filing date: 2015-12-10
Publication date: 2018-03-20
Anticipated expiration: 2035-12-10
Also published as: JP2017107098A

Description

The present invention relates to a speech translation system, a speech translation method, and a speech translation program.

In general, a speech translation system uses a speech recognition engine to recognize input speech when translating speech in one language into speech in another language (see, for example, Patent Documents 1 and 2). In such a speech recognition engine, for example, the uttered speech is matched against an acoustic model database to convert the "sound" into a "reading"; the reading is then matched against a language model database to convert it into "characters"; the word order is adjusted; and the result is displayed as a sequence of text as needed. In the speech translation system, input speech in one language recognized in this way is translated into another language by a translation engine, and the translation result is converted into output speech by a speech synthesis engine.
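The recognition, translation, and synthesis chain described above can be pictured with a minimal Python sketch. The lookup tables, the byte strings standing in for audio, and all function names are hypothetical placeholders introduced for illustration; a real engine would use statistical models rather than dictionaries.

```python
from typing import Optional

# Hypothetical toy stand-ins for the acoustic model, language model,
# and translation databases (assumptions for illustration only).
ACOUSTIC_MODEL = {b"\x01\x02": "konnichiwa"}   # "sound" -> "reading"
LANGUAGE_MODEL = {"konnichiwa": "こんにちは"}    # "reading" -> "characters"
TRANSLATION_TABLE = {"こんにちは": "Hello"}      # source text -> target text


def recognize(audio: bytes) -> Optional[str]:
    """Convert sound to a reading, then the reading to characters."""
    reading = ACOUSTIC_MODEL.get(audio)
    if reading is None:
        return None  # recognition failed
    return LANGUAGE_MODEL.get(reading)


def translate(text: str) -> Optional[str]:
    """Look up the target-language text for the recognized text."""
    return TRANSLATION_TABLE.get(text)


def synthesize(text: str) -> bytes:
    """Stand-in for speech synthesis: returns UTF-8 bytes, not a waveform."""
    return text.encode("utf-8")


def speech_translate(audio: bytes) -> Optional[bytes]:
    """Run recognition, translation, and synthesis in sequence."""
    text = recognize(audio)
    if text is None:
        return None
    translated = translate(text)
    if translated is None:
        return None
    return synthesize(translated)
```

Because each stage consumes the previous stage's output, a failure or error in `recognize` propagates directly to the final result.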

Patent Document 1: JP 2015-40946 A
Patent Document 2: JP 2011-22813 A
Patent Document 3: JP 2005-284880 A

In the conventional speech translation system described above, however, if the speech recognition engine misrecognizes the input speech, the erroneous recognition result is translated, so the translation result is also wrong. Possible measures for lowering the misrecognition rate include increasing the accuracy of the speech recognition processing, lengthening the unit of recognition (processing multiple words together as far as possible), and having the speaker re-enter the speech or confirm the speech recognition result.

However, increasing the accuracy of speech recognition requires heavy processing over a large vocabulary, which tends to increase processing time; using hardware capable of high-speed processing to compensate raises the device cost. Lengthening the unit of recognition causes groups of multiple words to be output as the recognition result, which makes subsequent handling in the translation engine complicated or cumbersome, so translation may take more effort or translation accuracy may fall. Furthermore, asking the speaker to re-enter the speech or confirm the recognition result complicates processing such as generating and displaying the corresponding messages, and lengthens the time needed to obtain the final translation result, which increases the burden on the user (speaker) and reduces convenience.

The present invention has been made in view of these circumstances, and its object is to provide a speech translation system, a speech translation method, and a speech translation program that can easily improve the accuracy of speech recognition and, in turn, of speech translation.

To solve the above problem, the present inventor conducted intensive research, including the collection of sampling data and the analysis of test data. As a result, the inventor found that, because an ordinary speech recognition engine uses, for example, acoustic models created from the analysis of standard adult speech data, the speech recognition rate tends to be low when, for example, an elderly person or a child speaks. This finding led to the completion of the present invention.

That is, a speech translation system according to one aspect of the present invention includes a speech input unit for inputting a speaker's speech, a speech recognition unit that recognizes the content of the speech input to the speech input unit, a translation unit that translates the content recognized by the speech recognition unit into content in a different language, a speech synthesis unit that synthesizes speech of the content translated by the translation unit, a speech output unit that outputs the speech synthesized by the speech synthesis unit, and a storage unit. The storage unit stores, as a speech database, unrecognizable speech that the speech recognition unit could not recognize, together with the correct recognition content of that unrecognizable speech. The speech recognition unit executes a speech query process in which it refers to the speech database, matches the speech of another speaker input to the speech input unit against the unrecognizable speech stored in the speech database, and, based on the matching result (for example, the degree of match or similarity between the two), provides the correct recognition content of that speech to the translation unit.
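As a rough illustration of matching new speech against stored unrecognizable speech, the sketch below represents each utterance as a feature vector and uses cosine similarity with a threshold. The feature representation, the similarity measure, the threshold of 0.9, and all names are assumptions made for this example; the text only says that the result is based on a degree of match or similarity.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class UnrecognizedEntry:
    features: Sequence[float]  # hypothetical feature vector of the stored speech
    correct_reading: str       # correct recognition content for that speech


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def speech_query(database: Iterable[UnrecognizedEntry],
                 features: Sequence[float],
                 threshold: float = 0.9) -> Optional[str]:
    """Return the correct reading of the most similar stored entry,
    or None if no entry reaches the similarity threshold."""
    best_reading = None
    best_score = threshold
    for entry in database:
        score = cosine_similarity(entry.features, features)
        if score >= best_score:
            best_reading, best_score = entry.correct_reading, score
    return best_reading
```

A returned reading would be handed to the translation step in place of a failed recognition result; `None` means the database offers no sufficiently similar entry.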

Note that the "correct recognition content of unrecognizable speech" is, in other words, the correct "reading" of that unrecognizable speech. It can be obtained, for example, by asking the speaker what vocabulary the unrecognizable speech stored in the storage unit contained, or by having another person actually listen to it and recognize it, thereby determining its correct reading. In the latter case, unrecognizable speech that cannot be recognized by listeners of any age or generation may, for example, be excluded from storage in the storage unit. Furthermore, although "speaker" and "other speaker" are used as distinct terms for convenience in specifying the present invention, the case where the "speaker" and the "other speaker" are the same person is also within the technical scope of the present invention.

The timing at which the speech recognition unit executes the matching process is not particularly limited. For example, the speech query process may be executed when the other speaker's speech could not be recognized (that is, after speech recognition has been attempted once), or before the other speaker's speech is recognized.
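The two timings mentioned here can be sketched as a single function with a flag. The callables `recognize` and `query` and the flag `query_first` are hypothetical names introduced for this example; both callables are assumed to return `None` on failure.

```python
from typing import Callable, Optional

# Assumed signature: take audio bytes, return recognized text or None.
Lookup = Callable[[bytes], Optional[str]]


def recognize_with_query(audio: bytes,
                         recognize: Lookup,
                         query: Lookup,
                         query_first: bool = False) -> Optional[str]:
    """Combine ordinary recognition with a database query.

    query_first=False: query the database only after recognition fails.
    query_first=True:  query the database before attempting recognition.
    """
    if query_first:
        hit = query(audio)
        return hit if hit is not None else recognize(audio)
    result = recognize(audio)
    return result if result is not None else query(audio)
```

Querying first avoids a recognition attempt when the database already has a match, while querying after failure leaves ordinary recognition untouched for speech it handles well.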

The speech translation system according to one aspect of the present invention may further include an information acquisition unit that acquires information on attributes of the speaker and the other speaker. The storage unit may then store the unrecognizable speech and its correct recognition content as the speech database in association with the speaker's attributes. Furthermore, the speech recognition unit may match the other speaker's attributes against the attributes stored in the speech database and execute the speech query process based on the matching result (for example, the degree of match or similarity between the two).

Specifically, the "attribute" may be, for example, the speaker's age or age range (which may also be called generation) or gender. In this case, even for vocabulary with the same "reading," multiple data records may be created as part of the speech database and stored in the storage unit to cover cases where intonation or tone (syllable tone, word tone, phrase tone, sentence tone, and so on) differs by, for example, generation or gender. Information on the "attribute" can be acquired, for example, by having the user (speaker) fill in a user information registration screen when using a service provided by the speech translation system, or when installing and using the application that constitutes the speech translation program on a computer such as an information terminal, or by having the user answer a questionnaire about attributes when using the speech translation system.
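One way to picture attribute-tagged records is sketched below: several records share the same reading but differ in age range and gender, and the query narrows the candidates to those whose attributes best match the current speaker. The record layout, the decade-based age ranges, and the scoring rule are all assumptions for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class AttributedRecord:
    reading: str     # correct recognition content
    age_range: str   # e.g. "70s" (hypothetical decade bucket)
    gender: str      # e.g. "female"


def age_to_range(age: int) -> str:
    """Map an age to a decade bucket such as '70s' (an assumed convention)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return f"{(age // 10) * 10}s"


def attribute_score(record: AttributedRecord, age_range: str, gender: str) -> int:
    """Count how many of the two attributes match (0, 1, or 2)."""
    return int(record.age_range == age_range) + int(record.gender == gender)


def best_matching_records(records: Sequence[AttributedRecord],
                          age_range: str,
                          gender: str) -> List[AttributedRecord]:
    """Return the records with the highest attribute match score."""
    if not records:
        return []
    top = max(attribute_score(r, age_range, gender) for r in records)
    return [r for r in records if attribute_score(r, age_range, gender) == top]
```

The narrowed list would then be compared acoustically with the input speech, so records from speakers with similar attributes are considered first.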

Alternatively, a speech translation system according to one aspect of the present invention includes a speech input unit for inputting a speaker's speech, a speech recognition unit that recognizes the content of the speech input to the speech input unit, a translation unit that translates the content recognized by the speech recognition unit into content in a different language, a speech synthesis unit that synthesizes speech of the content translated by the translation unit, a speech output unit that outputs the speech synthesized by the speech synthesis unit, a storage unit that stores unrecognizable speech that the speech recognition unit could not recognize, and an acoustic model generation unit that generates a second acoustic model by applying adaptation processing using the unrecognizable speech to a first acoustic model used by the speech recognition unit to recognize input speech. The speech translation system according to this aspect may be combined with the speech database described above and the speech query process that uses it.

In this case, the speech translation system may further include an information acquisition unit that acquires information on the speaker's attributes; the acoustic model generation unit may generate a second acoustic model for each speaker attribute, for example for each age or age range (generation) or gender of the speaker; and the speech recognition unit may recognize the content of the input speech using the second acoustic model corresponding to the attributes of the other speaker.
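The text does not specify the adaptation algorithm, so the following is only a toy sketch under strong assumptions: the "acoustic model" is reduced to a single mean feature vector, and adaptation interpolates it toward the mean of the unrecognizable speech samples. The names `adapt`, `ModelRegistry`, and the `(age_range, gender)` key are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Model = List[float]              # toy acoustic model: one mean feature vector
AttributeKey = Tuple[str, str]   # (age_range, gender)


def adapt(first_model: Sequence[float],
          samples: Sequence[Sequence[float]],
          weight: float = 0.5) -> Model:
    """Toy adaptation: move the first model toward the mean of the samples."""
    if not samples:
        return list(first_model)
    dim = len(first_model)
    mean = [sum(s[i] for s in samples) / len(samples) for i in range(dim)]
    return [(1.0 - weight) * m + weight * mu for m, mu in zip(first_model, mean)]


class ModelRegistry:
    """Holds the first model and one adapted second model per attribute key."""

    def __init__(self, first_model: Sequence[float]) -> None:
        self.first_model: Model = list(first_model)
        self.second_models: Dict[AttributeKey, Model] = {}

    def build(self, key: AttributeKey,
              samples: Sequence[Sequence[float]],
              weight: float = 0.5) -> None:
        """Generate and store the second model for one attribute group."""
        self.second_models[key] = adapt(self.first_model, samples, weight)

    def select(self, key: AttributeKey) -> Model:
        """Use the matching second model, falling back to the first model."""
        return self.second_models.get(key, self.first_model)
```

Falling back to the first model for attribute groups without an adapted model is a design choice of this sketch, not something stated in the text.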

A method according to one aspect of the present invention for controlling a speech translation system that includes a speech input unit, a speech recognition unit, a translation unit, a speech output unit, and a storage unit has the following steps. Namely, the method includes: inputting a speaker's speech by the speech input unit; recognizing, by the speech recognition unit, the content of the speech input to the speech input unit; translating, by the translation unit, the content recognized by the speech recognition unit into content in a different language; synthesizing, by the speech synthesis unit, speech of the content translated by the translation unit; outputting, by the speech output unit, the speech synthesized by the speech synthesis unit; and storing, by the storage unit, as a speech database, speech that the speech recognition unit could not recognize (or cannot, or is not expected to be able to, recognize; the same applies hereinafter) ("unrecognizable speech") together with the correct recognition content of that unrecognizable speech. In the step of recognizing the content of the speech, a speech query process is executed in which the speech database is referred to, the speech of another speaker input to the speech input unit is matched against the unrecognizable speech stored in the speech database, and, based on the matching result, the correct recognition content of the speech is provided to the translation unit.

A speech translation program according to one aspect of the present invention causes a computer (not limited to a single computer or a single type, and may be multiple computers or multiple types; the same applies hereinafter) to function as a speech input unit for inputting a speaker's speech, a speech recognition unit that recognizes the content of the speech input to the speech input unit, a translation unit that translates the content recognized by the speech recognition unit into content in a different language, a speech synthesis unit that synthesizes speech of the content translated by the translation unit, a speech output unit that outputs the speech synthesized by the speech synthesis unit, and a storage unit. The program causes the storage unit to store, as a speech database, unrecognizable speech that the speech recognition unit could not recognize, together with the correct recognition content of that unrecognizable speech. The program also causes the speech recognition unit to execute a speech query process in which it refers to the speech database, matches the speech of another speaker input to the speech input unit against the unrecognizable speech stored in the speech database, and, based on the matching result, provides the correct recognition content of that speech to the translation unit.

According to the present invention, by matching speech uttered by a user against unrecognizable speech stored in advance in the speech database, the correct recognition content of speech that cannot be recognized by ordinary processing can be obtained easily. The accuracy of speech recognition, and in turn of speech translation, can therefore be improved simply and efficiently, without the increase in device cost that comes with making speech recognition itself more accurate and faster, without complicating the various processes or lowering translation accuracy, and without increasing the burden on the user or reducing convenience.

FIG. 1 is a system block diagram schematically showing a preferred embodiment of the network configuration and related elements of a speech translation system according to the present invention.
FIG. 2 is a system block diagram schematically showing an example of the configuration of the user device (information terminal) in the speech translation system according to the present invention.
FIG. 3 is a system block diagram schematically showing an example of the configuration of the server in the speech translation system according to the present invention.
FIG. 4 is a flowchart showing an example of processing, including speech database construction, in the speech translation system according to the present invention.
FIG. 5 is a flowchart showing an example of processing, including a speech query, in the speech translation system according to the present invention.
FIG. 6 is a flowchart showing another example of processing, including a speech query, in the speech translation system according to the present invention.
FIG. 7 is a flowchart showing another example of processing, including speech database construction, in the speech translation system according to the present invention.
FIG. 8 is a flowchart showing another example of processing, including a speech query, in the speech translation system according to the present invention.
FIG. 9 is a flowchart showing another example of processing, including a speech query, in the speech translation system according to the present invention.
FIG. 10 is a flowchart showing an example of processing, including acoustic model generation (improvement), in the speech translation system according to the present invention.

Embodiments of the present invention are described in detail below. The following embodiments are examples for explaining the present invention and are not intended to limit the present invention to those embodiments alone. The present invention can be modified in various ways without departing from its gist. Those skilled in the art can also adopt embodiments in which the elements described below are replaced with equivalents, and such embodiments are likewise included in the scope of the present invention. Positional relationships such as up, down, left, and right, where indicated, are based on the drawings unless otherwise noted. The dimensional ratios in the drawings are not limited to the ratios illustrated.

(System configuration)
FIG. 1 is a system block diagram schematically showing a preferred embodiment of the network configuration and related elements of a speech translation system according to the present invention. The speech translation system 100 includes a server 20 that is electronically connected, via a network N, to an information terminal 10 (user device) used by a user (speaker or other speaker).

The information terminal 10 employs, for example, a user interface such as a touch panel and a highly visible display. The information terminal 10 here is a portable tablet-type terminal device, a category that includes mobile phones typified by smartphones, with a function for communicating over the network N. The information terminal 10 includes a processor 11, a storage resource 12, a speech input/output device 13, a communication interface 14, an input device 15, a display device 16, and a camera 17. By running the installed speech translation application software (at least part of a speech translation program according to an embodiment of the present invention), the information terminal 10 functions as part or all of the speech translation system according to an embodiment of the present invention.

The processor 11 consists of an arithmetic logic unit and various registers (program counter, data registers, instruction register, general-purpose registers, and so on). The processor 11 interprets and executes the speech translation application software, which is a program P10 stored in the storage resource 12, and performs various kinds of processing. The speech translation application software serving as the program P10 can be distributed, for example, from the server 20 over the network N, and may be installed and updated manually or automatically.

The network N is, for example, a communication network made up of a mixture of wired networks (a local area network (LAN), a wide area network (WAN), a value-added network (VAN), and the like) and wireless networks (a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), HSDPA (High Speed Downlink Packet Access), and the like).

The storage resource 12 is a logical device provided by the storage area of a physical device (for example, a computer-readable recording medium such as a semiconductor memory), and stores the operating system program, driver programs, various data, and other items used in the processing of the information terminal 10. Examples of the driver programs include an input/output device driver program for controlling the speech input/output device 13, an input device driver program for controlling the input device 15, and an output device driver program for controlling the display device 16. The speech input/output device 13 is, for example, a general-purpose microphone and a sound player capable of playing back sound data.

The communication interface 14 provides, for example, a connection interface to the server 20, and consists of a wireless communication interface and/or a wired communication interface. The input device 15 provides an interface that accepts input operations made, for example, by tapping icons, buttons, a virtual keyboard, and the like displayed on the display device 16. Besides a touch panel, examples include various input devices externally attached to the information terminal 10.

The display device 16 serves as an image display interface that presents various kinds of information to the user and, as needed, to the other party in the conversation. Examples include an organic EL display, a liquid crystal display, and a CRT display. The camera 17 captures still images and video of various subjects.

The server 20 is implemented, for example, by a host computer with high processing capability, and provides its server functions by running a predetermined server program on that host computer. It consists of, for example, one or more host computers functioning as a speech recognition server, a translation server, and a speech synthesis server (a single computer is shown in the drawing, but the configuration is not limited to this). Each server 20 includes a processor 21, a communication interface 22, and a storage resource 23 (storage unit).

The processor 21 consists of an arithmetic logic unit that handles arithmetic operations, logical operations, bit operations, and the like, and various registers (program counter, data registers, instruction register, general-purpose registers, and so on). It interprets and executes a program P20 stored in the storage resource 23 and outputs the results of predetermined computations. The communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N, for example a modulation/demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.

The storage resource 23 is, for example, a logical device provided by the storage area of a physical device (a computer-readable recording medium such as a disk drive or a semiconductor memory), and stores one or more each of the program P20, various modules L20, various databases D20, and various models M20.

The program P20 is, for example, the server program mentioned above, which is the main program of the server 20. The various modules L20 are software modules (modularized subprograms) that are called and executed as appropriate while the program P20 is running, in order to carry out the series of information processing associated with requests and information sent from the information terminal 10. Examples of the modules L20 include a speech recognition module, a translation module, and a speech synthesis module.

Examples of the various databases D20 include the corpora needed for speech translation processing (for Japanese-English speech translation, for example, a Japanese speech corpus, an English speech corpus, a Japanese character (vocabulary) corpus, an English character (vocabulary) corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, and a Japanese-English parallel corpus), the speech database described later, and a management database for managing information about users. Examples of the various models M20 include the acoustic models and language models used for the speech recognition described later.

An example of the operation and behavior of speech translation processing in the speech translation system 100 configured as described above is explained further below.

[First Embodiment]
(Speech database construction process 1 in speech translation)
FIG. 4 is a flowchart showing an example of processing, including speech database construction, in the speech translation system 100. This speech database construction forms part of the speech translation processing performed by the speech translation system 100.

The user (speaker) first launches the application by tapping the icon (not shown) of the speech translation application software displayed on the display device 16 of the information terminal 10. The display device 16 then shows, as appropriate, a screen for selecting the languages for speech translation, on which the user can select the user's own language (here, "Japanese") and, for example, the language of the other party in the conversation (here, "English"). After that, when the display device 16 shows a speech input screen that accepts the user's utterance, speech input through the speech input/output device 13 becomes possible.

In this state, when the user (speaker) inputs speech in, for example, Japanese (step ST1), the processor 11 generates a speech signal from that speech input and sends the speech signal to the server 20 through the communication interface 14 and the network N. In this way, the information terminal 10 itself, or the processor 11 and the speech input/output device 13, functions as the "speech input unit."
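The text does not define the message format used between the information terminal 10 and the server 20. Purely as an illustration, the sketch below packs the speech signal and the selected language pair into a JSON message with base64-encoded audio. The field names and the functions `build_request` and `parse_request` are hypothetical.

```python
import base64
import json
from typing import Tuple


def build_request(audio: bytes, source_lang: str, target_lang: str) -> str:
    """Terminal side: serialize the speech signal and language pair."""
    payload = {
        "source_lang": source_lang,   # e.g. "ja" (assumed language code)
        "target_lang": target_lang,   # e.g. "en"
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    }
    return json.dumps(payload)


def parse_request(message: str) -> Tuple[bytes, str, str]:
    """Server side: recover the speech signal and language pair."""
    payload = json.loads(message)
    audio = base64.b64decode(payload["audio_b64"])
    return audio, payload["source_lang"], payload["target_lang"]
```

Base64 is used here only because JSON cannot carry raw bytes; a binary transport would avoid the encoding overhead.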

The processor 21 of the server 20 receives the speech signal through the communication interface 22 and performs speech recognition processing (step SJ1). At this point, the processor 21 calls the necessary module L20, database D20, and model M20 (speech recognition module, Japanese speech corpus, acoustic model, language model, and so on) from the storage resource 23 and converts the "sound" of the input speech into a "reading" (characters). In this way, the processor 21 functions as the "speech recognition unit," and the server 20 as a whole functions as a "speech recognition server."

If the speech was recognized ("Yes" in step SJ1), the processor 21 proceeds to multilingual translation processing, which translates the "reading" (characters) of the recognized speech into another language (step SJ2). At this point, the processor 21 calls the necessary module L20 and database D20 (translation module, Japanese character corpus, Japanese dictionary, English dictionary, Japanese-English bilingual dictionary, Japanese-English parallel corpus, and so on) from the storage resource 23, rearranges the "reading" (character string) of the input speech obtained as the recognition result as appropriate and converts it into Japanese phrases, clauses, sentences, and the like, extracts the English corresponding to the converted result, and rearranges it according to English grammar into natural English phrases, clauses, sentences, and the like. In this way, the processor 21 also functions as the "translation unit," and the server 20 as a whole functions as a "translation server."

一方、音声の認識が「不可」であった場合（ステップＳＪ１において「Ｎｏ」）、プロセッサ２１は、音声データベース構築処理（ステップＳＪ５）へ移行する。ここでは、認識できなかった音声を、記憶資源２３に確保されたデータベースＤ２０の１つである音声データベースの領域に、「認識不可音声」として記憶し蓄積していく（ステップＳＪ５１）。それから、適宜のタイミングで、その「認識不可音声」の正しい認識内容を取得し、同じ音声データベースに記憶させる（ステップＳＪ５２）。具体的には、この場合の取得方法として、例えば、以下に列挙する（１）乃至（３）の手法が挙げられる。何れの場合においても、サーバ２０のプロセッサ２１は、「正しい認識内容」を「認識不可音声」に関連付けて記憶資源２３の音声データベースへ保存する。 On the other hand, if the speech recognition is “impossible” (“No” in step SJ1), the processor 21 proceeds to a speech database construction process (step SJ5). Here, the voice that could not be recognized is stored and accumulated as “unrecognizable voice” in the voice database area that is one of the databases D20 secured in the storage resource 23 (step SJ51). Then, at the appropriate timing, the correct recognition content of the “unrecognizable voice” is acquired and stored in the same voice database (step SJ52). Specifically, examples of the acquisition method in this case include the methods (1) to (3) listed below. In any case, the processor 21 of the server 20 stores the “correct recognition content” in the voice database of the storage resource 23 in association with “unrecognizable voice”.

（１）発話したユーザに、音声が認識不可であった旨を情報端末１０に表示する等してその場で伝え、その音声の正しい「読み」（文字）を情報端末１０から直ちに入力してもらう。情報端末１０のプロセッサ１１は、その正しい読み（つまり正しい認識内容）をその都度、サーバ２０へ送信する。 (1) The user who spoke is told on the spot that the voice could not be recognized, for example by displaying a message on the information terminal 10, and is asked to immediately enter the correct “reading” (characters) of the voice from the information terminal 10. The processor 11 of the information terminal 10 transmits the correct reading (that is, the correct recognition content) to the server 20 each time.

（２）「認識不可音声」が記憶資源２３にある程度蓄積されてから、属性（例えば年齢や性別）が種々異なる人々に、それらの音声を聞いてもらい、正しく認識された場合に、その正しい読み（つまり正しい認識内容）をその都度又は一括で、サーバ２０へ送信又は入力する。 (2) After a certain amount of “unrecognizable voice” has accumulated in the storage resource 23, people with various attributes (for example, age and gender) are asked to listen to the voices, and when a voice is correctly recognized, the correct reading (that is, the correct recognition content) is transmitted or input to the server 20 each time or in a batch.

（３）情報端末１０で実行する音声翻訳アプリケーションのメニューに、「認識不可音声」の認識への協力を依頼するアンケート形式のページやカラムを用意しておき、音声翻訳アプリケーションを実行した（不特定の）ユーザに、単数又は複数の「認識不可音声」を聞いてもらい、その正しい読み（つまり正しい認識内容）をその都度、サーバ２０へ送信する。 (3) A questionnaire-style page or column requesting cooperation in recognizing “unrecognizable voice” is provided in the menu of the speech translation application executed on the information terminal 10. (Unspecified) users who run the application are asked to listen to one or more “unrecognizable voices”, and the correct reading (that is, the correct recognition content) is transmitted to the server 20 each time.
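
The two-stage construction described above (accumulate the unrecognizable voice in step SJ51, attach the correct recognition content later in step SJ52) can be sketched as follows. This is a minimal in-memory illustration, not the patented implementation: the class and method names (`VoiceDatabase`, `store_unrecognized`, `attach_correct_text`, `pending`) are hypothetical, and a real system would persist audio in the storage resource 23 rather than in a Python dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class UnrecognizedEntry:
    audio: bytes                        # raw audio that recognition failed on
    correct_text: Optional[str] = None  # filled in later (step SJ52)


class VoiceDatabase:
    """Toy in-memory stand-in for the voice database (hypothetical API)."""

    def __init__(self) -> None:
        self._entries: Dict[int, UnrecognizedEntry] = {}
        self._next_id = 0

    def store_unrecognized(self, audio: bytes) -> int:
        """Step SJ51: accumulate a voice that could not be recognized."""
        entry_id = self._next_id
        self._entries[entry_id] = UnrecognizedEntry(audio)
        self._next_id += 1
        return entry_id

    def attach_correct_text(self, entry_id: int, text: str) -> None:
        """Step SJ52: associate the correct recognition content with the voice."""
        self._entries[entry_id].correct_text = text

    def pending(self) -> List[int]:
        """IDs still lacking a correct reading, e.g. to show to listeners in (2) or (3)."""
        return [i for i, e in self._entries.items() if e.correct_text is None]

    def get(self, entry_id: int) -> UnrecognizedEntry:
        return self._entries[entry_id]


db = VoiceDatabase()
first = db.store_unrecognized(b"\x00\x01")
second = db.store_unrecognized(b"\x02\x03")
db.attach_correct_text(first, "konnichiwa")  # e.g. typed in by the speaker, method (1)
```

Keeping the correct reading optional reflects that methods (2) and (3) supply it at a later, unspecified time, so entries can sit in the database without one.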

次に、音声の内容の翻訳が完了すると、プロセッサ２１は、音声合成処理へ移行する（ステップＳＪ３）。このとき、プロセッサ２１は、記憶資源２３から、必要なモジュールＬ２０、データベースＤ２０、及びモデルＭ２０（音声合成モジュール、英語音声コーパス、音響モデル、言語モデル等）を呼び出し、翻訳結果である英語の句、節、文等を自然な音声に変換する。このとおり、プロセッサ２１は、「音声合成部」としても機能し、サーバ２０は、全体として「音声合成サーバ」として機能する。 Next, when translation of the speech content is complete, the processor 21 proceeds to speech synthesis processing (step SJ3). At this time, the processor 21 calls the necessary module L20, database D20, and model M20 (speech synthesis module, English speech corpus, acoustic model, language model, etc.) from the storage resource 23 and converts the English phrases, clauses, sentences, and the like that are the translation result into natural speech. In this way, the processor 21 also functions as a “speech synthesis unit”, and the server 20 as a whole functions as a “speech synthesis server”.

次いで、プロセッサ２１は、合成された音声に基づいて音声出力用の音声信号を生成し、通信インターフェイス２２及びネットワークＮを通して、情報端末１０へ送信する。情報端末１０のプロセッサ１１は、通信インターフェイス１４を通してその音声信号を受信し、音声出力処理を行う（ステップＳＴ２）。 Next, the processor 21 generates a voice signal for voice output based on the synthesized voice, and transmits the voice signal to the information terminal 10 through the communication interface 22 and the network N. The processor 11 of the information terminal 10 receives the audio signal through the communication interface 14 and performs an audio output process (step ST2).
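
The overall flow of FIG. 4 (recognition in step SJ1, translation in step SJ2, synthesis in step SJ3, with the database construction branch of step SJ5 on failure) can be sketched as a single function. The engines are passed in as callables because the description does not specify them; `run_translation_flow` and the `fake_*` helpers are hypothetical names used only for illustration.

```python
from typing import Callable, List, Optional


def run_translation_flow(
    audio: bytes,
    recognize: Callable[[bytes], Optional[str]],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    unrecognized_store: List[bytes],
) -> Optional[bytes]:
    """Return synthesized audio, or None when recognition fails."""
    reading = recognize(audio)            # step SJ1
    if reading is None:                   # "No" branch of step SJ1
        unrecognized_store.append(audio)  # step SJ51 (database construction)
        return None
    translated = translate(reading)       # step SJ2
    return synthesize(translated)         # step SJ3


# Hypothetical stand-ins for the real engines, for demonstration only.
def fake_recognize(audio: bytes) -> Optional[str]:
    return {b"hello-audio": "konnichiwa"}.get(audio)


def fake_translate(text: str) -> str:
    return {"konnichiwa": "hello"}[text]


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


store: List[bytes] = []
ok = run_translation_flow(b"hello-audio", fake_recognize, fake_translate, fake_synthesize, store)
failed = run_translation_flow(b"mumble", fake_recognize, fake_translate, fake_synthesize, store)
```

Modelling "recognition impossible" as a `None` return is an assumption of this sketch; the description only states that step SJ1 yields a Yes/No outcome.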

（音声翻訳における音声照会処理１−１） (Voice inquiry process 1-1 in speech translation)
図５は、音声翻訳システム１００における音声照会を含む処理の一例を示すフローチャートである。かかる音声照会は、音声翻訳システム１００による音声翻訳処理の一部を構成する。ここでの処理は、「ユーザ（発話者）」に代えて「ユーザ（他の発話者）」が音声入力を行い、ステップＳＪ５に代えてステップＳＪ６の処理を実行すること以外は、図４に示す処理と実質的に同一である。よって、以下、この相違点に関連する処理以外の処理については説明を省略する。また、図５に示す音声翻訳処理は、音声照会処理をユーザ（他の発話者）の音声を認識できなかったとき（音声認識を一旦実行した後）に実行する手順の一例である。 FIG. 5 is a flowchart illustrating an example of processing including a voice inquiry in the speech translation system 100. Such a voice inquiry constitutes a part of the speech translation processing by the speech translation system 100. The processing here is substantially the same as that shown in FIG. 4, except that a “user (another speaker)” performs the voice input instead of the “user (speaker)” and that step SJ6 is executed instead of step SJ5. Therefore, description of processing other than that related to these differences is omitted below. The speech translation processing shown in FIG. 5 is an example of a procedure in which the voice inquiry process is executed when the voice of the user (another speaker) could not be recognized (that is, after speech recognition has been attempted once).

図４に示す処理と同様に音声の認識が「可」であった場合（ステップＳＪ１において「Ｙｅｓ」）には、プロセッサ２１は、多言語翻訳処理へ移行する（ステップＳＪ２）一方、音声の認識が「不可」であった場合（ステップＳＪ１において「Ｎｏ」）、プロセッサ２１は、音声照会処理（ステップＳＪ６）へ移行する。 As in the processing shown in FIG. 4, if the speech is recognized (“Yes” in step SJ1), the processor 21 proceeds to multilingual translation processing (step SJ2); if the speech is not recognized (“No” in step SJ1), the processor 21 proceeds to the voice inquiry process (step SJ6).

ここで、プロセッサ２１は、記憶資源２３から音声データベースを呼び出して参照し、認識できなかった音声を、その音声データベースに記憶された「認識不可音声」と照合する（ステップＳＪ６１）。このとき、例えば、両者の音声マッチングにおける一致度又は類似度等が所定の値以上であると判断された場合、プロセッサ２１は、音声データベースに記憶されているその「認識不可音声」の「正しい認識内容」を、認識できなかった音声の「正しい認識内容」として多言語翻訳処理（ステップＳＪ２）側へ出力する（ステップＳＪ６２）。 Here, the processor 21 calls and refers to the voice database from the storage resource 23 and collates the voice that could not be recognized with the “unrecognizable voice” stored in the voice database (step SJ61). At this time, for example, if the degree of coincidence or similarity in the voice matching between the two is determined to be equal to or greater than a predetermined value, the processor 21 outputs the “correct recognition content” of that “unrecognizable voice” stored in the voice database to the multilingual translation processing (step SJ2) as the “correct recognition content” of the voice that could not be recognized (step SJ62).
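
The collation in steps SJ61 and SJ62 amounts to a thresholded nearest-match lookup. The sketch below assumes each voice has already been reduced to a fixed-length feature vector and uses cosine similarity as the matching score; both are assumptions made for illustration, since the description only requires that a degree of coincidence or similarity reach a predetermined value. The function names and the default threshold of 0.9 are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_voice_database(
    features: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.9,
) -> Optional[str]:
    """Step SJ61: collate against stored unrecognizable voices.

    Returns the stored correct recognition content of the best match whose
    similarity is at least `threshold` (step SJ62), or None if nothing matches.
    """
    best_text: Optional[str] = None
    best_score = threshold
    for stored_features, correct_text in entries:
        score = cosine_similarity(features, stored_features)
        if score >= best_score:
            best_text, best_score = correct_text, score
    return best_text


entries = [([1.0, 0.0], "arigatou"), ([0.0, 1.0], "sumimasen")]
match = query_voice_database([0.9, 0.1], entries)     # close to the first entry
no_match = query_voice_database([1.0, 1.0], entries)  # similarity about 0.71 to both
```

Returning the best-scoring entry rather than the first entry above the threshold is a design choice of this sketch; the description does not say how ties or multiple matches are resolved.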

（音声翻訳における音声照会処理１−２） (Voice inquiry process 1-2 in speech translation)
図６は、音声翻訳システム１００における音声照会を含む処理の他の一例を示すフローチャートである。かかる音声照会も、音声翻訳システム１００による音声翻訳処理の一部を構成する。ここでの処理は、「ユーザ（発話者）」に代えて「ユーザ（他の発話者）」が音声入力（ステップＳＴ１）を行い、且つ、ステップＳＪ５に代えてステップＳＪ７の処理を実行すること以外は、図４に示す処理と実質的に同一である。よって、以下、この相違点に関連する処理以外の処理については説明を省略する。また、図６に示す音声翻訳処理は、ユーザ（他の発話者）の音声を認識する前に、音声照会処理を実行する手順の一例である。 FIG. 6 is a flowchart showing another example of processing including a voice inquiry in the speech translation system 100. Such a voice inquiry also constitutes a part of the speech translation processing by the speech translation system 100. The processing here is substantially the same as that shown in FIG. 4, except that a “user (another speaker)” performs the voice input (step ST1) instead of the “user (speaker)” and that step SJ7 is executed instead of step SJ5. Therefore, description of processing other than that related to these differences is omitted below. The speech translation processing shown in FIG. 6 is an example of a procedure in which the voice inquiry process is executed before the voice of the user (another speaker) is recognized.

すなわち、ユーザ（他の発話者）が日本語で音声入力し（ステップＳＴ１）、その音声信号を受信したサーバ２０のプロセッサ２１は、音声照会処理（ステップＳＪ７）へ移行する。ここで、プロセッサ２１は、記憶資源２３から音声データベースを呼び出して参照し、入力された音声を、その音声データベースに記憶された「認識不可音声」と照合する（ステップＳＪ７１；ステップＳＪ６１に対応）。 That is, the user (another speaker) inputs a voice in Japanese (step ST1), and the processor 21 of the server 20 that receives the voice signal moves to a voice inquiry process (step SJ7). Here, the processor 21 calls and refers to the voice database from the storage resource 23, and collates the input voice with the “unrecognizable voice” stored in the voice database (step SJ71; corresponding to step SJ61).

そして、プロセッサ２１は、入力された音声が音声データベースに記憶された「認識不可音声」に該当するか否かを判定する（ステップＳＪ７２；実質的にはステップＳＪ７１の処理に含まれる）。例えば、両者の音声マッチングにおける一致度又は類似度等が所定の値以上であると判断された場合、プロセッサ２１は、音声の「該当有り」（ステップＳＪ７２で「Ｙｅｓ」）として、その該当した「認識不可音声」の「正しい認識内容」を、入力された音声の「正しい認識内容」として多言語翻訳処理（ステップＳＪ２）側へ出力する（ステップＳＪ７３）。 Then, the processor 21 determines whether or not the input voice corresponds to an “unrecognizable voice” stored in the voice database (step SJ72; in practice, this is included in the processing of step SJ71). For example, if the degree of coincidence or similarity in the voice matching between the two is determined to be equal to or greater than a predetermined value, the processor 21 treats the voice as “match found” (“Yes” in step SJ72) and outputs the “correct recognition content” of the matching “unrecognizable voice” to the multilingual translation processing (step SJ2) as the “correct recognition content” of the input voice (step SJ73).

一方、両者の音声マッチングにおける一致度又は類似度等が所定の値未満であると判断された場合、プロセッサ２１は、音声の「該当無し」（ステップＳＪ７２で「Ｎｏ」）として、音声認識処理（ステップＳＪ１）へ移行する。すなわち、この場合、ユーザ（他の発話者）による音声は、「認識不可音声」ではないから、通常の音声認識処理によって認識されるか、或いは、その可能性が極めて高いこととなる。 On the other hand, if the degree of coincidence or similarity in the voice matching between the two is determined to be less than the predetermined value, the processor 21 treats the voice as “no match” (“No” in step SJ72) and proceeds to speech recognition processing (step SJ1). That is, in this case the voice of the user (another speaker) is not an “unrecognizable voice”, so it will be recognized by the normal speech recognition processing, or is extremely likely to be.

なお、以上の如く、図６に示す音声翻訳処理の例では、ユーザ（他の発話者）の音声を認識する前に、音声照会処理を実行するので、図５に示す音声認識処理（ステップＳＪ１）における判定処理は不要となる。 As described above, in the example of the speech translation processing shown in FIG. 6, the voice inquiry process is executed before the voice of the user (another speaker) is recognized, so the determination in the speech recognition processing (step SJ1) shown in FIG. 5 is not required.
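
The ordering used in FIG. 6 (inquiry in steps SJ71 to SJ73 first, ordinary recognition in step SJ1 only when no stored voice matches) can be sketched as below. The function name `recognize_with_prior_inquiry` is hypothetical, and the demonstration replaces similarity matching with an exact dictionary lookup purely to keep the example short.

```python
from typing import Callable, Optional, Tuple


def recognize_with_prior_inquiry(
    audio: bytes,
    inquire: Callable[[bytes], Optional[str]],
    recognize: Callable[[bytes], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (reading, source), where source names the path that produced it."""
    known = inquire(audio)                 # steps SJ71 and SJ72
    if known is not None:                  # "Yes" in step SJ72
        return known, "voice database"     # step SJ73
    return recognize(audio), "recognizer"  # "No" in step SJ72, then step SJ1


# Exact-match lookup is a stand-in for the similarity matching in the text.
stored = {b"mumbled-thanks": "arigatou"}
hit = recognize_with_prior_inquiry(b"mumbled-thanks", stored.get, lambda a: "unused")
miss = recognize_with_prior_inquiry(b"clear-hello", stored.get, lambda a: "konnichiwa")
```

Compared with the FIG. 5 ordering, this variant never has to wait for the recognizer to fail before consulting the database, at the cost of one lookup on every utterance.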

［第２実施形態］ [Second Embodiment]
（音声翻訳における音声データベース構築処理２） (Voice database construction process 2 in speech translation)
図７は、音声翻訳システム１００における音声データベース構築を含む処理の他の一例を示すフローチャートである。かかる音声データベース構築も、音声翻訳システム１００による音声翻訳処理の一部を構成する。ここでの処理は、ユーザ（発話者）による音声入力（ステップＳＴ１）に先立って、ユーザ（発話者）によるユーザ情報の入力（ステップＳＴ０）を実施し、且つ、「認識不可音声」の記憶・蓄積（ステップＳＪ５１）に先立って、「ユーザ情報」の記憶・蓄積を実施すること以外は、図４に示す処理と実質的に同一である。よって、以下、この相違点に関連する処理以外の処理については説明を省略する。 FIG. 7 is a flowchart showing another example of processing including construction of a voice database in the speech translation system 100. Such voice database construction also constitutes a part of the speech translation processing by the speech translation system 100. The processing here is substantially the same as that shown in FIG. 4, except that the user (speaker) inputs user information (step ST0) before the voice input (step ST1), and that the “user information” is stored and accumulated before the “unrecognizable voice” is stored and accumulated (step SJ51). Therefore, description of processing other than that related to these differences is omitted below.

ここでは、ユーザ（発話者）が音声翻訳アプリケーションを起動すると、例えば、音声翻訳の対象言語を選択する画面が情報端末１０の表示デバイス１６に表示される前に、或いは、対象言語を選択した後に、ユーザに関する情報を入力してもらうための情報登録画面が、情報端末１０の表示デバイス１６に表示される。ユーザに関する情報としては特に制限されないが、ユーザの年齢、性別、出身地、居住地等の属性情報が含まれる。 Here, when the user (speaker) starts the speech translation application, an information registration screen asking the user to enter information about himself or herself is displayed on the display device 16 of the information terminal 10, for example before the screen for selecting the target language for speech translation is displayed on the display device 16, or after the target language has been selected. The information about the user is not particularly limited, but includes attribute information such as the user's age, gender, birthplace, and place of residence.

この状態で、ユーザ（発話者）がユーザ情報を入力する（ステップＳＴ０）と、プロセッサ１１は、その情報入力に基づいて情報信号を生成し、その情報信号を通信インターフェイス１４及びネットワークＮを通してサーバ２０へ送信する。このとおり、情報端末１０自体又はプロセッサ１１が「情報取得部」としても機能する。 In this state, when the user (speaker) inputs user information (step ST0), the processor 11 generates an information signal based on the input and transmits the information signal to the server 20 through the communication interface 14 and the network N. In this way, the information terminal 10 itself or the processor 11 also functions as an “information acquisition unit”.

サーバ２０のプロセッサ２１は、通信インターフェイス２２を通してその情報信号を受信すると、音声データベース構築処理（ステップＳＪ５）へ移行する。ここでは、入力されたユーザ情報（属性を含む）を、記憶資源２３に確保されたデータベースＤ２０の１つである音声データベースの領域に記憶する（ステップＳＪ５０）。次いで、ユーザ（発話者）による音声入力（ステップＳＴ１）、及び、それに続く音声認識処理（ステップＳＪ１）で音声の認識が不可であった場合、プロセッサ２１は、その「認識不可音声」を、そのユーザの属性情報に関連付けて、音声データベースに記憶し蓄積していく（ステップＳＪ５１）。また、適宜のタイミングで、その「認識不可音声」の正しい認識内容を取得し、同じ音声データベースに記憶する（ステップＳＪ５２）。 When the processor 21 of the server 20 receives the information signal through the communication interface 22, it proceeds to the voice database construction process (step SJ5). Here, the input user information (including attributes) is stored in the area of the voice database, which is one of the databases D20 secured in the storage resource 23 (step SJ50). Next, if the voice could not be recognized in the voice input by the user (speaker) (step ST1) and the subsequent speech recognition processing (step SJ1), the processor 21 stores and accumulates the “unrecognizable voice” in the voice database in association with the attribute information of that user (step SJ51). Further, the correct recognition content of the “unrecognizable voice” is acquired at an appropriate timing and stored in the same voice database (step SJ52).
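
Associating each unrecognizable voice with the speaker's attributes (steps SJ50 and SJ51) can be sketched by keying the stored entries on an attribute record. The names `UserAttributes`, `Entry`, and `AttributeVoiceDatabase` are hypothetical, and only age group and gender are modelled here even though the description also mentions birthplace and place of residence.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class UserAttributes:
    age_group: str  # e.g. "70s" or "under10" (hypothetical encoding)
    gender: str


@dataclass
class Entry:
    audio: bytes
    correct_text: Optional[str] = None  # set later, as in step SJ52


class AttributeVoiceDatabase:
    """Toy voice database that files entries under the speaker's attributes."""

    def __init__(self) -> None:
        self._by_attr: Dict[UserAttributes, List[Entry]] = defaultdict(list)

    def store_unrecognized(self, attrs: UserAttributes, audio: bytes) -> Entry:
        """Step SJ51: store the voice in association with the user's attributes."""
        entry = Entry(audio)
        self._by_attr[attrs].append(entry)
        return entry

    def entries_for(self, attrs: UserAttributes) -> List[Entry]:
        """All entries recorded for speakers with exactly these attributes."""
        return list(self._by_attr.get(attrs, []))


db = AttributeVoiceDatabase()
elder = UserAttributes(age_group="70s", gender="male")
child = UserAttributes(age_group="under10", gender="female")
entry = db.store_unrecognized(elder, b"audio-1")
db.store_unrecognized(child, b"audio-2")
entry.correct_text = "ohayou"
```

A frozen dataclass is used so that an attribute combination is hashable and can serve directly as a dictionary key.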

（音声翻訳における音声照会処理２−１） (Voice inquiry process 2-1 in speech translation)
図８は、音声翻訳システム１００における音声照会を含む処理の他の一例を示すフローチャートである。かかる音声照会も、音声翻訳システム１００による音声翻訳処理の一部を構成する。ここでの処理は、「ユーザ（発話者）」に代えて「ユーザ（他の発話者）」が音声入力を行い、ステップＳＪ５に代えてステップＳＪ６の処理を実行すること以外は、図７に示す処理と実質的に同一である。よって、以下、この相違点に関連する処理以外の処理については説明を省略する。また、図８に示す音声翻訳処理は、図５に示す例と同様に、ユーザ（他の発話者）の音声を認識できなかったとき（音声認識を一旦実行した後）に実行する手順の一例である。 FIG. 8 is a flowchart illustrating another example of processing including a voice inquiry in the speech translation system 100. Such a voice inquiry also constitutes a part of the speech translation processing by the speech translation system 100. The processing here is substantially the same as that shown in FIG. 7, except that a “user (another speaker)” performs the voice input instead of the “user (speaker)” and that step SJ6 is executed instead of step SJ5. Therefore, description of processing other than that related to these differences is omitted below. As in the example shown in FIG. 5, the speech translation processing shown in FIG. 8 is an example of a procedure executed when the voice of the user (another speaker) could not be recognized (that is, after speech recognition has been attempted once).

図７に示す処理と同様にユーザ（他の発話者）の情報が情報端末１０から入力されると、プロセッサ２１は、音声照会処理（ステップＳＪ６）へ移行する。ここで、プロセッサ２１は、記憶資源２３から音声データベースを呼び出して参照し、ユーザの属性（年齢、性別等）による絞り込みを行う。 When the user (other speaker) information is input from the information terminal 10 as in the process illustrated in FIG. 7, the processor 21 proceeds to the voice inquiry process (step SJ6). Here, the processor 21 calls and references the voice database from the storage resource 23, and narrows down by the user attributes (age, gender, etc.).

具体的には、例えば、ユーザが７０代の男性である場合、プロセッサ２１は、音声データベースに記憶されている「７０代の男性」に関連付けられた「認識不可音声」を抽出（選別）し、或いは、必要に応じて、それらの抽出された音声データから副次的な新たな音声データベースを作成する。同様に、例えば、ユーザが１０歳未満の女性（女の子）である場合、プロセッサ２１は、音声データベースに記憶されている「１０歳未満の女性」に関連付けられた「認識不可音声」を抽出（選別）し、或いは、必要に応じて、それらの抽出された音声データから副次的な新たな音声データベースを作成する。 Specifically, for example, when the user is a man in his 70s, the processor 21 extracts (selects) “unrecognizable speech” associated with “male in his 70s” stored in the speech database, Alternatively, if necessary, a new secondary voice database is created from the extracted voice data. Similarly, for example, when the user is a woman (girl) younger than 10 years, the processor 21 extracts (screens) “unrecognizable speech” associated with “female under 10 years” stored in the speech database. Or, if necessary, a new secondary audio database is created from the extracted audio data.

それから、図７に示す処理と同様に、音声入力（ステップＳＴ１）に引き続き音声の認識が「可」であった場合（ステップＳＪ１において「Ｙｅｓ」）には、プロセッサ２１は、多言語翻訳処理へ移行する（ステップＳＪ２）。一方、音声の認識が「不可」であった場合（ステップＳＪ１において「Ｎｏ」）、プロセッサ２１は、再び音声照会処理（ステップＳＪ６）へ移行する。ここで、プロセッサ２１は、記憶資源２３から音声データベースを呼び出して参照し、認識できなかった音声を、ユーザの属性に基づいて抽出された「認識不可音声」（又は、それらから作成された新たな音声データベースにおける「認識不可音声」）と照合する（ステップＳＪ６１）。 Then, as in the processing shown in FIG. 7, if the speech is recognized following the voice input (step ST1) (“Yes” in step SJ1), the processor 21 proceeds to multilingual translation processing (step SJ2). On the other hand, if the speech is not recognized (“No” in step SJ1), the processor 21 returns to the voice inquiry process (step SJ6). Here, the processor 21 calls and refers to the voice database from the storage resource 23 and collates the voice that could not be recognized with the “unrecognizable voice” extracted based on the user's attributes (or with the “unrecognizable voice” in the new voice database created from those extracts) (step SJ61).
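
Narrowing by attribute before collation, as in the “man in his 70s” example, can be sketched as a filter followed by a thresholded match. Everything named here is hypothetical: the `Record` layout, the attribute labels, the toy similarity `1 / (1 + distance)`, and the threshold of 0.8 are illustration choices, not details taken from the description.

```python
import math
from typing import Callable, List, NamedTuple, Optional, Sequence


class Record(NamedTuple):
    attributes: str            # e.g. "70s-male" (hypothetical label)
    features: Sequence[float]  # feature vector of the stored unrecognizable voice
    correct_text: str          # its correct recognition content


def distance_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Toy similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(a, b))


def inquire_by_attribute(
    features: Sequence[float],
    user_attributes: str,
    records: List[Record],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
    threshold: float = 0.8,
) -> Optional[str]:
    # Narrowing: keep only voices recorded for speakers with the same attributes.
    candidates = [r for r in records if r.attributes == user_attributes]
    best_text: Optional[str] = None
    best_score = threshold
    for record in candidates:  # step SJ61, restricted to the narrowed set
        score = similarity(features, record.features)
        if score >= best_score:
            best_text, best_score = record.correct_text, score
    return best_text


records = [
    Record("70s-male", [1.0, 2.0], "ohayou"),
    Record("under10-female", [1.0, 2.0], "arigatou"),
]
elder_result = inquire_by_attribute([1.0, 2.1], "70s-male", records, distance_similarity)
child_result = inquire_by_attribute([1.0, 2.1], "under10-female", records, distance_similarity)
other_result = inquire_by_attribute([1.0, 2.1], "20s-male", records, distance_similarity)
```

The two stored records deliberately share the same features: the attribute filter alone decides which correct reading is returned, which is the effect the narrowing step is meant to have.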

（音声翻訳における音声照会処理２−２） (Voice inquiry process 2-2 in speech translation)
図９は、音声翻訳システム１００における音声照会を含む処理の他の一例を示すフローチャートである。かかる音声照会も、音声翻訳システム１００による音声翻訳処理の一部を構成する。ここでの処理は、「ユーザ（発話者）」に代えて「ユーザ（他の発話者）」が情報入力（ステップＳＴ０）及び音声入力（ステップＳＴ１）を行い、且つ、ステップＳＪ５に代えてステップＳＪ７の処理を実行すること以外は、図７に示す処理と実質的に同一である。よって、以下、この相違点に関連する処理以外の処理については説明を省略する。また、図９に示す音声翻訳処理は、ユーザ（他の発話者）の音声を認識する前に、音声照会処理を実行する手順の一例である。 FIG. 9 is a flowchart showing another example of processing including a voice inquiry in the speech translation system 100. Such a voice inquiry also constitutes a part of the speech translation processing by the speech translation system 100. The processing here is substantially the same as that shown in FIG. 7, except that a “user (another speaker)” performs the information input (step ST0) and the voice input (step ST1) instead of the “user (speaker)”, and that step SJ7 is executed instead of step SJ5. Therefore, description of processing other than that related to these differences is omitted below. The speech translation processing shown in FIG. 9 is an example of a procedure in which the voice inquiry process is executed before the voice of the user (another speaker) is recognized.

すなわち、ユーザ（他の発話者）がユーザ情報を入力し（ステップＳＴ０）、その情報信号を受信したサーバ２０のプロセッサ２１は、音声照会処理（ステップＳＪ７）へ移行する。ここで、プロセッサ２１は、記憶資源２３から音声データベースを呼び出して参照し、ユーザの属性（年齢、性別等）による絞り込みを行う（ステップＳＪ７０；ステップＳＪ６０と同じ）。次いで、ユーザ（他の発話者）が日本語で音声入力し（ステップＳＴ１）、その音声信号を受信したプロセッサ２１は、記憶資源２３から音声データベースを呼び出して参照し、入力された音声を、ユーザの属性に基づいて抽出された「認識不可音声」（又は、それらから作成された新たな音声データベース「認識不可音声」）と照合する（ステップＳＪ７１；ステップＳＪ６１に対応）。 That is, the user (another speaker) inputs user information (step ST0), and the processor 21 of the server 20 that has received the information signal proceeds to the voice inquiry process (step SJ7). Here, the processor 21 calls and refers to the voice database from the storage resource 23 and narrows it down by the user's attributes (age, gender, etc.) (step SJ70; the same as step SJ60). Next, the user (another speaker) inputs a voice in Japanese (step ST1), and the processor 21 that has received the voice signal calls and refers to the voice database from the storage resource 23 and collates the input voice with the “unrecognizable voice” extracted based on the user's attributes (or with the “unrecognizable voice” in the new voice database created from those extracts) (step SJ71; corresponding to step SJ61).

そして、プロセッサ２１は、入力された音声が属性に関連付けられた「認識不可音声」に該当するか否かを判定する（ステップＳＪ７２；実質的にはステップＳＪ７１の処理に含まれる）。例えば、両者の音声マッチングにおける一致度又は類似度等が所定の値以上であると判断された場合、プロセッサ２１は、音声の「該当有り」（ステップＳＪ７２で「Ｙｅｓ」）として、その該当した「認識不可音声」の「正しい認識内容」を、入力された音声の「正しい認識内容」として多言語翻訳処理（ステップＳＪ２）側へ出力する（ステップＳＪ７３）。 Then, the processor 21 determines whether or not the input voice corresponds to an “unrecognizable voice” associated with the attributes (step SJ72; in practice, this is included in the processing of step SJ71). For example, if the degree of coincidence or similarity in the voice matching between the two is determined to be equal to or greater than a predetermined value, the processor 21 treats the voice as “match found” (“Yes” in step SJ72) and outputs the “correct recognition content” of the matching “unrecognizable voice” to the multilingual translation processing (step SJ2) as the “correct recognition content” of the input voice (step SJ73).

なお、以上の如く、図９に示す音声翻訳処理の例では、図６に示す例と同様に、ユーザ（他の発話者）の音声を認識する前に、音声照会処理を実行するので、図８に示すような音声認識処理（ステップＳＪ１）における判定処理は不要となる。 As described above, in the example of the speech translation processing shown in FIG. 9, as in the example shown in FIG. 6, the voice inquiry process is executed before the voice of the user (another speaker) is recognized, so the determination in the speech recognition processing (step SJ1) as shown in FIG. 8 is not required.

［第３実施形態］ [Third Embodiment]
（音声翻訳における音響モデル生成処理） (Acoustic model generation process in speech translation)
図１０は、音声翻訳システム１００における音響モデル生成（改良）を含む処理の一例を示すフローチャートである。かかる音響モデル生成も、音声翻訳システム１００による音声翻訳処理の一部を構成する。 FIG. 10 is a flowchart showing an example of processing including acoustic model generation (improvement) in the speech translation system 100. Such acoustic model generation also constitutes a part of the speech translation processing by the speech translation system 100.

ここでは、まず、図４（第１実施形態）に示す音声データベース構築までの処理（ステップＳＴ１，ＳＴ２，ＳＪ５）、又は、図７（第２実施形態）に示す音声データベース構築までの処理（ステップＳＴ０，ＳＴ１，ＳＴ２，ＳＪ５）を実行する。 Here, first, the process up to the construction of the speech database shown in FIG. 4 (first embodiment) (steps ST1, ST2, SJ5) or the process up to the construction of the speech database shown in FIG. 7 (second embodiment) (steps). ST0, ST1, ST2, SJ5) are executed.

次に、サーバ２０のプロセッサ２１は、処理をステップＳＪ８へ移行し、記憶資源２３から、モデルＭ２０として音声認識に使用する従来の音響モデル（第１の音響モデル）を呼び出す（ステップＳＪ８１）。それから、その従来の音響モデルに対して、音声データベースに記憶された「認識不可音声」を用いた公知の適応処理（例えば特許文献１において引用されている適応処理）を実施する（ステップＳＪ８２）ことにより、新たな音響モデル（第２の音響モデル）を生成し（ステップＳＪ８３）、記憶資源２３に記憶する。 Next, the processor 21 of the server 20 moves the process to step SJ8, and calls the conventional acoustic model (first acoustic model) used for speech recognition as the model M20 from the storage resource 23 (step SJ81). Then, a known adaptive process using the “unrecognizable speech” stored in the speech database (for example, the adaptive process cited in Patent Document 1) is performed on the conventional acoustic model (step SJ82). Thus, a new acoustic model (second acoustic model) is generated (step SJ83) and stored in the storage resource 23.

なお、このステップＳＪ８３においては、図４（第１実施形態）に示す処理で構築した音声データベースを用いる場合、従来の音響モデル１つから、新たな音響モデルを１つ生成することができる。また、図７（第２実施形態）に示す処理で構築した音声データベースを用いる場合、従来の音響モデル１つから、例えばユーザの属性毎（年齢、世代、性別等）に複数の新たな音響モデルを生成することができる。このとき、新たな音響モデルは、ユーザの属性に関連付けて、記憶資源２３に記憶される。以下、後者の如く、新たな音響モデルが属性毎に生成された場合を例に説明する。 In step SJ83, when the voice database constructed by the processing shown in FIG. 4 (first embodiment) is used, one new acoustic model can be generated from one conventional acoustic model. When the voice database constructed by the processing shown in FIG. 7 (second embodiment) is used, a plurality of new acoustic models can be generated from one conventional acoustic model, for example one for each user attribute (age, generation, gender, etc.). In that case, each new acoustic model is stored in the storage resource 23 in association with the user attribute. The following description takes the latter case, in which a new acoustic model is generated for each attribute, as an example.
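
Generating one adapted model per attribute from a single base model can be sketched as a group-then-adapt loop. The adaptation used below is a deliberately simple stand-in (the “model” is just a mean feature vector that is pulled halfway toward the mean of the group's unrecognizable voices); it is not the known adaptation technique the description refers to, and the names `adapt`, `build_models_per_attribute`, and the `weight` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Model = List[float]  # toy "acoustic model": a single mean feature vector


def adapt(base: Model, samples: List[Sequence[float]], weight: float = 0.5) -> Model:
    """Toy adaptation (steps SJ82 and SJ83): interpolate the base model toward
    the mean of the adaptation samples. `samples` must be non-empty."""
    count = len(samples)
    mean = [sum(s[i] for s in samples) / count for i in range(len(base))]
    return [(1.0 - weight) * b + weight * m for b, m in zip(base, mean)]


def build_models_per_attribute(
    base: Model,
    tagged_samples: List[Tuple[str, Sequence[float]]],
) -> Dict[str, Model]:
    """Group unrecognizable-voice features by attribute and adapt once per group."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for attribute, features in tagged_samples:
        grouped[attribute].append(features)
    return {attribute: adapt(base, feats) for attribute, feats in grouped.items()}


base_model: Model = [0.0, 0.0]
samples = [
    ("70s-male", [2.0, 4.0]),
    ("70s-male", [4.0, 0.0]),
    ("under10-female", [1.0, 1.0]),
]
models = build_models_per_attribute(base_model, samples)
```

With the first-embodiment database, which carries no attributes, the same `adapt` call would simply be made once over all samples, yielding a single new model.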

次いで、図８（第２実施形態）におけるのと同様にユーザ（他の発話者）の情報が情報端末１０から入力されると、プロセッサ２１は、音響モデル選択処理（ステップＳＪ９）へ移行する。ここで、プロセッサ２１は、記憶資源２３に記憶された複数の新たな音響モデルのなかから、ユーザの属性（年齢、性別等）に応じた新たな音響モデルを指定し、それを音声認識に用いる準備を行う。 Next, when the user (other speaker) information is input from the information terminal 10 as in FIG. 8 (second embodiment), the processor 21 proceeds to acoustic model selection processing (step SJ9). Here, the processor 21 designates a new acoustic model according to the user's attributes (age, sex, etc.) from among a plurality of new acoustic models stored in the storage resource 23, and uses it for speech recognition. Make preparations.
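
The model selection of step SJ9 can be sketched as a lookup keyed on the user's attributes. The age bucketing (`age_group`) and the fallback to the conventional model when no adapted model exists for the user's attributes are assumptions of this sketch; the description only says that a model corresponding to the user's attributes is designated.

```python
from typing import Dict, Tuple


def age_group(age: int) -> str:
    """Map an age to a generation label such as "70s" (hypothetical encoding)."""
    if age < 10:
        return "under10"
    return f"{(age // 10) * 10}s"


def select_acoustic_model(
    age: int,
    gender: str,
    adapted_models: Dict[Tuple[str, str], str],
    base_model: str,
) -> str:
    """Step SJ9: pick the adapted model for the user's attributes.

    Falls back to the conventional model when no adapted model is stored
    (an assumption; the source does not describe this case).
    """
    return adapted_models.get((age_group(age), gender), base_model)


adapted = {
    ("70s", "male"): "model-70s-male",
    ("under10", "female"): "model-under10-female",
}
elder_model = select_acoustic_model(73, "male", adapted, "base-model")
child_model = select_acoustic_model(8, "female", adapted, "base-model")
default_model = select_acoustic_model(35, "female", adapted, "base-model")
```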

そして、第３実施形態におけるステップＳＪ９以降の処理（ステップＳＪ１，ＳＪ２，ＳＪ３）については、従来の音響モデルに代えて、新たに作成され且つ属性に応じて選択された音響モデルを用いること以外は、図６（第１実施形態）及び図９（第２実施形態）における一連の処理と実質的に同一であるため、ここでの説明を省略する。 The processing after step SJ9 in the third embodiment (steps SJ1, SJ2, SJ3) is substantially the same as the series of processes in FIG. 6 (first embodiment) and FIG. 9 (second embodiment), except that the newly created acoustic model selected according to the attributes is used instead of the conventional acoustic model; its description is therefore omitted here.

このように構成された音声翻訳システム１００及びその制御方法並びに音声翻訳プログラムによれば、ユーザ（発話者）が発話した音声のうち、当初は認識できなかった音声（認識不可音声）を予め収集し、また、それらの正しい認識内容を適宜取得することにより、両者のデータを含む音声データベースが構築され、記憶資源２３に記憶される。 According to the speech translation system 100 configured as described above, the control method thereof, and the speech translation program, speech (unrecognizable speech) that was not initially recognized among speech uttered by the user (speaker) is collected in advance. In addition, by appropriately acquiring the correct recognition contents, a voice database including both data is constructed and stored in the storage resource 23.

そして、ユーザ（他の発話者）が発話した音声を、一旦音声認識後、或いは、音声認識に先立って、その音声データベースに記憶された認識不可音声と照合することにより、通常の処理では認識できない音声の正しい認識内容を簡易に得ることができる。したがって、音声認識自体の高精度化及び高速処理化に起因する装置コストの増大、各種処理の煩雑化や翻訳精度の低下、及び、ユーザの負担の増大や利便性の低下を招くことなく、音声認識の精度ひいては音声翻訳の精度を簡易に且つ効率的に向上させることが可能となる。 Then, by collating the speech uttered by the user (another speaker) with the unrecognizable voice stored in the voice database, either after speech recognition has been attempted once or before speech recognition, the correct recognition content of speech that cannot be recognized by normal processing can be obtained easily. Therefore, the accuracy of speech recognition, and hence of speech translation, can be improved simply and efficiently without incurring the increase in device cost that comes from making speech recognition itself more accurate and faster, the complication of various processes or a decrease in translation accuracy, or an increase in the user's burden or a decrease in convenience.

また、前述のとおり、ユーザが高齢者や子供である場合に、その発話した音声の認識率が低い傾向にあるところ、音声翻訳システム１００では、認識不可音声とそれらの正しい認識内容を、ユーザの属性（年齢、世代、性別等）に関連付けて音声データベースとして記憶資源２３に記憶することができる。よって、ユーザの属性に応じて、対応する認識不可音声を絞り込むことにより、かかるユーザが発話した音声の音声認識率ひいては音声翻訳率を更に高めることができる。 Further, as described above, the recognition rate tends to be low for speech uttered by elderly users or children; the speech translation system 100 can store the unrecognizable voices and their correct recognition contents in the storage resource 23 as a voice database in association with user attributes (age, generation, gender, etc.). Therefore, by narrowing down the corresponding unrecognizable voices according to the user's attributes, the speech recognition rate, and hence the speech translation rate, for speech uttered by such users can be further increased.

さらに、音声翻訳システム１００によれば、記憶資源２３に記憶された認識不可音声を用い、従来の音響モデルに対する適応処理を実施して新たな音響モデルを生成し、そのようにしていわば改良された新たな音響モデルを用いて音声認識を行うことができる。このようにしても、音声認識自体の高精度化及び高速処理化に起因する装置コストの増大、各種処理の煩雑化や翻訳精度の低下、及び、ユーザの負担の増大や利便性の低下を招くことなく、音声認識の精度ひいては音声翻訳の精度を簡易に且つ効率的に向上させることが可能となる。 Furthermore, according to the speech translation system 100, the unrecognizable voice stored in the storage resource 23 can be used to perform adaptation processing on the conventional acoustic model and generate a new acoustic model, and speech recognition can then be performed using this new, so to speak improved, acoustic model. In this way too, the accuracy of speech recognition, and hence of speech translation, can be improved simply and efficiently without incurring the increase in device cost that comes from making speech recognition itself more accurate and faster, the complication of various processes or a decrease in translation accuracy, or an increase in the user's burden or a decrease in convenience.

またこの場合、ユーザ（発話者）の属性、例えば、年齢若しくは年齢の範囲（世代）又は性別毎に新たな音響モデルを生成し、かかるユーザの属性に対応した新たな音響モデルを用いて、音声認識を行うこともできる。このようにしてユーザの属性に応じた音響モデルを用いた音声認識が可能となるので、ユーザが発話した音声の音声認識率ひいては音声翻訳率を更に一層高めることができる。 In this case, a new acoustic model is generated for each attribute of the user (speaker), for example, age or age range (generation) or gender, and the new acoustic model corresponding to the attribute of the user is used. Recognition can also be performed. In this way, since speech recognition using an acoustic model corresponding to the user's attribute is possible, the speech recognition rate of the speech uttered by the user, and hence the speech translation rate, can be further increased.

なお、上述したとおり、上記の各実施形態は、本発明を説明するための一例であり、本発明をその実施形態に限定する趣旨ではない。また、本発明は、その要旨を逸脱しない限り、様々な変形が可能である。例えば、当業者であれば、実施形態で述べたリソース（ハードウェア資源又はソフトウェア資源）を均等物に置換することが可能であり、そのような置換も本発明の範囲に含まれる。 Note that, as described above, each of the above embodiments is an example for explaining the present invention, and is not intended to limit the present invention to the embodiment. The present invention can be variously modified without departing from the gist thereof. For example, those skilled in the art can replace the resources (hardware resources or software resources) described in the embodiments with equivalents, and such replacements are also included in the scope of the present invention.

また、上記各実施形態では、音声認識、翻訳、及び音声合成の各処理をサーバ２０によって実行する例について記載したが、これらの処理を情報端末１０において実行するように構成してもよい。この場合、それらの処理に用いるモジュールＬ２０は、情報端末１０の記憶資源１２に保存されていてもよいし、サーバ２０の記憶資源２３に保存されていてもよい。さらに、音声データベースのデータベースＤ２０、及び／又は、音響モデル等のモデルＭ２０も、情報端末１０の記憶資源１２に保存されていてもよいし、サーバ２０の記憶資源２３に保存されていてもよい。このとおり、音声翻訳システムは、ネットワークＮ及びサーバ２０を備えなくてもよい。 In each of the above embodiments, an example was described in which the speech recognition, translation, and speech synthesis processes are executed by the server 20, but these processes may instead be executed on the information terminal 10. In this case, the module L20 used for these processes may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server 20. Furthermore, the database D20 of the voice database and/or the model M20 such as an acoustic model may likewise be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server 20. Thus, the speech translation system need not include the network N and the server 20.

さらに、図５若しくは図６に示す処理（第１実施形態）又は図８若しくは図９に示す処理（第２実施形態）を、図１０に示す処理（第３実施形態）と組み合わせてもよい。すなわち、第３実施形態として説明した新たな音響モデルの生成及び使用とともに、その前に、又は、その後に、第１又は第２実施形態における音声照会処理を実行するようにしてもよい。またさらに、音声認識が可能であったものの、多言語翻訳において適切な翻訳ができなかった内容の元の音声も一種の「認識不可音声」として、或いは「翻訳不可音声」として音声データベースに記憶・蓄積してもよい。さらにまた、第３実施形態においては、ユーザの属性毎に新たな音響モデルを生成せずに、ユーザの属性に依存しない新たな音響モデルを生成するようにしてもよい。 Furthermore, the processing shown in FIG. 5 or FIG. 6 (first embodiment) or the processing shown in FIG. 8 or FIG. 9 (second embodiment) may be combined with the processing shown in FIG. 10 (third embodiment). That is, the voice inquiry process of the first or second embodiment may be executed together with, before, or after the generation and use of the new acoustic model described as the third embodiment. In addition, the original speech of content that could be recognized but could not be properly translated in the multilingual translation may also be stored and accumulated in the voice database as a kind of “unrecognizable voice”, or as “untranslatable voice”. Moreover, in the third embodiment, a new acoustic model that does not depend on user attributes may be generated instead of generating a new acoustic model for each user attribute.

A gateway server or the like that converts the communication protocol between the information terminal 10 and the network N may of course be interposed between them. The information terminal 10 is not limited to a portable device and may be, for example, a desktop personal computer, a notebook personal computer, a tablet personal computer, or a laptop personal computer.

According to the present invention, the correct recognition content of speech that cannot be recognized by normal processing can be obtained easily. The invention can therefore be widely used in activities such as the design, manufacture, provision, and sale of programs, systems, and methods, for example in the field of providing services related to conversation between people who do not understand each other's languages.

10 Information terminal (speech translation system)
11 Processor
12 Storage resource
13 Voice input/output device
14 Communication interface
15 Input device
16 Display device
17 Camera
20 Server (speech translation system)
21 Processor
22 Communication interface
23 Storage resource
100 Speech translation system
D20 Database
L20 Module
M20 Model
N Network
P10, P20 Program

Claims

A speech translation system comprising:
a voice input unit for inputting the voice of a speaker;
a speech recognition unit for recognizing the content of the voice input to the voice input unit;
a translation unit for translating the content recognized by the speech recognition unit into content in a different language;
a speech synthesis unit for synthesizing speech of the content translated by the translation unit;
a voice output unit for outputting the speech synthesized by the speech synthesis unit; and
a storage unit for storing unrecognizable speech that could not be recognized by the speech recognition unit,
wherein the speech recognition unit executes the following process (1) or (2):
(1) having a user different from the speaker listen to the unrecognizable speech stored in the storage unit, and receiving the correct recognition content of the unrecognizable speech from the user;
(2) requesting a user different from the speaker to input the correct recognition content of the unrecognizable speech stored in the storage unit, and receiving the correct recognition content of the unrecognizable speech from the user,
the storage unit stores the unrecognizable speech and the correct recognition content of the unrecognizable speech as a speech database, and
the speech recognition unit executes a voice inquiry process that refers to the speech database, collates the voice of another speaker input to the voice input unit with the unrecognizable speech stored in the speech database, and, based on the collation result, provides the correct recognition content of that voice to the translation unit.
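As an illustration of the voice inquiry process recited above, the following Python sketch stores unrecognizable speech together with its correct recognition content and collates new speech against it. The claim does not specify how collation is performed; representing speech as a fixed-length feature vector, comparing vectors by cosine similarity, and the threshold value are assumptions made only for this sketch, and the names `SpeechDatabase`, `register`, and `inquire` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Entry:
    features: Sequence[float]  # assumed feature vector of the unrecognizable speech
    correct_text: str          # correct recognition content received from another user


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SpeechDatabase:
    def __init__(self, threshold: float = 0.9) -> None:
        self.entries: List[Entry] = []
        self.threshold = threshold  # assumed acceptance threshold for a match

    def register(self, features: Sequence[float], correct_text: str) -> None:
        self.entries.append(Entry(features, correct_text))

    def inquire(self, features: Sequence[float]) -> Optional[str]:
        """Return the stored correct content of the closest entry, or None."""
        best = max(self.entries, key=lambda e: cosine(e.features, features), default=None)
        if best is not None and cosine(best.features, features) >= self.threshold:
            return best.correct_text
        return None
```

A non-`None` result from `inquire` is what would be handed to the translation unit in place of the failed recognition result.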

The speech translation system according to claim 1, wherein the speech recognition unit executes the voice inquiry process when the voice of the other speaker cannot be recognized, or before the voice of the other speaker is recognized.

The speech translation system according to claim 1 or 2, further comprising an information acquisition unit for acquiring information on attributes of the speaker and the other speaker,
wherein the storage unit stores the unrecognizable speech and the correct recognition content as the speech database in association with the attribute of the speaker, and
the speech recognition unit collates the attribute of the other speaker with the attributes stored in the speech database and executes the voice inquiry process based on the collation result.

The speech translation system according to claim 3, wherein the attribute is the age, age range, or gender of the speaker.
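The attribute collation in the two claims above can be sketched in Python as a filter applied before the voice inquiry process. This is only an illustration: the names `Attributes`, `Entry`, `attributes_match`, and `candidates_for` are hypothetical, and the rule that an unknown attribute matches anything is an assumption of this sketch, not something stated in the claims.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Attributes:
    age_range: Optional[str] = None  # e.g. "20s"; None means unknown
    gender: Optional[str] = None     # None means unknown


@dataclass
class Entry:
    attributes: Attributes  # attributes of the speaker of the stored speech
    correct_text: str       # correct recognition content for that speech


def attributes_match(stored: Attributes, other: Attributes) -> bool:
    # Assumption: a field matches when either side is unknown or both are equal.
    for name in ("age_range", "gender"):
        a, b = getattr(stored, name), getattr(other, name)
        if a is not None and b is not None and a != b:
            return False
    return True


def candidates_for(entries: List[Entry], other: Attributes) -> List[Entry]:
    """Keep only the entries whose speaker attributes match the other speaker."""
    return [e for e in entries if attributes_match(e.attributes, other)]
```

Restricting the collation of speech to the returned candidates is one way the attribute collation result could drive the voice inquiry process.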

The speech translation system according to any one of claims 1 to 4, further comprising an acoustic model generation unit for generating a second acoustic model by applying, to a first acoustic model used by the speech recognition unit to recognize input speech, adaptation processing that uses the unrecognizable speech and the correct recognition content of the unrecognizable speech.

The speech translation system according to claim 5, further comprising an information acquisition unit for acquiring information on the attribute of the speaker,
wherein the acoustic model generation unit generates the second acoustic model for each attribute of the speaker, and
the speech recognition unit recognizes the content of the input voice using the second acoustic model corresponding to the attribute of the other speaker.
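The selection of a second acoustic model by attribute can be sketched in Python as a simple registry. Models are represented here by plain strings, the names `ModelRegistry`, `add_adapted`, and `select` are hypothetical, and falling back to the first acoustic model when no adapted model exists for an attribute is an assumption of this sketch rather than a requirement of the claim.

```python
from typing import Dict, Hashable


class ModelRegistry:
    def __init__(self, first_model: str) -> None:
        self.first_model = first_model                 # first (unadapted) acoustic model
        self.second_models: Dict[Hashable, str] = {}   # adapted models keyed by attribute

    def add_adapted(self, attribute: Hashable, model: str) -> None:
        """Register a second acoustic model generated for one attribute."""
        self.second_models[attribute] = model

    def select(self, attribute: Hashable) -> str:
        # Assumption: use the first model when no adapted model is registered.
        return self.second_models.get(attribute, self.first_model)
```

For example, after `add_adapted(("60s", "male"), "adapted-60s-male")`, a call to `select(("60s", "male"))` returns the adapted model, while any other attribute returns the first model.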

A method for controlling a speech translation system including a voice input unit, a speech recognition unit, a translation unit, a speech synthesis unit, a voice output unit, and a storage unit, the method comprising the steps of:
inputting, by the voice input unit, the voice of a speaker;
recognizing, by the speech recognition unit, the content of the voice input to the voice input unit;
translating, by the translation unit, the content recognized by the speech recognition unit into content in a different language;
synthesizing, by the speech synthesis unit, speech of the content translated by the translation unit;
outputting, by the voice output unit, the speech synthesized by the speech synthesis unit; and
storing, by the storage unit, unrecognizable speech that could not be recognized by the speech recognition unit,
wherein, in the step of recognizing the content of the voice, the following process (1) or (2) is executed:
(1) having a user different from the speaker listen to the unrecognizable speech stored in the storage unit, and receiving the correct recognition content of the unrecognizable speech from the user;
(2) requesting a user different from the speaker to input the correct recognition content of the unrecognizable speech stored in the storage unit, and receiving the correct recognition content of the unrecognizable speech from the user,
in the storing step, the unrecognizable speech and the correct recognition content of the unrecognizable speech are stored as a speech database, and
in the step of recognizing the content of the voice, a voice inquiry process is further executed that refers to the speech database, collates the voice of another speaker input to the voice input unit with the unrecognizable speech stored in the speech database, and, based on the collation result, provides the correct recognition content of that voice to the translation unit.

A speech translation program causing a computer to function as:
a voice input unit for inputting the voice of a speaker;
a speech recognition unit for recognizing the content of the voice input to the voice input unit;
a translation unit for translating the content recognized by the speech recognition unit into content in a different language;
a speech synthesis unit for synthesizing speech of the content translated by the translation unit;
a voice output unit for outputting the speech synthesized by the speech synthesis unit; and
a storage unit for storing unrecognizable speech that could not be recognized by the speech recognition unit,
the program causing the speech recognition unit to execute the following process (1) or (2):
(1) having a user different from the speaker listen to the unrecognizable speech stored in the storage unit, and receiving the correct recognition content of the unrecognizable speech from the user;
(2) requesting a user different from the speaker to input the correct recognition content of the unrecognizable speech stored in the storage unit, and receiving the correct recognition content of the unrecognizable speech from the user,
causing the storage unit to store the unrecognizable speech and the correct recognition content of the unrecognizable speech as a speech database, and
causing the speech recognition unit to execute a voice inquiry process that refers to the speech database, collates the voice of another speaker input to the voice input unit with the unrecognizable speech stored in the speech database, and, based on the collation result, provides the correct recognition content of that voice to the translation unit.