JP2009152650A

JP2009152650A - Telephone device and speech communication translating method

Info

Publication number: JP2009152650A
Application number: JP2007326082A
Authority: JP
Inventors: Yoshiaki Tanaka; 義明田中
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-12-18
Filing date: 2007-12-18
Publication date: 2009-07-09

Abstract

PROBLEM TO BE SOLVED: To improve precision of data to be translated by a telephone device having a speech communication translating function.

SOLUTION: A telephone device includes a photography unit (101) which photographs a mouth expression of a speaker, a speech recognition unit (201) which converts speech data of the speaker into character data, an image analysis unit (205) which detects feature information associated with the mouth expression from image data, a DB (207) which stores relations between feature patterns associated with a plurality of mouth expressions and character data corresponding to the feature patterns, a retrieval unit (206) which retrieves a character data candidate corresponding to the feature information detected by the image analysis unit from the DB, a correction unit (202) which corrects speech recognition data according to a retrieval result, a translation unit (203) which translates the character data from the correction unit into character data of the language of a partner speaker, and a transmission unit (204) which generates transmission data to be sent out to a line using the translation data.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a telephone device, and more particularly to a telephone device having a function of translating a call between different languages.

Telephone devices are provided with various functions for the convenience of their users. One of them automatically translates the speaker's voice and transmits the result. This function simplifies calls between users who speak different languages, such as Japanese and English.

To translate a call automatically, the telephone device first converts the voice data input from the microphone of the transmitter section into character data. It then applies translation processing corresponding to the language of the opposite speaker to this character data and sends the translated data to the line.

Mobile phones having such a translation function are described in, for example, Patent Documents 1 and 2 listed below. The mobile phone described in Patent Document 1 translates the voice input from its microphone and the voice from the call partner into mutually different languages and outputs them. Patent Document 2 describes a technique that, when translating a call, splits the speech into words and translates them sequentially so as not to disturb the rhythm of the conversation.

Patent Document 1: JP 2004-094721 A
Patent Document 2: JP 2004-350298 A

In the techniques described in the above patent documents, the data transmitted from the mobile phone is a translation of the speech recognition result for the speaker's utterance. However, depending on the content of the utterance and the speaker's voice quality, speech recognition may produce errors. If the speech recognition result contains an error, translating that data causes content different from what was actually said to be sent.

The present invention has been made in view of the above problem, and its object is to provide a technique for improving the accuracy of the data to be translated in a telephone device having a call translation function.

A telephone device according to the present invention comprises: an imaging unit that photographs a mouth expression of a speaker; a speech recognition unit that converts voice data of the speaker's utterance into character data; an image analysis unit that detects feature information about the mouth expression from image data of the mouth expression; a database that stores associations between feature patterns of a plurality of mouth expressions and character data corresponding to each feature pattern; a search unit that searches the database for character data candidates corresponding to the feature information detected by the image analysis unit; a correction unit that corrects the character data from the speech recognition unit according to a search result of the search unit; a translation unit that translates the character data from the data correction unit into character data in the language of the opposite speaker; and a transmission unit that uses the character data from the translation unit to generate transmission data to be sent to a line.

A call translation method according to the present invention comprises: photographing a mouth expression of a speaker who uses a telephone device; converting voice data of the speaker's utterance into character data; detecting feature information about the mouth expression from image data of the mouth expression; searching a database, which stores associations between feature patterns of a plurality of mouth expressions and character data corresponding to each feature pattern, for character data candidates corresponding to the detected feature information; correcting the character data converted from the voice data according to the result of the search; translating the corrected character data into character data in the language of the opposite speaker; and using the translated character data to generate transmission data to be sent to a line.

According to the present invention, the call data to be translated can be made more accurate. This prevents content that differs from what the speaker actually said from being conveyed to the opposite speaker, enabling smooth calls between speakers.

FIG. 1 shows the configuration of an embodiment of the present invention. In this embodiment, the telephone device according to the present invention is applied to a mobile phone 100. The mobile phone 100 includes a camera 101, a microphone 102, a speaker 103, an operation unit 104, a control unit 105, a radio unit 106, and an antenna 107. The operation unit 104 is provided with an LCD 104a that displays screens, and with a numeric keypad 104b and a switch 104c for operation.

The camera 101 photographs the speaker's mouth expression and outputs image data of the mouth expression. The microphone 102 captures the speaker's utterance and inputs voice data of the utterance. The speaker 103 outputs the voice of the opposite speaker. The radio unit 106 transmits and receives the radio signals of the call using the antenna 107. The control unit 105 controls the overall operation of the mobile phone 100.

FIG. 2 shows the functional configuration of the control unit 105. The speech recognition unit 201 converts the voice data input from the microphone 102 into character data by speech recognition processing. The image analysis unit 205 analyzes the image data of the speaker's mouth expression input from the camera 101 and detects feature information of the mouth expression. The images that the camera 101 inputs to the image analysis unit 205 capture the movement of the speaker's mouth through continuous shooting. The shooting range is not limited to the area around the speaker's mouth and may be the entire face. In that case, the image analysis unit 205 identifies the image portion around the mouth in the face image and detects the features of the mouth expression from that portion. Information representing the relative positions and amounts of change of the continuously photographed mouth expressions is used as the feature information.

A database (hereinafter referred to as "DB") 207 holds registered associations between feature patterns of arbitrary mouth expressions and the character data corresponding to the mouth expression that each feature pattern represents. At registration, for example, one word or idiomatic phrase is registered as character data in association with one type of mouth-expression feature. Any registration method may be used, such as transferring the data from an external terminal to the mobile phone 100 or downloading it from an Internet site.

The DB search unit 206 searches the character data in the DB 207 using the mouth-expression feature information supplied from the image analysis unit 205 as a key. That is, it searches the DB 207 for the feature pattern that most closely approximates the feature information from the image analysis unit 205 and retrieves the character data registered as a pair with that feature pattern.
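The lookup performed by the DB search unit 206 can be pictured as a nearest-pattern search. The following is a minimal sketch only: the specification does not define how feature patterns are encoded or compared, so the numeric vectors, the Euclidean distance, and the names `FEATURE_DB` and `search_db` are all assumptions made for illustration.

```python
import math

# Hypothetical contents of DB 207: each entry pairs a feature pattern
# (an illustrative vector of mouth position/change values) with a word.
FEATURE_DB = [
    ((0.2, 0.8, 0.1), "fifteen"),
    ((0.9, 0.1, 0.4), "hello"),
    ((0.5, 0.5, 0.9), "morning"),
]

def search_db(feature, db=FEATURE_DB):
    """Return the word paired with the pattern closest to `feature`.

    Returns None when the database is empty (no candidate found).
    Euclidean distance is an assumed similarity measure.
    """
    if not db:
        return None
    _pattern, word = min(db, key=lambda entry: math.dist(entry[0], feature))
    return word

# The detected feature information is used as the search key.
candidate = search_db((0.25, 0.75, 0.1))
```

A real device would also need a rule for deciding that no pattern is close enough, which this sketch leaves out.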

The data correction unit 202 corrects the character data that the speech recognition unit 201 converted from voice, according to the search result of the DB search unit 206. For this correction, the data correction unit 202 compares the character data from the speech recognition unit 201 with the character data candidate from the DB search unit 206, that is, the character data based on the mouth-expression images. If the two differ, it replaces the character data from the speech recognition unit 201 with the image-based character data.

The translation unit 203 converts the character data supplied from the data correction unit 202 into character data in the language of the opposite speaker. For example, when the speaker using the mobile phone 100 speaks English and the opposite speaker speaks Japanese, the English character data from the data correction unit 202 is translated into Japanese. The user of the mobile phone 100 sets the source and target languages in advance, for example at the start of a call. Any technique known in the technical field of the present invention can be applied to the translation processing and to the acquisition of translation dictionaries.

The transmission unit 204 prepares, from the character data translated by the translation unit 203, the transmission data to be supplied to the radio unit 106. The transmission data may be either voice data converted from the translated character data or the translated character data itself. Which data format is used depends on the user's prior setting.

The operation of this embodiment will be described with reference to the flowchart shown in FIG. 3 together with FIGS. 1 and 2. When the user starts a call on the mobile phone 100, the microphone 102 inputs the speaker's voice data and the camera 101 inputs image data of the mouth expression (step S1). At this point, for the processing in the data correction unit 202 at a later stage, it is desirable, for example, to attach to each pair of corresponding voice data and image data information for synchronizing the two. The synchronization information can be, for example, common time information.
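One way to picture the synchronization information is as a shared timestamp attached to both streams. The sketch below is illustrative only: the function name `pair_by_timestamp` and the `(timestamp, text)` tuple layout are assumptions, and the text on each side stands for the character data later derived from the voice data and the image data.

```python
def pair_by_timestamp(speech_items, image_items):
    """Pair speech-based and image-based items that share a timestamp.

    Both arguments are lists of (timestamp, text) tuples. The result is a
    list of (timestamp, speech_text, image_text) tuples, where image_text
    is None when no image-based item carries the same timestamp.
    """
    image_by_time = dict(image_items)
    return [(t, text, image_by_time.get(t)) for t, text in speech_items]

speech = [(0, "It's"), (1, "ten"), (2, "fifty")]
image = [(2, "fifteen")]
pairs = pair_by_timestamp(speech, image)
```

Exact timestamp equality keeps the sketch short; a practical implementation would more likely match items within a time tolerance.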

The speech recognition unit 201 converts the voice data from the microphone 102 into character data (step S2). When the image analysis unit 205 recognizes image data continuously shot by the camera 101, it detects the feature information of the mouth expression from that data (step S3). The DB search unit 206 uses the feature information from the image analysis unit 205 as a search key and searches the DB 207 for the feature pattern that most closely approximates it. It then retrieves the character data candidate paired with that feature pattern (step S4).

When a character data candidate is obtained from the DB 207 (step S5: Yes), the data correction unit 202 compares the candidate with the character data from the speech recognition unit 201 whose synchronization information corresponds to it (step S6). If the comparison shows that the two do not match (step S7: No), the data correction unit 202 corrects the character data from the speech recognition unit 201 using the character data candidate from the DB search unit 206 (step S8). That is, the speech-based character data is replaced with the image-based character data.

When no candidate is found in the DB 207 (step S5: No), or when the character data from the speech recognition unit 201 matches the search candidate from the DB 207 (step S7: Yes), the data correction unit 202 supplies the character data from the speech recognition unit 201 to the translation unit 203 without correction (step S9).
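The decision made in steps S5 to S9 can be summarized in a few lines. This is a sketch under the assumption that the character data are plain strings and that a missing candidate is represented by `None`; the function name `correct` is hypothetical.

```python
def correct(recognized, candidate):
    """Return the character data to pass on to the translation unit.

    recognized: character data from speech recognition
    candidate:  image-based character data candidate, or None if the
                database search found nothing
    """
    if candidate is None:        # step S5: No
        return recognized        # step S9: pass through uncorrected
    if candidate == recognized:  # step S7: Yes
        return recognized        # step S9: pass through uncorrected
    return candidate             # step S8: replace with image-based data
```

For example, `correct("fifty", "fifteen")` yields the image-based word, while `correct("fifty", None)` leaves the speech recognition result unchanged.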

The translation unit 203 translates the character data that has passed through the data correction unit 202 into character data in the language of the opposite speaker (step S10). The transmission unit 204 checks the prior setting for the data to be transmitted to the opposite speaker. When voice is to be transmitted (step S11: voice), it converts the character data from the translation unit 203 into voice data and supplies the voice data to the radio unit 106. When characters are to be transmitted to the opposite speaker (step S11: characters), it supplies the character data from the translation unit 203 to the radio unit 106 as is.
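The branch at step S11 amounts to choosing an output format from a stored setting. The sketch below assumes the setting is one of the strings "voice" or "text" and that speech synthesis is supplied as a callable; `make_transmission_data` and `fake_synthesize` are hypothetical names, and the stand-in synthesizer merely encodes the text as bytes.

```python
def make_transmission_data(translated_text, mode, synthesize):
    """Build the data handed to the radio unit, based on the user setting."""
    if mode == "voice":
        # Convert the translated character data into voice data.
        return synthesize(translated_text)
    if mode == "text":
        # Send the translated character data as is.
        return translated_text
    raise ValueError(f"unknown transmission mode: {mode!r}")

def fake_synthesize(text):
    """Stand-in for a text-to-speech engine (assumption for illustration)."""
    return text.encode("utf-8")

as_text = make_transmission_data("hello", "text", fake_synthesize)
as_voice = make_transmission_data("hello", "voice", fake_synthesize)
```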

The radio unit 106 sends out the voice data or character data serving as transmission data through the antenna 107 (step S13). In this way, what the speaker said into the mobile phone 100 is conveyed to the opposite speaker after being corrected as necessary and translated.

A specific example of the above embodiment will be described with reference to FIG. 4. This example assumes that the speaker using the mobile phone 100 speaks English and the opposite speaker speaks Japanese. Suppose that, for the speaker's utterance (in English), “It’s ten fifty.” is obtained as the speech-based character data string D1, while “It’s ten fifteen.” is obtained as the image-based character data string D2.

In this case, the data correction unit 202 recognizes that the character data “fifty” (d1) in D1 and the character data “fifteen” (d2) in D2 do not match, and replaces the speech-based character data d1 with the image-based character data d2. This correction yields the character data string D3, “It’s ten fifteen.” The translation unit 203 translates the character data string D3 from English into Japanese and supplies the resulting character data D4, 「１０時１５分です。」 ("It is 10:15."), to the transmission unit 204.
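The example of FIG. 4 can be traced with a short sketch. It assumes that D1 and D2 have already been split into words and aligned position by position, which the specification does not spell out; the function name `correct_sequence` is hypothetical.

```python
def correct_sequence(speech_words, image_words):
    """Replace each speech-based word with the image-based word when they differ.

    An image-based entry of None means no candidate exists for that
    position, so the speech-based word is kept.
    """
    corrected = []
    for spoken, seen in zip(speech_words, image_words):
        if seen is not None and seen != spoken:
            corrected.append(seen)    # d1 is replaced by d2
        else:
            corrected.append(spoken)  # no correction needed
    return corrected

D1 = ["It's", "ten", "fifty."]    # speech-based character data string
D2 = ["It's", "ten", "fifteen."]  # image-based character data string
D3 = correct_sequence(D1, D2)
print(" ".join(D3))  # prints: It's ten fifteen.
```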

According to this embodiment, the character data supplied to the translation unit 203 is corrected by the data correction unit 202, so the character data to be translated can be made more accurate. This prevents content that differs from what the speaker actually said from being conveyed to the opposite speaker, enabling smooth calls between speakers.

(Other embodiments)
In the above embodiment, a single candidate is retrieved when searching the DB 207. Alternatively, a plurality of candidates that approximate the search key within a predetermined range may be retrieved. The operation in this case will be described with reference to the flowchart of FIG. 5. The description here focuses on the differences from the operation of the above embodiment (FIG. 3).

As in the above embodiment, the DB search unit 206 searches the DB 207 using the feature information of the mouth expression (step S4). If there is no search candidate (step S21: No), the character data from the speech recognition unit 201 is not corrected (FIG. 3: step S9). If one search candidate is found in the DB 207 (step S21: Yes, step S22: No), the process moves, as in the above embodiment, to the step of comparing that candidate with the character data from the speech recognition unit 201 (FIG. 3: step S6).

If, on the other hand, a plurality of search candidates are found in the DB 207, the DB search unit 206 records, for each candidate, its accuracy with respect to the search key (step S23). This accuracy indicates the degree of approximation to the search key, that is, to the feature information from the image analysis unit 205. The numerical value of the accuracy can be obtained by comparing feature quantities of the mouth expression, such as relative position and amount of change, between the search key and the feature pattern.
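The specification does not give a formula for the accuracy, so the following is only one possible sketch. It assumes feature information is a numeric vector and maps Euclidean distance into the range (0, 1] with 1 / (1 + distance); the names `accuracy` and `search_candidates` and the `min_accuracy` cutoff, which plays the role of the "predetermined range", are hypothetical.

```python
import math

def accuracy(key, pattern):
    """Assumed similarity score: 1.0 for identical vectors, falling toward 0."""
    return 1.0 / (1.0 + math.dist(key, pattern))

def search_candidates(key, db, min_accuracy=0.5):
    """Return (word, accuracy) pairs within the allowed range, best first.

    db is a list of (feature_pattern, word) entries.
    """
    scored = [(word, accuracy(key, pattern)) for pattern, word in db]
    kept = [(word, score) for word, score in scored if score >= min_accuracy]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Illustrative database entries and search key (values are made up).
db = [((0.0, 0.0), "fifteen"), ((0.3, 0.4), "fifty"), ((3.0, 4.0), "hello")]
candidates = search_candidates((0.0, 0.0), db)
```

With these made-up values, "fifteen" and "fifty" are kept and "hello" falls outside the range.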

The data correction unit 202 compares each of the plurality of search candidates with the character data from the speech recognition unit 201. If any candidate matches the speech recognition data (step S24: Yes), the data correction unit 202 supplies the speech recognition data to the translation unit 203 for translation as is (FIG. 3: step S9).

If none of the plurality of search candidates matches the speech recognition data (step S24: No), the data correction unit 202 refers to the previously recorded accuracy of each candidate and selects the candidate with the highest accuracy. It then corrects the speech recognition data using the selected candidate (step S25) and supplies the corrected character data to the translation unit 203 (FIG. 3: step S10).

FIG. 6 shows a specific example of the above operation. In this example, suppose that three candidates, “fifteen”, “fifty”, and “fifteenth”, are retrieved from the DB 207 with accuracies of 90%, 70%, and 40%, respectively. If none of these candidates matches the speech recognition data, the data correction unit 202 selects “fifteen”, which has the highest accuracy of 90%, for the correction.

If, on the other hand, the candidate “fifty” matches the speech recognition data, the speech recognition result (“fifty”) is translated without correction. As an image-based candidate, “fifty” has a lower accuracy than “fifteen”, but agreement with the speech recognition result takes priority.

Thus, according to the embodiment of FIG. 5, when a plurality of image-based character data candidates are obtained and any of them matches the speech recognition data, the content of the speech recognition data takes priority. When no candidate matches the speech recognition data, the speech recognition data is corrected using the candidate with the highest accuracy. This operation allows speech recognition and image analysis to be reflected adaptively in the character data to be translated. As a result, the accuracy of the character data to be translated is further improved.
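The selection rule of FIG. 5 can be sketched as follows, using the candidates and accuracies from the FIG. 6 example. The function name `correct_with_candidates` and the representation of candidates as `(word, accuracy)` tuples are assumptions for illustration.

```python
def correct_with_candidates(recognized, candidates):
    """Apply the multi-candidate correction rule.

    candidates: list of (word, accuracy) pairs from the database search.
    """
    if not candidates:                       # step S21: No
        return recognized
    if any(word == recognized for word, _ in candidates):
        return recognized                    # step S24: Yes, speech result wins
    best_word, _ = max(candidates, key=lambda item: item[1])
    return best_word                         # step S25: most accurate candidate

# Candidates and accuracies from the FIG. 6 example.
fig6 = [("fifteen", 0.9), ("fifty", 0.7), ("fifteenth", 0.4)]
kept = correct_with_candidates("fifty", fig6)      # matches a candidate
replaced = correct_with_candidates("sixty", fig6)  # "sixty" is a made-up mismatch
```

Here `kept` stays "fifty" because it matches a candidate, and `replaced` becomes "fifteen", the candidate with the highest accuracy.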

Note that even when one of the plurality of candidates matches the speech recognition data, the speech recognition result may be corrected using the candidate with the highest accuracy if the accuracy of the matching candidate falls below a specified value. This allows more accurate content to be translated.
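This variation adds a threshold to the matching rule. In the sketch below, the function name `correct_with_threshold` and the parameter `min_accuracy` (standing for the "specified value") are hypothetical, and candidates are again assumed to be `(word, accuracy)` tuples.

```python
def correct_with_threshold(recognized, candidates, min_accuracy):
    """Keep a matching speech result only if its candidate is accurate enough."""
    if not candidates:
        return recognized
    best_word, _ = max(candidates, key=lambda item: item[1])
    for word, score in candidates:
        if word == recognized:
            # A match exists: trust it only at or above the specified value.
            return recognized if score >= min_accuracy else best_word
    return best_word  # no candidate matches the speech recognition result

fig6 = [("fifteen", 0.9), ("fifty", 0.7), ("fifteenth", 0.4)]
strict = correct_with_threshold("fifty", fig6, min_accuracy=0.8)
lenient = correct_with_threshold("fifty", fig6, min_accuracy=0.5)
```

With the FIG. 6 values, a threshold of 0.8 overrides the matching "fifty" with "fifteen", while a threshold of 0.5 keeps "fifty". The threshold values themselves are chosen here only for illustration.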

The present invention is not limited to a mobile phone (100) having a wireless function as in the above embodiment and is also applicable to fixed telephones that use wired communication. It may also be a telephone device that, in addition to an ordinary call function, has a videophone function that lets the speakers see each other's face images during a call. In the case of a videophone, the captured face image can be used to detect the speaker's mouth expression.

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the control unit in the embodiment of the present invention.
FIG. 3 is a flowchart showing the operation of the embodiment of the present invention.
FIG. 4 is an explanatory diagram of a specific example of the embodiment of the present invention.
FIG. 5 is a flowchart showing the operation of another embodiment of the present invention.
FIG. 6 is an explanatory diagram of a specific example of the other embodiment of the present invention.

Explanation of symbols

100: mobile phone, 101: camera, 102: microphone, 103: speaker, 104: operation unit, 105: control unit, 106: radio unit, 107: antenna
201: speech recognition unit, 202: data correction unit, 203: translation unit, 204: transmission unit, 205: image analysis unit, 206: DB search unit, 207: DB

Claims

1. A telephone device comprising:
an imaging unit that photographs a mouth expression of a speaker;
a speech recognition unit that converts voice data of the speaker's utterance into character data;
an image analysis unit that detects feature information about the mouth expression from image data of the mouth expression;
a database that stores associations between feature patterns of a plurality of mouth expressions and character data corresponding to each feature pattern;
a search unit that searches the database for character data candidates corresponding to the feature information detected by the image analysis unit;
a correction unit that corrects the character data from the speech recognition unit according to a search result of the search unit;
a translation unit that translates the character data from the data correction unit into character data in the language of the opposite speaker; and
a transmission unit that generates transmission data to be sent to a line using the character data from the translation unit.

2. The telephone device according to claim 1, wherein the correction unit replaces the character data with the candidate when the candidate from the search unit and the character data from the speech recognition unit do not match.

3. The telephone device according to claim 1, wherein, when a plurality of candidates are detected from the database, the search unit records the accuracy of each candidate, and
the correction unit replaces the character data from the speech recognition unit with the candidate having the highest accuracy among the plurality of candidates when none of the plurality of candidates matches that character data, and supplies that character data to the translation unit when any one of the plurality of candidates matches it.

4. The telephone device according to claim 1, wherein the transmission unit generates voice data as the transmission data from the character data from the translation unit.

5. The telephone device according to any one of claims 1 to 3, wherein the transmission unit outputs the character data from the translation unit as the transmission data.

6. The telephone device according to claim 1, further comprising a wireless unit that transmits the transmission data to a line by wireless communication.

7. A call translation method comprising:
photographing a mouth expression of a speaker who uses a telephone device;
converting voice data of the speaker's utterance into character data;
detecting feature information about the mouth expression from image data of the mouth expression;
searching a database, which stores associations between feature patterns of a plurality of mouth expressions and character data corresponding to each feature pattern, for character data candidates corresponding to the detected feature information;
correcting the character data converted from the voice data according to the result of the search;
translating the corrected character data into character data in the language of the opposite speaker; and
generating transmission data to be sent to a line using the translated character data.

8. The call translation method according to claim 7, wherein, in the correction of the character data converted from the voice data, if the candidate obtained by the search and the character data converted from the voice data do not match, the character data is replaced with the candidate.

9. The call translation method according to claim 7 or 8, wherein, when a plurality of candidates are detected from the database as a result of the search, the accuracy of each candidate is recorded, and
in the correction of the character data converted from the voice data, if none of the plurality of candidates matches the character data converted from the voice data, the character data is replaced with the candidate having the highest accuracy among the plurality of candidates, and if any of the plurality of candidates matches the character data converted from the voice data, that character data is translated as the corrected character data.

10. The call translation method according to claim 7, wherein voice data is generated from the translated character data as the transmission data.

11. The call translation method according to claim 7, wherein the translated character data is output as the transmission data.

12. The call translation method according to claim 7, wherein the transmission data is transmitted to a line by wireless communication.