JP2015015623A

JP2015015623A - Television telephone set and program

Info

Publication number: JP2015015623A
Application number: JP2013141511A
Authority: JP
Inventors: 智之土谷; Tomoyuki Tsuchiya
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2013-07-05
Filing date: 2013-07-05
Publication date: 2015-01-22

Abstract

PROBLEM TO BE SOLVED: To provide a television telephone set that enables smooth communication without causing the called party to misunderstand, by translating gestures made by the calling party. SOLUTION: A television telephone set (1) includes: gesture dictionaries (31A-31D) in which reference feature data representing features of a gesture and translation information representing a translation of the gesture are associated with each other; a gesture analysis part (33) which generates photography-time feature data representing features of a gesture of a subject on the basis of an image of the subject, and selects, from the translation information of the gesture dictionary (31B), the translation information corresponding to the photography-time feature data as a translation result; and a translation result composition part (34) which generates a translation result composite image by combining the translation result with the image of the subject.

Description

本発明は、映像及び音声を用いて通話を行なうテレビ電話機、及び、プログラムに関する。 The present invention relates to a video phone for making a call using video and audio, and a program.

テレビ電話機を使用して発話者と受話者とで通話を行なう場合、発話者のテレビ電話機は、カメラにより撮影された発話者の画像とマイクにより取り込まれた発話者の音声とを含む通話データを、電話網を介して、受話者のテレビ電話機に送信する。これにより、受話者は、発話者の音声を聞きながら、発話者の画像を見ることができる。 When a speaker and a receiver hold a call using videophones, the speaker's videophone transmits call data, which includes the image of the speaker captured by the camera and the voice of the speaker captured by the microphone, to the receiver's videophone via the telephone network. As a result, the receiver can view the image of the speaker while listening to the speaker's voice.

発話者と受話者との言語が異なる場合、受話者は、発話者の音声を瞬時に翻訳して、テレビ電話機にて会話を行なうことになる。ところが、受話者が発話者の言語に対する知識に乏しい場合や、発話者の音声を瞬時に翻訳できない場合では、会話をスムーズに行なうことができない。この問題を解決するために、特許文献１には、発話者の音声を翻訳する技術が記載されている。 When the speaker and the receiver use different languages, the receiver has to translate the speaker's voice on the spot in order to hold a conversation over the videophone. However, if the receiver has little knowledge of the speaker's language, or cannot translate the speaker's voice on the spot, the conversation cannot proceed smoothly. To solve this problem, Patent Document 1 describes a technique for translating a speaker's voice.

特許文献１に記載された技術では、発話者のテレビ電話機と受話者のテレビ電話機との間に音声翻訳部（図示しない）を設けている。音声翻訳部は、発話者のテレビ電話機から送信された通話データに含まれる発話者の音声（例えば英語の音声）をテキストデータ（英語表記のテキストデータ）に変換する音声テキスト変換部（図示しない）と、そのテキストデータに表記された言語を受話者の指定言語（例えば日本語）に翻訳するテキスト言語翻訳部（図示しない）と、翻訳されたテキストデータを字幕として受話者のテレビ電話機の表示部の端（例えば下端）に表示されるように、通話データに含まれる発話者の画像に合成して、受話者のテレビ電話機に伝送する画像テキスト合成部（図示しない）と、を具備している。これにより、受話者は、発話者の音声（英語）を聞きながら、発話者の画像と共に、翻訳された字幕を見ることができる。 In the technique described in Patent Document 1, a speech translation unit (not shown) is provided between the speaker's videophone and the receiver's videophone. The speech translation unit includes: a speech-to-text conversion unit (not shown) that converts the speaker's voice (for example, English speech) included in the call data transmitted from the speaker's videophone into text data (text data written in English); a text language translation unit (not shown) that translates the language of the text data into the language specified by the receiver (for example, Japanese); and an image-text synthesis unit (not shown) that combines the translated text data, as subtitles, with the image of the speaker included in the call data so that the subtitles are displayed at an edge (for example, the lower edge) of the display unit of the receiver's videophone, and transmits the result to the receiver's videophone. The receiver can thus see the translated subtitles together with the image of the speaker while listening to the speaker's voice (English).

特開平５−２６０１９３号公報 JP-A-5-260193

しかしながら、特許文献１に記載された技術では、通話データに含まれる発話者の音声は翻訳されて、字幕として受話者のテレビ電話機に伝送されるが、通話データに含まれる発話者の画像は、そのまま受話者のテレビ電話機に伝送される。このため、相手（受話者）の国籍・文化・風習により、送信元の被写体（発話者）がとるジェスチャが、その相手には不適切なジェスチャであると受け取られ、コミュニケーションに齟齬をきたす可能性がある。例えば、日本では、相手に軽く挨拶をする際に「手のひらを見せるように手を上げる」動作を行なうことがあるが、このような動作は、欧州では「侮辱行為」として認識されることがある。 However, in the technique described in Patent Document 1, the speaker's voice included in the call data is translated and transmitted to the receiver's videophone as subtitles, whereas the speaker's image included in the call data is transmitted to the receiver's videophone unchanged. For this reason, depending on the nationality, culture, and customs of the other party (receiver), a gesture made by the subject at the transmission source (speaker) may be taken by the other party as an inappropriate gesture, and communication may break down. For example, in Japan a person may "raise a hand so as to show the palm" when lightly greeting someone, but in Europe this action may be perceived as an "insulting act".

本発明は、上記の点に鑑みてなされたものであり、送信元の被写体（発話者）がとるジェスチャを翻訳することにより、相手（受話者）に対して誤解を生むことなく、円滑なコミュニケーションを図ることができるテレビ電話機、及び、プログラムを提供することを目的とする。 The present invention has been made in view of the above points, and an object thereof is to provide a videophone and a program that enable smooth communication without causing the other party (receiver) to misunderstand, by translating gestures made by the subject at the transmission source (speaker).

本発明のテレビ電話機は、ジェスチャの特徴を示す参照用特徴データとそのジェスチャの翻訳を示す翻訳情報とを対応付けるジェスチャ辞書と、被写体の画像に基づいて前記被写体のジェスチャの特徴を示す撮影時特徴データを生成し、前記ジェスチャ辞書の翻訳情報の中から、前記撮影時特徴データに対応する翻訳情報を翻訳結果として選択するジェスチャ解析部と、前記被写体の画像に前記翻訳結果を合成して翻訳結果合成画像を生成する翻訳結果合成部と、を具備することを特徴とする。 The videophone of the present invention comprises: a gesture dictionary that associates reference feature data indicating the features of a gesture with translation information indicating the translation of that gesture; a gesture analysis unit that generates, on the basis of an image of a subject, shooting feature data indicating the features of the subject's gesture, and selects, from the translation information in the gesture dictionary, the translation information corresponding to the shooting feature data as a translation result; and a translation result synthesis unit that combines the translation result with the image of the subject to generate a translation result composite image.

本発明によれば、送信元の被写体（発話者）がとるジェスチャを翻訳することにより、相手（受話者）に対して誤解を生むことなく、円滑なコミュニケーションを図ることができる。 According to the present invention, translating gestures made by the subject at the transmission source (speaker) enables smooth communication without causing the other party (receiver) to misunderstand.

本発明の第１実施形態に係るテレビ電話機１の構成を示す概略ブロック図である。 FIG. 1 is a schematic block diagram showing the configuration of the videophone 1 according to the first embodiment of the present invention.
図２は、図１の構成を用いて、本発明の第１実施形態に係るテレビ電話機１の動作（テレビ電話処理）を説明するためのフローチャートである。 FIG. 2 is a flowchart for explaining the operation (videophone processing) of the videophone 1 according to the first embodiment of the present invention, using the configuration of FIG. 1.
図３は、本発明の第２実施形態に係るテレビ電話機１の構成を示す概略ブロック図である。 FIG. 3 is a schematic block diagram showing the configuration of the videophone 1 according to the second embodiment of the present invention.
図４は、図３の構成を用いて、本発明の第２実施形態に係るテレビ電話機１の動作（テレビ電話処理）を説明するためのフローチャートである。 FIG. 4 is a flowchart for explaining the operation (videophone processing) of the videophone 1 according to the second embodiment of the present invention, using the configuration of FIG. 3.
図５は、本発明の第３実施形態に係るテレビ電話機１におけるジェスチャ翻訳部３０の構成を示す概略ブロック図である。 FIG. 5 is a schematic block diagram showing the configuration of the gesture translation unit 30 in the videophone 1 according to the third embodiment of the present invention.
図６は、図５の判別用記憶部３６の例を示している。 FIG. 6 shows an example of the determination storage unit 36 of FIG. 5.
図５及び図６の構成を用いて、本発明の第３実施形態に係るテレビ電話機１の動作（テレビ電話処理）を説明するためのフローチャートである。 FIG. 7 is a flowchart for explaining the operation (videophone processing) of the videophone 1 according to the third embodiment of the present invention, using the configurations of FIG. 5 and FIG. 6.
本発明の第４実施形態に係るテレビ電話機１におけるジェスチャ翻訳部３０の辞書記憶部３１内のジェスチャ辞書の例を示している。 A figure showing an example of a gesture dictionary in the dictionary storage unit 31 of the gesture translation unit 30 in the videophone 1 according to the fourth embodiment of the present invention.
本発明の第４実施形態に係るテレビ電話機１の動作を説明するための図である。 A figure for explaining the operation of the videophone 1 according to the fourth embodiment of the present invention.
本発明の第５実施形態に係るテレビ電話機１におけるジェスチャ翻訳部３０の辞書記憶部３１内のジェスチャ辞書の例を示している。 A figure showing an example of a gesture dictionary in the dictionary storage unit 31 of the gesture translation unit 30 in the videophone 1 according to the fifth embodiment of the present invention.
本発明の第５実施形態に係るテレビ電話機１の動作を説明するための図である。 A figure for explaining the operation of the videophone 1 according to the fifth embodiment of the present invention.
本発明の第６実施形態に係るテレビ電話機１の動作を説明するための図である。 A figure for explaining the operation of the videophone 1 according to the sixth embodiment of the present invention.

以下、図面を参照しながら本発明の実施形態について説明する。本実施形態に係るテレビ電話機は、携帯電話機、スマートフォンなどに適用される。 Hereinafter, embodiments of the present invention will be described with reference to the drawings. The video phone according to the present embodiment is applied to a mobile phone, a smartphone, and the like.

［第１実施形態］ [First Embodiment]
図１は、本発明の第１実施形態に係るテレビ電話機１の構成を示す概略ブロック図である。テレビ電話機１は、カメラ１１と、送信部１２と、マイク１３と、受信部２１と、表示部２２と、スピーカ２３と、制御部（図示しない）と、記憶装置（図示しない）を具備している。 FIG. 1 is a schematic block diagram showing the configuration of the videophone 1 according to the first embodiment of the present invention. The videophone 1 includes a camera 11, a transmission unit 12, a microphone 13, a reception unit 21, a display unit 22, a speaker 23, a control unit (not shown), and a storage device (not shown).

制御部は、カメラ１１、送信部１２、マイク１３、受信部２１、表示部２２、スピーカ２３に対して制御を行う。制御部は、例えばＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）等から構成されている。記憶装置には、コンピュータが実行可能なコンピュータプログラムが格納され、ＣＰＵは、そのコンピュータプログラムを読み出して実行する。表示部２２としては、ＬＣＤ（ＬｉｑｕｉｄＣｒｙｓｔａｌＤｉｓｐｌａｙ）が例示される。 The control unit controls the camera 11, the transmission unit 12, the microphone 13, the reception unit 21, the display unit 22, and the speaker 23. The control unit includes, for example, a CPU (Central Processing Unit). The storage device stores a computer program executable by the computer, and the CPU reads and executes the computer program. As the display unit 22, an LCD (Liquid Crystal Display) is exemplified.

テレビ電話機１は、発話者と受話者に使用される。ここで、発話者が使用するテレビ電話機１を発話者のテレビ電話機１Ａと称し、受話者が使用するテレビ電話機１を受話者のテレビ電話機１Ｂと称する。 The video phone 1 is used for a speaker and a receiver. Here, the video phone 1 used by the speaker is referred to as the speaker's video phone 1A, and the video phone 1 used by the receiver is referred to as the receiver's video phone 1B.

図１では、説明の都合上、カメラ１１、送信部１２、マイク１３が発話者のテレビ電話機１Ａ内にのみ図示されている。カメラ１１は、被写体（発話者）を撮影する。マイク１３は、発話者の音声を取り込む。送信部１２は、カメラ１１により撮影された発話者の画像と、マイク１３により取り込まれた発話者の音声とを含む通話データを、電話網２を介して、受話者のテレビ電話機１Ｂに送信する。 In FIG. 1, for convenience of explanation, the camera 11, the transmission unit 12, and the microphone 13 are illustrated only in the video phone 1 A of the speaker. The camera 11 photographs a subject (speaker). The microphone 13 captures the voice of the speaker. The transmission unit 12 transmits call data including the image of the speaker captured by the camera 11 and the voice of the speaker captured by the microphone 13 to the receiver's video phone 1B via the telephone network 2. .

また、図１では、説明の都合上、受信部２１、表示部２２、スピーカ２３が受話者のテレビ電話機１Ｂ内にのみ図示されている。受信部２１は、発話者のテレビ電話機１Ａから送信された通話データを受信する。表示部２２は、受信部２１により受信された通話データに含まれる発話者の画像を表示する。スピーカ２３は、受信部２１により受信された通話データに含まれる発話者の音声を出力する。 In FIG. 1, for the sake of explanation, the receiving unit 21, the display unit 22, and the speaker 23 are shown only in the receiver's video phone 1 B. The receiving unit 21 receives call data transmitted from the videophone 1A of the speaker. The display unit 22 displays an image of the speaker included in the call data received by the receiving unit 21. The speaker 23 outputs the voice of the speaker included in the call data received by the receiving unit 21.

発話者のテレビ電話機１Ａと受話者のテレビ電話機１Ｂは、電話網２を介して接続されている。電話網２には、例えば、音声翻訳部４０が設けられている。音声翻訳部４０は、例えば、特許文献１に記載された技術と同じ構成であるものとし、本発明に関連する部分のみ説明する。 The videophone 1 A of the speaker and the videophone 1 B of the receiver are connected via the telephone network 2. The telephone network 2 is provided with a speech translation unit 40, for example. For example, the speech translation unit 40 is assumed to have the same configuration as the technique described in Patent Document 1, and only a portion related to the present invention will be described.

発話者のテレビ電話機１Ａは、更に、ジェスチャ翻訳部３０を具備している。そのジェスチャ翻訳部３０は、ソフトウェア（上述のコンピュータプログラム）により実現する。ジェスチャ翻訳部３０は、発話者のテレビ電話機１Ａ内の制御部により制御される。ジェスチャ翻訳部３０は、カメラ１１により撮影された画像内の被写体（発話者）のジェスチャを翻訳する。そのジェスチャ翻訳部３０は、辞書記憶部３１と、ジェスチャ辞書選択部３２と、ジェスチャ解析部３３と、翻訳結果合成部３４とを具備している。 The speaker's video phone 1 A further includes a gesture translation unit 30. The gesture translation unit 30 is realized by software (the above-described computer program). The gesture translation unit 30 is controlled by a control unit in the videophone 1A of the speaker. The gesture translation unit 30 translates the gesture of the subject (speaker) in the image taken by the camera 11. The gesture translation unit 30 includes a dictionary storage unit 31, a gesture dictionary selection unit 32, a gesture analysis unit 33, and a translation result synthesis unit 34.

辞書記憶部３１には、ジェスチャ辞書３１Ａ〜３１Ｄが言語別に登録されている。ジェスチャ辞書３１Ａ〜３１Ｄの各々には、利用者がジェスチャを行なうときの手の動きや顔の表情などを表す所作の特徴を示す参照用特徴データ（特徴ベクトル）と、そのジェスチャの翻訳（意味）を示す翻訳情報と、が複数種類対応付けられて登録されている。 In the dictionary storage unit 31, gesture dictionaries 31A to 31D are registered by language. In each of the gesture dictionaries 31A to 31D, multiple kinds of reference feature data (feature vectors), each indicating the features of a movement such as a hand motion or facial expression made when a user performs a gesture, are registered in association with translation information indicating the translation (meaning) of that gesture.
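The dictionary storage described above can be pictured as a per-language table of (feature vector, meaning) pairs. The following Python sketch is illustrative only: the class name `GestureEntry`, the language keys, the three-element vectors, and every entry other than the Japanese "light greeting" example mentioned in the background section are assumptions, not contents disclosed in the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GestureEntry:
    """One dictionary row: reference feature data paired with its translation."""
    reference_features: Tuple[float, ...]  # reference feature data (feature vector)
    translation: str                       # translation information (meaning)


# Hypothetical stand-in for the dictionary storage unit 31.
# One gesture dictionary per language; vectors and most meanings are placeholders.
DICTIONARY_STORAGE: Dict[str, List[GestureEntry]] = {
    "Japanese": [GestureEntry((1.0, 0.0, 0.0), "light greeting")],
    "English (US)": [GestureEntry((1.0, 0.0, 0.0), "placeholder meaning (US)")],
    "English (UK)": [GestureEntry((1.0, 0.0, 0.0), "placeholder meaning (UK)")],
    "Chinese": [GestureEntry((1.0, 0.0, 0.0), "placeholder meaning (CN)")],
}
```

Keying the storage by language keeps the selection step a plain lookup; the same physical movement can then carry a different meaning in each dictionary.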

ジェスチャ辞書選択部３２は、辞書記憶部３１に登録されたジェスチャ辞書３１Ａ〜３１Ｄの中から、発話者の使用言語に対応するジェスチャ辞書を選択ジェスチャ辞書として選択する。ジェスチャ辞書３１Ａ〜３１Ｄとしては日本語、英語（アメリカ）、英語（イギリス）、中国語などが例示されるが、これに限定されない。 The gesture dictionary selection unit 32 selects a gesture dictionary corresponding to the language used by the speaker as a selection gesture dictionary from the gesture dictionaries 31A to 31D registered in the dictionary storage unit 31. Examples of the gesture dictionaries 31A to 31D include Japanese, English (US), English (UK), and Chinese, but are not limited thereto.

ジェスチャ解析部３３は、発話者の画像に基づいて、発話者の所作の特徴を示す撮影時特徴データ（特徴ベクトル）を生成する。ジェスチャ解析部３３は、ジェスチャ辞書選択部３２により選択された選択ジェスチャ辞書を参照し、その選択ジェスチャ辞書の複数種類の翻訳情報の中から、撮影時特徴データに対応する翻訳情報を翻訳結果として選択する。 The gesture analysis unit 33 generates, on the basis of the speaker's image, shooting feature data (a feature vector) indicating the features of the speaker's movements. The gesture analysis unit 33 refers to the selected gesture dictionary chosen by the gesture dictionary selection unit 32 and selects, from the multiple kinds of translation information in that dictionary, the translation information corresponding to the shooting feature data as the translation result.
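The lookup performed by the gesture analysis unit can be sketched as follows. The specification does not say how a captured feature vector is judged to "correspond" to a reference vector, so this sketch assumes nearest-neighbour matching by Euclidean distance with a cut-off; the function name `select_translation` and the `max_distance` parameter are hypothetical. Feature extraction from the camera image is outside the sketch.

```python
import math
from typing import Optional, Sequence, Tuple

DictionaryRow = Tuple[Sequence[float], str]  # (reference feature vector, translation)


def select_translation(
    captured: Sequence[float],
    dictionary: Sequence[DictionaryRow],
    max_distance: float = 0.5,  # assumed tolerance, not taken from the specification
) -> Optional[str]:
    """Return the translation whose reference vector is closest to `captured`.

    Returns None when no reference vector lies within `max_distance`,
    i.e. when the movement is not treated as a registered gesture.
    """
    best_translation = None
    best_distance = max_distance
    for reference, translation in dictionary:
        distance = math.dist(captured, reference)
        if distance <= best_distance:
            best_translation, best_distance = translation, distance
    return best_translation
```

With a two-entry dictionary `[((1.0, 0.0), "greeting"), ((0.0, 1.0), "approval")]`, a captured vector `(0.9, 0.1)` selects "greeting", while `(0.5, 0.5)` is too far from both entries and yields no translation.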

翻訳結果合成部３４は、発話者の画像に翻訳結果を合成して、受話者のテレビ電話機１Ｂの表示部２２に表示するための翻訳結果合成画像を生成する。 The translation result synthesizing unit 34 synthesizes the translation result with the speaker's image and generates a translation result synthesized image to be displayed on the display unit 22 of the receiver's videophone 1B.
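As a rough illustration of the synthesis step, the sketch below overlays a translation result on a frame. To stay self-contained it models a frame as a list of equal-width text rows rather than pixel data, and it places the caption on the bottom row; both choices, and the function name `compose_translation`, are assumptions, since the specification does not state where or how the translation is drawn.

```python
from typing import List, Optional


def compose_translation(frame: List[str], translation: Optional[str]) -> List[str]:
    """Return a new frame with `translation` written over the bottom row.

    The input frame is left untouched. If there is no translation result
    (or the frame is empty), a plain copy of the frame is returned.
    """
    if not translation or not frame:
        return list(frame)
    width = len(frame[0])
    caption = translation[:width].center(width)  # clip, then centre the caption
    return frame[:-1] + [caption]
```

Returning a new frame instead of mutating the input lets the analysis step and the synthesis step work from the same captured image.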

図２は、図１の構成を用いて、本発明の第１実施形態に係るテレビ電話機１の動作（テレビ電話処理）を説明するためのフローチャートである。 FIG. 2 is a flowchart for explaining the operation (videophone processing) of the videophone 1 according to the first embodiment of the present invention, using the configuration of FIG.

まず、発話者のテレビ電話機１Ａにおいて、ジェスチャ翻訳部３０のジェスチャ辞書選択部３２は、ジェスチャ辞書取得要求を行なう（Ｓ０１）。ここで、Ｓ０１は、例えば発話者の操作に応じて発話者のテレビ電話機１Ａが発呼するときに実行される。Ｓ０１において、ジェスチャ辞書選択部３２は、辞書記憶部３１に登録されたジェスチャ辞書３１Ａ〜３１Ｄの中から、発話者の使用言語に対応するジェスチャ辞書（例えばジェスチャ辞書３１Ｂ）を選択ジェスチャ辞書として選択する（Ｓ０２）。ジェスチャ辞書選択部３２は、ジェスチャ翻訳部３０のジェスチャ解析部３３に翻訳準備完了を通知する（Ｓ０３）。図示省略するが、ジェスチャ翻訳部３０のジェスチャ解析部３３及び翻訳結果合成部３４が同じ画像を取得できるように、ジェスチャ辞書選択部３２は、翻訳結果合成部３４に対しても翻訳準備完了を通知する。 First, in the speaker's videophone 1A, the gesture dictionary selection unit 32 of the gesture translation unit 30 issues a gesture dictionary acquisition request (S01). Here, S01 is executed, for example, when the speaker's videophone 1A places a call in response to the speaker's operation. In S01, the gesture dictionary selection unit 32 selects, from the gesture dictionaries 31A to 31D registered in the dictionary storage unit 31, the gesture dictionary corresponding to the language used by the speaker (for example, the gesture dictionary 31B) as the selected gesture dictionary (S02). The gesture dictionary selection unit 32 then notifies the gesture analysis unit 33 of the gesture translation unit 30 that translation preparation is complete (S03). Although not shown in the figure, the gesture dictionary selection unit 32 also notifies the translation result synthesis unit 34 that translation preparation is complete, so that the gesture analysis unit 33 and the translation result synthesis unit 34 can acquire the same image.

次に、ジェスチャ解析部３３は、翻訳準備完了の通知に応じて、カメラ１１に対して発話者画像取得要求を行ない（Ｓ０４）、発話者の画像をカメラ１１から取得する（Ｓ０５）。ジェスチャ解析部３３は、カメラ１１からの発話者の画像に基づいて発話者の所作の特徴を示す撮影時特徴データを生成し、Ｓ０２により選択された選択ジェスチャ辞書（ジェスチャ辞書３１Ｂ）を参照する。次に、ジェスチャ解析部３３は、選択ジェスチャ辞書の複数種類の参照用特徴データの中から、撮影時特徴データに一致する参照用特徴データを選択参照用特徴データとして選択し、選択ジェスチャ辞書の複数種類の翻訳情報のうちの、選択参照用特徴データに対応する翻訳情報を、翻訳結果として、翻訳結果合成部３４に出力する（Ｓ０６）。 Next, in response to the notification of translation preparation completion, the gesture analysis unit 33 makes a speaker image acquisition request to the camera 11 (S04), and acquires a speaker image from the camera 11 (S05). The gesture analysis unit 33 generates shooting feature data indicating the feature of the speaker's work based on the speaker's image from the camera 11, and refers to the selected gesture dictionary (gesture dictionary 31B) selected in S02. Next, the gesture analysis unit 33 selects, as selection reference feature data, reference feature data that matches the shooting feature data from among a plurality of types of reference feature data in the selection gesture dictionary, and the plurality of selection gesture dictionaries. Of the types of translation information, the translation information corresponding to the selected reference feature data is output as a translation result to the translation result synthesis unit 34 (S06).

翻訳結果合成部３４は、翻訳準備完了の通知に応じて、カメラ１１に対して発話者画像取得要求を行ない（Ｓ０７）、発話者の画像をカメラ１１から取得する（Ｓ０８）。次に、翻訳結果合成部３４は、カメラ１１からの発話者の画像にジェスチャ解析部３３からの翻訳結果を合成して翻訳結果合成画像を生成し、送信部１２に出力する（Ｓ０９）。 In response to the notification of completion of translation preparation, the translation result synthesis unit 34 issues a speaker image acquisition request to the camera 11 (S07), and acquires a speaker image from the camera 11 (S08). Next, the translation result synthesizing unit 34 synthesizes the translation result from the gesture analyzing unit 33 with the image of the speaker from the camera 11 to generate a translation result synthesized image, and outputs it to the transmitting unit 12 (S09).

送信部１２は、翻訳結果合成部３４からの翻訳結果合成画像とマイク１３により取り込まれた発話者の音声とを含む通話データを受話者のテレビ電話機１Ｂに電話網２を介して送信する（Ｓ１０）。 The transmission unit 12 transmits call data, which includes the translation result composite image from the translation result synthesis unit 34 and the speaker's voice captured by the microphone 13, to the receiver's videophone 1B via the telephone network 2 (S10).

電話網２において、音声翻訳部４０の音声テキスト変換部は、発話者のテレビ電話機１Ａから送信された通話データに含まれる発話者の音声（例えば英語の音声）をテキストデータ（英語表記のテキストデータ）に変換する。音声翻訳部４０のテキスト言語翻訳部は、そのテキストデータに表記された言語を受話者の指定言語（例えば日本語）に翻訳する。音声翻訳部４０の画像テキスト合成部は、翻訳されたテキストデータを字幕として受話者のテレビ電話機１Ｂの表示部２２の端（例えば下端）に表示されるように、通話データに含まれる翻訳結果合成画像に合成して、受話者のテレビ電話機１Ｂに伝送する。 In the telephone network 2, the speech-to-text conversion unit of the speech translation unit 40 converts the speaker's voice (for example, English speech) included in the call data transmitted from the speaker's videophone 1A into text data (text data written in English). The text language translation unit of the speech translation unit 40 translates the language of the text data into the language specified by the receiver (for example, Japanese). The image-text synthesis unit of the speech translation unit 40 combines the translated text data, as subtitles, with the translation result composite image included in the call data so that the subtitles are displayed at an edge (for example, the lower edge) of the display unit 22 of the receiver's videophone 1B, and transmits the result to the receiver's videophone 1B.
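The three stages of the speech translation unit 40 (speech to text, text translation, subtitle synthesis) can be sketched as a small pipeline. The recognizer and translator are passed in as callables because the specification does not name any particular engine; the function names and the dictionary-based frame representation are assumptions made for illustration.

```python
from typing import Callable, Dict


def translate_speech_to_subtitle(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],       # stand-in for the speech-to-text conversion unit
    translate_text: Callable[[str, str], str],    # stand-in for the text language translation unit
    target_language: str,
) -> str:
    """Convert the speaker's audio to text, then translate it for the receiver."""
    text = speech_to_text(audio)
    return translate_text(text, target_language)


def attach_subtitle(frame: Dict[str, object], subtitle: str) -> Dict[str, object]:
    """Stand-in for the image-text synthesis unit: tag the frame with a subtitle.

    The subtitle position defaults to the lower edge, the example given in the text.
    """
    return {**frame, "subtitle": subtitle, "subtitle_position": "bottom"}
```

Because `attach_subtitle` copies the incoming frame, a gesture translation already combined into the image upstream is carried through unchanged alongside the subtitle.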

受話者のテレビ電話機１Ｂにおいて、受信部２１は、発話者のテレビ電話機１Ａから送信された通話データを受信し、通話データに含まれる翻訳結果合成画像を表示部２２に出力し、通話データに含まれる発話者の音声をスピーカ２３に出力する。表示部２２は、受信部２１からの翻訳結果合成画像を表示する。スピーカ２３は、受信部２１からの音声を出力する。 In the videophone 1B of the receiver, the receiving unit 21 receives the call data transmitted from the videophone 1A of the speaker, outputs the translation result composite image included in the call data to the display unit 22, and is included in the call data. The speaker's voice is output to the speaker 23. The display unit 22 displays the translation result composite image from the receiving unit 21. The speaker 23 outputs sound from the receiving unit 21.

Ｓ１０の後、通話が継続している場合、即ち、テレビ電話処理が継続している場合、Ｓ０４以降が実行され、通話が継続しない場合、テレビ電話処理は終了する。 After S10, when the call is continued, that is, when the videophone process is continued, S04 and subsequent steps are executed, and when the call is not continued, the videophone process is terminated.
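The control flow of steps S01 to S10, including the return to S04 while the call continues, can be summarised in a short loop. This is a sketch under assumptions: every unit is represented by a caller-supplied function, the names are hypothetical, and a single captured frame is shared by the analysis and synthesis steps, whereas the text has each unit request the image from the camera separately (S04/S05 and S07/S08).

```python
from typing import Any, Callable


def videophone_process(
    select_dictionary: Callable[[], Any],       # S01-S02: choose the gesture dictionary
    capture_frame: Callable[[], Any],           # S04-S05 / S07-S08: get the speaker's image
    extract_features: Callable[[Any], Any],     # part of S06: build shooting feature data
    match: Callable[[Any, Any], Any],           # part of S06: look up the translation result
    compose: Callable[[Any, Any], Any],         # S09: build the translation result composite image
    send: Callable[[Any], None],                # S10: transmit to the receiver's videophone
    call_active: Callable[[], bool],            # continuation check after S10
) -> None:
    dictionary = select_dictionary()            # S03 (ready notification) is implicit here
    while True:
        frame = capture_frame()
        result = match(extract_features(frame), dictionary)
        send(compose(frame, result))
        if not call_active():                   # call ended: finish videophone processing
            break
```

The continuation check sits at the end of the loop body because the flowchart evaluates it after S10, so at least one frame is always processed.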

このように、本発明の第１実施形態に係るテレビ電話機１では、送信元の被写体（発話者）がとるジェスチャを発話者のテレビ電話機１Ａで翻訳することにより、相手（受話者）に対して誤解を生むことなく、円滑なコミュニケーションを図ることができる。 As described above, in the videophone 1 according to the first embodiment of the present invention, by translating the gesture taken by the subject (speaker) of the transmission source on the videophone 1A of the speaker, Smooth communication can be achieved without causing misunderstandings.

［第２実施形態］ [Second Embodiment]
第２実施形態では、第１実施形態からの変更点のみ説明する。 In the second embodiment, only the changes from the first embodiment will be described.

図３は、本発明の第２実施形態に係るテレビ電話機１の構成を示す概略ブロック図である。第１実施形態では、ジェスチャ翻訳部３０は発話者のテレビ電話機１Ａに設けられているが、第２実施形態では、ジェスチャ翻訳部３０は受話者のテレビ電話機１Ｂに設けられている。 FIG. 3 is a schematic block diagram showing the configuration of the videophone 1 according to the second embodiment of the present invention. In the first embodiment, the gesture translation unit 30 is provided in the video phone 1A of the speaker. In the second embodiment, the gesture translation unit 30 is provided in the video phone 1B of the receiver.

図４は、図３の構成を用いて、本発明の第２実施形態に係るテレビ電話機１の動作（テレビ電話処理）を説明するためのフローチャートである。 FIG. 4 is a flowchart for explaining the operation (videophone processing) of the videophone 1 according to the second embodiment of the present invention, using the configuration of FIG.

まず、受話者のテレビ電話機１Ｂにおいて、ジェスチャ翻訳部３０のジェスチャ辞書選択部３２は、ジェスチャ辞書取得要求を行なう（Ｓ０１）。ここで、Ｓ０１は、例えば発話者のテレビ電話機１Ａから受話者のテレビ電話機１Ｂに対して発呼があったときに受話者の操作に応じて実行される。Ｓ０１において、ジェスチャ辞書選択部３２は、辞書記憶部３１に登録されたジェスチャ辞書３１Ａ〜３１Ｄの中から、発話者の使用言語に対応するジェスチャ辞書（例えばジェスチャ辞書３１Ｂ）を選択ジェスチャ辞書として選択する（Ｓ０２）。ジェスチャ辞書選択部３２は、ジェスチャ翻訳部３０のジェスチャ解析部３３及び翻訳結果合成部３４に翻訳準備完了を通知する（Ｓ０３）。 First, in the videophone 1B of the receiver, the gesture dictionary selection unit 32 of the gesture translation unit 30 makes a gesture dictionary acquisition request (S01). Here, S01 is executed according to the operation of the receiver when, for example, a call is made from the speaker's video phone 1A to the receiver's video phone 1B. In S01, the gesture dictionary selection unit 32 selects, from the gesture dictionaries 31A to 31D registered in the dictionary storage unit 31, a gesture dictionary (for example, the gesture dictionary 31B) corresponding to the language used by the speaker as the selection gesture dictionary. (S02). The gesture dictionary selection unit 32 notifies the gesture analysis unit 33 and the translation result synthesis unit 34 of the gesture translation unit 30 that translation preparation is complete (S03).

いま、発話者のテレビ電話機１Ａにおいて、送信部１２は、カメラ１１により撮影された発話者の画像とマイク１３により取り込まれた発話者の音声とを含む通話データを受話者のテレビ電話機１Ｂに電話網２を介して送信する。 Now, in the video phone 1A of the speaker, the transmitter 12 calls the receiver's video phone 1B with call data including the image of the speaker captured by the camera 11 and the voice of the speaker captured by the microphone 13. It transmits via the network 2.

電話網２において、音声翻訳部４０の音声テキスト変換部は、発話者のテレビ電話機１Ａから送信された通話データに含まれる発話者の音声（例えば英語の音声）をテキストデータ（英語表記のテキストデータ）に変換する。音声翻訳部４０のテキスト言語翻訳部は、そのテキストデータに表記された言語を受話者の指定言語（例えば日本語）に翻訳する。音声翻訳部４０の画像テキスト合成部は、翻訳されたテキストデータを字幕として受話者のテレビ電話機１Ｂの表示部２２の端（例えば下端）に表示されるように、通話データに含まれる発話者の画像に合成して、受話者のテレビ電話機１Ｂに伝送する。 In the telephone network 2, the speech-to-text conversion unit of the speech translation unit 40 converts the speaker's voice (for example, English speech) included in the call data transmitted from the speaker's videophone 1A into text data (text data written in English). The text language translation unit of the speech translation unit 40 translates the language of the text data into the language specified by the receiver (for example, Japanese). The image-text synthesis unit of the speech translation unit 40 combines the translated text data, as subtitles, with the image of the speaker included in the call data so that the subtitles are displayed at an edge (for example, the lower edge) of the display unit 22 of the receiver's videophone 1B, and transmits the result to the receiver's videophone 1B.

受話者のテレビ電話機１Ｂにおいて、受信部２１は、発話者のテレビ電話機１Ａから送信された通話データを受信する。 In the receiver's videophone 1B, the receiving unit 21 receives the call data transmitted from the speaker's videophone 1A.

次に、ジェスチャ翻訳部３０のジェスチャ解析部３３は、翻訳準備完了の通知に応じて、受信部２１に対して発話者画像取得要求を行なう（Ｓ０４）。受信部２１は、ジェスチャ解析部３３からの発話者画像取得要求に応じて、受信した通話データに含まれる発話者の画像をジェスチャ解析部３３に出力し、ジェスチャ解析部３３は、その画像を受信部２１から取得する（Ｓ０５）。ジェスチャ解析部３３は、受信部２１からの発話者の画像に基づいて発話者の所作の特徴を示す撮影時特徴データを生成し、Ｓ０２により選択された選択ジェスチャ辞書（ジェスチャ辞書３１Ｂ）を参照する。次に、ジェスチャ解析部３３は、選択ジェスチャ辞書の複数種類の参照用特徴データの中から、撮影時特徴データに一致する参照用特徴データを選択参照用特徴データとして選択し、選択ジェスチャ辞書の複数種類の翻訳情報のうちの、選択参照用特徴データに対応する翻訳情報を、翻訳結果として、翻訳結果合成部３４に出力する（Ｓ０６）。 Next, the gesture analysis unit 33 of the gesture translation unit 30 makes a speaker image acquisition request to the reception unit 21 in response to the notification of completion of translation preparation (S04). In response to the speaker image acquisition request from the gesture analysis unit 33, the reception unit 21 outputs the image of the speaker included in the received call data to the gesture analysis unit 33, and the gesture analysis unit 33 receives the image. Obtained from the unit 21 (S05). The gesture analysis unit 33 generates shooting feature data indicating the features of the speaker's work based on the image of the speaker from the reception unit 21, and refers to the selected gesture dictionary (gesture dictionary 31B) selected in S02. . Next, the gesture analysis unit 33 selects, as selection reference feature data, reference feature data that matches the shooting feature data from among a plurality of types of reference feature data in the selection gesture dictionary, and the plurality of selection gesture dictionaries. Of the types of translation information, the translation information corresponding to the selected reference feature data is output as a translation result to the translation result synthesis unit 34 (S06).

翻訳結果合成部３４は、翻訳準備完了の通知に応じて、受信部２１に対して発話者画像取得要求を行なう（Ｓ０７）。受信部２１は、翻訳結果合成部３４からの発話者画像取得要求に応じて、受信した通話データに含まれる発話者の画像を翻訳結果合成部３４に出力し、翻訳結果合成部３４は、その画像を受信部２１から取得する（Ｓ０８）。次に、翻訳結果合成部３４は、受信部２１からの発話者の画像にジェスチャ解析部３３からの翻訳結果を合成して翻訳結果合成画像を生成し、表示部２２に出力する。表示部２２は、翻訳結果合成画像を表示する（Ｓ２１）。同時に、スピーカ２３は、通話データに含まれる発話者の音声を出力する。 In response to the notification that translation preparation is complete, the translation result synthesis unit 34 issues a speaker image acquisition request to the reception unit 21 (S07). In response to the speaker image acquisition request from the translation result synthesis unit 34, the reception unit 21 outputs the image of the speaker included in the received call data to the translation result synthesis unit 34, and the translation result synthesis unit 34 acquires that image from the reception unit 21 (S08). Next, the translation result synthesis unit 34 combines the translation result from the gesture analysis unit 33 with the image of the speaker from the reception unit 21 to generate a translation result composite image, and outputs it to the display unit 22. The display unit 22 displays the translation result composite image (S21). At the same time, the speaker 23 outputs the speaker's voice included in the call data.

Ｓ２１の後、通話が継続している場合、即ち、テレビ電話処理が継続している場合、Ｓ０４以降が実行され、通話が継続しない場合、テレビ電話処理は終了する。 After S21, when the call is continued, that is, when the videophone process is continued, S04 and subsequent steps are executed, and when the call is not continued, the videophone process is terminated.

このように、本発明の第２実施形態に係るテレビ電話機１では、送信元の被写体（発話者）がとるジェスチャを受話者のテレビ電話機１Ｂで翻訳することにより、相手（受話者）に対して誤解を生むことなく、円滑なコミュニケーションを図ることができる。 As described above, in the videophone 1 according to the second embodiment of the present invention, the gesture taken by the subject (speaker) of the transmission source is translated by the receiver's videophone 1B, so that the other party (receiver) can be translated. Smooth communication can be achieved without causing misunderstandings.

［第３実施形態］ [Third Embodiment]
第３実施形態では、第１又は第２実施形態からの変更点のみ説明する。 In the third embodiment, only the changes from the first or second embodiment will be described.

図５は、本発明の第３実施形態に係るテレビ電話機１におけるジェスチャ翻訳部３０の構成を示す概略ブロック図である。そのジェスチャ翻訳部３０は、更に、電話帳記憶部３５と、判別用記憶部３６と、判別部３７とを具備している。 FIG. 5 is a schematic block diagram showing the configuration of the gesture translation unit 30 in the videophone 1 according to the third embodiment of the present invention. The gesture translation unit 30 further includes a telephone directory storage unit 35, a determination storage unit 36, and a determination unit 37.

電話帳記憶部３５には、通話に用いられる識別子（例えば、電話番号や、ＩＰ電話などではネットワークアドレスやアカウント名）と使用言語に関連する属性情報とが利用者毎に対応付けられて登録されている。属性情報は、利用者の住所、利用者の国籍、利用者の所在地が判別可能な識別情報（ＩＰアドレスなど）を少なくとも含む。 In the telephone directory storage unit 35, an identifier used for calls (for example, a telephone number, or a network address or account name in the case of IP telephony) and attribute information related to the language used are registered in association with each other for each user. The attribute information includes at least the user's address, the user's nationality, and identification information (such as an IP address) from which the user's location can be determined.

判別用記憶部３６には、属性情報５１と、選択すべきジェスチャ辞書を示す辞書情報５２とが対応付けられて登録されている。辞書情報５２は、属性情報５１別に判別用記憶部３６に登録されている。図６は、図５の判別用記憶部３６の例を示している。図６に示されるように、属性情報５１は国籍を示している。その国籍としては日本、アメリカ、イギリス、中国などが例示されるが、これに限定されない。その属性情報５１（国籍）が示す日本、アメリカ、イギリス、中国に対して、辞書情報５２が示すジェスチャ辞書は、日本語のジェスチャ辞書、英語（アメリカ）のジェスチャ辞書、英語（イギリス）のジェスチャ辞書、中国語のジェスチャ辞書であるものとする。 In the determination storage unit 36, attribute information 51 and dictionary information 52 indicating the gesture dictionary to be selected are registered in association with each other. The dictionary information 52 is registered in the determination storage unit 36 for each piece of attribute information 51. FIG. 6 shows an example of the determination storage unit 36 of FIG. 5. As shown in FIG. 6, the attribute information 51 indicates nationality. Examples of the nationality include, but are not limited to, Japan, the United States, the United Kingdom, and China. For Japan, the United States, the United Kingdom, and China indicated by the attribute information 51 (nationality), the gesture dictionaries indicated by the dictionary information 52 are assumed to be the Japanese gesture dictionary, the English (US) gesture dictionary, the English (UK) gesture dictionary, and the Chinese gesture dictionary, respectively.

判別部３７は、発話者と受話者とが通話を行なうときに、電話帳記憶部３５に登録された属性情報の中から、通話に用いられる発話者の識別子（ここでは、電話番号とする）に対応する属性情報を選択属性情報として取得する。このとき、判別部３７は、判別用記憶部３６に登録された属性情報５１の中から、選択属性情報に一致する属性情報５１を検索し、判別部３７は、判別用記憶部３６に登録された辞書情報５２の中から、検索された属性情報５１に対応する辞書情報５２を選択辞書情報として取得する。 When the speaker and the receiver hold a call, the determination unit 37 acquires, from the attribute information registered in the telephone directory storage unit 35, the attribute information corresponding to the speaker's identifier used for the call (here, a telephone number) as the selected attribute information. The determination unit 37 then searches the attribute information 51 registered in the determination storage unit 36 for the attribute information 51 that matches the selected attribute information, and acquires, from the dictionary information 52 registered in the determination storage unit 36, the dictionary information 52 corresponding to the retrieved attribute information 51 as the selected dictionary information.

As a result, the gesture dictionary selection unit 32 automatically selects, as the selected gesture dictionary, the gesture dictionary indicated by the selected dictionary information from among the gesture dictionaries 31A to 31D registered in the dictionary storage unit 31.
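The two-stage lookup performed by the determination unit 37 can be sketched as follows. This is only an illustrative sketch, not the implementation described in the specification: the telephone numbers, the table names `PHONE_BOOK` and `DICTIONARY_INFO`, and the function `select_dictionary_info` are hypothetical, and the behavior for an unregistered number (returning `None`) is an assumption, since the text does not say what happens in that case.

```python
from typing import Optional

# Hypothetical contents of the telephone directory storage unit 35:
# call identifier (telephone number) -> attribute information (nationality).
PHONE_BOOK = {
    "+1-555-0100": "United States",
    "+81-3-5550-0100": "Japan",
}

# Hypothetical contents of the determination storage unit 36 (cf. FIG. 6):
# attribute information 51 -> dictionary information 52.
DICTIONARY_INFO = {
    "Japan": "Japanese",
    "United States": "English (United States)",
    "United Kingdom": "English (United Kingdom)",
    "China": "Chinese",
}


def select_dictionary_info(phone_number: str) -> Optional[str]:
    """Return the selected dictionary information for a caller, or None."""
    # Stage 1: identifier -> selected attribute information.
    nationality = PHONE_BOOK.get(phone_number)
    if nationality is None:
        # Assumption: an unregistered caller yields no automatic selection.
        return None
    # Stage 2: selected attribute information -> selected dictionary information.
    return DICTIONARY_INFO.get(nationality)
```

The value returned here plays the role of the selected dictionary information that is passed to the gesture dictionary selection unit 32.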

FIG. 7 is a flowchart for explaining the operation (videophone processing) of the videophone 1 according to the third embodiment of the present invention, which uses the configuration of FIGS. 5 and 6.

First, when the speaker and the receiver make a call, the determination unit 37 of the gesture translation unit 30 in the speaker's videophone 1A or the receiver's videophone 1B issues a speaker nationality acquisition request (S31). S31 is executed, for example, when the speaker's videophone 1A places a call in response to an operation by the speaker. Alternatively, S31 is executed in response to an operation by the receiver when, for example, a call arrives at the receiver's videophone 1B from the speaker's videophone 1A. The determination unit 37 acquires, as the selected nationality, the nationality (for example, [United States]) that corresponds to the speaker's telephone number from among the nationalities registered in the telephone directory storage unit 35 (S32). Next, the determination unit 37 acquires, as the selected dictionary information, the dictionary information 52 [English (United States) gesture dictionary] that corresponds to the selected nationality [United States] from among the dictionary information 52 registered in the determination storage unit 36, and outputs it to the gesture dictionary selection unit 32 (S33).

Next, the gesture dictionary selection unit 32 of the gesture translation unit 30 issues a gesture dictionary acquisition request in accordance with the selected dictionary information [English (United States) gesture dictionary] from the determination unit 37 (S34). In S34, the gesture dictionary selection unit 32 automatically selects, as the selected gesture dictionary, the English (United States) gesture dictionary (for example, the gesture dictionary 31B) indicated by the selected dictionary information from among the gesture dictionaries 31A to 31D registered in the dictionary storage unit 31 (S35). The gesture dictionary selection unit 32 then notifies the gesture analysis unit 33 and the translation result synthesis unit 34 of the gesture translation unit 30 that preparation for translation is complete (S36).

After S36, S04 and the subsequent steps are executed.

Thus, in the videophone 1 according to the third embodiment of the present invention, the gesture dictionary corresponding to the language used by the speaker is selected automatically when the speaker and the receiver make a call, which saves the speaker or the receiver the effort of performing the associated operation.

[Fourth Embodiment]
In the fourth embodiment, only the changes from the first to third embodiments are described.

FIG. 8 shows an example of a gesture dictionary in the dictionary storage unit 31 of the gesture translation unit 30 in the videophone 1 according to the fourth embodiment of the present invention. The gesture dictionaries 31A to 31D are registered in the dictionary storage unit 31 for each language. In each of the gesture dictionaries 31A to 31D, a plurality of types of reference feature data 53 (feature vectors), which indicate the features of the motion a user makes when performing a gesture, are registered in association with translation information 54, which indicates the translation (meaning) of that gesture. As shown in FIG. 8, the translation information 54 is a character string that expresses the meaning of the gesture.
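A dictionary of this kind, and the selection of a translation result from it, can be sketched as follows. The specification does not state how the feature data generated at shooting time is matched against the reference feature data 53, so the nearest-neighbor search with a Euclidean distance and a rejection threshold used here is an assumption, as are the three-element vectors, the sample strings, and the names `GESTURE_DICTIONARY`, `translate_gesture`, and `max_distance`.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical gesture dictionary: pairs of reference feature data 53
# (feature vector) and translation information 54 (character string).
GESTURE_DICTIONARY: List[Tuple[Tuple[float, ...], str]] = [
    ((1.0, 0.0, 0.0), "Hello"),
    ((0.0, 1.0, 0.0), "OK"),
    ((0.0, 0.0, 1.0), "No"),
]


def translate_gesture(
    features: Sequence[float],
    dictionary: Sequence[Tuple[Sequence[float], str]] = GESTURE_DICTIONARY,
    max_distance: float = 0.5,
) -> Optional[str]:
    """Return the translation whose reference vector is closest to `features`.

    Returns None when no reference vector lies within `max_distance`
    (assumed behavior for a motion that is not a registered gesture).
    """
    best_text: Optional[str] = None
    best_distance = max_distance
    for reference, text in dictionary:
        distance = math.dist(features, reference)  # Euclidean distance
        if distance <= best_distance:
            best_text, best_distance = text, distance
    return best_text
```

Because the dictionaries are registered per language, the same feature vector could map to a different string in each of the gesture dictionaries 31A to 31D; only the table passed as `dictionary` would change.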

In this case, when synthesizing the translation result with the image of the speaker, the translation result synthesis unit 34 of the gesture translation unit 30 synthesizes the translation information 54, which represents a character string, with the image of the speaker in a form such as that shown in FIG. 9. It is preferable to synthesize the translation information 54 at a position in the image of the speaker that does not overlap the caption that is synthesized with that image.
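One way to choose a position that does not overlap the caption is to test a short list of candidate rectangles against the caption area. The sketch below is an assumption about how such a placement could be done, not a method given in the specification; the `Rect` type, the candidate order (bottom center, then top center), the `margin` parameter, and the function names are all hypothetical.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    """Axis-aligned rectangle in pixels; (x, y) is the top-left corner."""
    x: int
    y: int
    w: int
    h: int


def overlaps(a: Rect, b: Rect) -> bool:
    """Return True if the two rectangles share any area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def place_translation(frame_w: int, frame_h: int, text_w: int, text_h: int,
                      caption: Optional[Rect], margin: int = 10) -> Optional[Rect]:
    """Pick a rectangle for the translation text that avoids the caption."""
    x = (frame_w - text_w) // 2  # horizontally centered
    candidates = [
        Rect(x, frame_h - text_h - margin, text_w, text_h),  # bottom center
        Rect(x, margin, text_w, text_h),                     # top center
    ]
    for candidate in candidates:
        if caption is None or not overlaps(candidate, caption):
            return candidate
    return None  # no candidate is free of the caption
```

With a caption occupying the bottom strip of the frame, the function falls back to the top-center candidate.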

[Fifth Embodiment]
In the fifth embodiment, only the changes from the first to third embodiments are described.

FIG. 10 shows an example of a gesture dictionary in the dictionary storage unit 31 of the gesture translation unit 30 in the videophone 1 according to the fifth embodiment of the present invention. The gesture dictionaries 31A to 31D are registered in the dictionary storage unit 31 for each language and for each user. In each of the gesture dictionaries 31A to 31D, a plurality of types of reference feature data 53 (feature vectors), which indicate the features of the motion the user makes when performing a gesture, are registered in association with translation information 55, which indicates the translation (meaning) of that gesture. The translation information 55 is an avatar image, that is, an image of the user prepared in advance for the user's gesture. The image of the user prepared in advance is, for example, an image file of the user captured beforehand with the camera 11.

In this case, when the gesture analysis unit 33 of the gesture translation unit 30 has generated the shooting-time feature data (feature vector), it selects, as the translation result, the translation information 55 representing the avatar image that corresponds to the shooting-time feature data of the speaker from among the translation information 55 representing the plurality of types of avatar images in the selected gesture dictionary (for example, the gesture dictionary 31B). When synthesizing the translation result with the image of the speaker, the translation result synthesis unit 34 replaces the part of the image of the speaker other than the background image with the translation information 55 representing the avatar image of the speaker, in a form such as that shown in FIG. 11. The avatar image is preferably one in which the facial expression of the speaker appears prominently.
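The replacement of the non-background part of the image can be sketched as a masked copy. This is a simplified illustration under several assumptions: the frame and the avatar image are the same size, each is a list of rows of pixel values, and a foreground mask marking the speaker is already available (the specification does not describe how the background is separated). The function name `replace_foreground` is hypothetical.

```python
from typing import List, Sequence, TypeVar

Pixel = TypeVar("Pixel")


def replace_foreground(
    frame: Sequence[Sequence[Pixel]],
    avatar: Sequence[Sequence[Pixel]],
    foreground_mask: Sequence[Sequence[bool]],
) -> List[List[Pixel]]:
    """Return a new image: avatar pixels where the mask is True, frame pixels elsewhere."""
    return [
        [avatar_px if is_foreground else frame_px
         for frame_px, avatar_px, is_foreground in zip(frame_row, avatar_row, mask_row)]
        for frame_row, avatar_row, mask_row in zip(frame, avatar, foreground_mask)
    ]
```

The background pixels of the captured frame are kept unchanged, so only the region occupied by the speaker shows the avatar.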

[Sixth Embodiment]
In the sixth embodiment, only the changes from the first to fifth embodiments are described.

In the videophone 1 of the present invention, the gesture translation unit 30 is provided in the speaker's videophone 1A or the receiver's videophone 1B, but the present invention is not limited to this. Each videophone 1 may be provided with a plurality of gesture translation units 30, and the receiver's videophone 1B may display the images from the videophones 1A of a plurality of speakers on the display unit 22 either together on a single screen or individually by switching screens.

For example, as shown in FIG. 12, when first to third speakers and the receiver hold a videophone call at the same time, first to third translation result composite images are displayed on the display unit 22 of the receiver's videophone 1B. In the first translation result composite image, a first translation result, which is the translation result of the first speaker's gesture, is synthesized with the image of the first speaker. In the second translation result composite image, a second translation result, which is the translation result of the second speaker's gesture, is synthesized with the image of the second speaker. In the third translation result composite image, a third translation result, which is the translation result of the third speaker's gesture, is synthesized with the image of the third speaker.

A computer program that runs on the videophone 1 of the present invention (hereinafter referred to as the program) is a program that controls a CPU or the like (a program that causes a computer to function) so as to realize the functions of the above embodiments of the present invention. The information handled by these devices is temporarily stored in RAM during processing, is then stored in various ROMs or an HDD, and is read, modified, and written by the CPU as necessary. The recording medium that stores the program may be any of a semiconductor medium (for example, a ROM or a nonvolatile memory card), an optical storage medium (for example, a DVD, MO, MD, CD, or BD), a magnetic recording medium (for example, a magnetic tape or a flexible disk), and the like. In addition to the functions of the above embodiments being realized by executing the loaded program, the functions of the present invention may also be realized by processing performed jointly with an operating system, another application program, or the like on the basis of the instructions of the program.

For distribution on the market, the program can be stored on a portable recording medium and distributed, or transferred to a server computer connected via a network such as the Internet. In this case, the recording device of the server computer is also included in the present invention. Some or all of the functional blocks of the transmitting station apparatus and the receiving station apparatus in the above embodiments may each be implemented as an individual processor, or some or all of them may be integrated and implemented as a single processor. The method of circuit integration is not limited to LSI, and may be realized with a dedicated circuit or a general-purpose processor. Furthermore, if advances in semiconductor technology give rise to an integrated circuit technology that replaces LSI, an integrated circuit based on that technology can also be used.

1 ... Videophone,
1A ... Speaker's videophone,
1B ... Receiver's videophone,
2 ... Telephone network,
11 ... Camera,
12 ... Transmission unit,
13 ... Microphone,
21 ... Receiving unit,
22 ... Display unit,
23 ... Speaker,
30 ... Gesture translation unit,
31 ... Dictionary storage unit,
31A to 31D ... Gesture dictionary,
32 ... Gesture dictionary selection unit,
33 ... Gesture analysis unit,
34 ... Translation result synthesis unit,
35 ... Telephone directory storage unit,
36 ... Determination storage unit,
37 ... Determination unit,
40 ... Speech translation unit,
51 ... Attribute information,
52 ... Dictionary information,
53 ... Reference feature data,
54 ... Translation information (character string),
55 ... Translation information (image file)

Claims

A videophone comprising:
a gesture dictionary in which reference feature data indicating features of a gesture is associated with translation information indicating a translation of the gesture;
a gesture analysis unit that generates shooting-time feature data indicating features of a gesture of a subject on the basis of an image of the subject, and selects, as a translation result, the translation information corresponding to the shooting-time feature data from among the translation information of the gesture dictionary; and
a translation result synthesis unit that synthesizes the translation result with the image of the subject to generate a translation result composite image.

The videophone according to claim 1, further comprising:
a receiving unit that receives the image of the subject; and
a display unit that displays the translation result composite image.

The videophone according to claim 1, further comprising:
a transmission unit that transmits the translation result composite image.

The videophone according to claim 1, further comprising:
a dictionary storage unit in which a plurality of gesture dictionaries are registered;
a telephone directory storage unit in which an identifier used for a call and attribute information relating to a language used are registered in association with each other;
a determination storage unit in which dictionary information indicating the gesture dictionary to be selected is registered for each item of attribute information;
a determination unit that acquires, as selected attribute information, the attribute information corresponding to the identifier used for the call from among the attribute information registered in the telephone directory storage unit, and acquires, as selected dictionary information, the dictionary information corresponding to the selected attribute information from among the dictionary information registered in the determination storage unit; and
a gesture dictionary selection unit that selects, as the gesture dictionary, the gesture dictionary indicated by the selected dictionary information from among the plurality of gesture dictionaries registered in the dictionary storage unit.

A computer program that causes a computer to execute the steps of:
generating shooting-time feature data indicating features of a gesture of a subject on the basis of an image of the subject;
referring to a gesture dictionary in which reference feature data indicating features of a gesture is associated with translation information indicating a translation of the gesture, and selecting, as a translation result, the translation information corresponding to the shooting-time feature data from among the translation information of the gesture dictionary; and
synthesizing the translation result with the image of the subject to generate a translation result composite image.