JP2008065653A - Video translation device

Info

Publication number: JP2008065653A
Application number: JP2006243697A
Authority: JP
Inventors: Takaaki Yamazaki; 隆朗山崎
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2006-09-08
Filing date: 2006-09-08
Publication date: 2008-03-21

Abstract

PROBLEM TO BE SOLVED: To translate voice or characters contained in a video into another language according to the video and position information.

SOLUTION: This video translation device includes: a voice information extracting part 104 for extracting sound from video data 103 containing position information 101 and video 102; an image information extracting part 105 for extracting images; a position information extracting part 106 for extracting positions; a voice analysis part 107 for analyzing voice contained in the sound; an image analysis part 108 for analyzing characters contained in the images; an input language selecting part 109 for selecting an input language corresponding to the position information from a position language kind DB 110; an output language selecting part 111 for selecting an output language; a voice translating part 112 for translating the voice; an image translating part 113 for translating the characters; a voice information dubbing part 114 for dubbing the voice with the translated voice; an image information superposing part 115 for superposing a translated caption; and a video output part 116 for displaying the video after the translation.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a video translation apparatus, and more particularly to a video translation apparatus capable of performing translation in accordance with position information included in captured video data.

The languages used in the conversations, broadcasts, signboard text, and the like that a traveler sees and hears differ depending on the country or region visited. Conventionally, there has been an information processing apparatus that detects the user's position and, based on the language corresponding to that position, performs translation without the language having to be set again each time the user moves to a new place (see, for example, Patent Document 1).
Patent Document 1: JP 2000-194698 A

However, the conventional information processing apparatus described above has the problem that, when shooting video that needs translation in a travel destination country, the user must go to the trouble of carrying a translation apparatus in addition to a shooting apparatus such as a digital video camera.

There are shooting apparatuses that, when shooting video, record position information acquired by GPS in association with the video, so that the position of the video can be known at playback time.

The present invention is an apparatus that performs translation based on the language corresponding to position information included in captured video data. Because it takes already-captured data as its input, it eliminates the need to carry a translation apparatus while shooting.

The video translation apparatus according to claim 1 is characterized by comprising: extraction means for extracting a position from video data including position information; input language selection means for selecting an input language corresponding to the position extracted by the extraction means; output language selection means with which a user arbitrarily selects an output language of another type; and translation means for executing video translation processing based on the input language selected by the input language selection means and the output language selected by the output language selection means.

The video translation apparatus according to claim 2 is characterized by further comprising: audio extraction means for extracting audio from the video data; analysis means for analyzing speech contained in the audio extracted by the audio extraction means; and speech translation means for translating the speech analyzed by the analysis means, based on the output language selected by the output language selection means.

The video translation apparatus according to claim 3 is characterized by further comprising output means for outputting video in which the speech contained in the original video is dubbed with the speech translated by the audio output means.

The video translation apparatus according to claim 4 is characterized by further comprising output means for translating the speech translated by the speech translation means into the output language selected by the output language selection means, and outputting video in which the result is superimposed on the original video as a telop (on-screen caption).

The video translation apparatus according to claim 5 is characterized by further comprising: image extraction means for extracting an image from the video data; and character analysis means for analyzing characters contained in the image extracted by the image extraction means, wherein the translation means translates the characters analyzed by the character analysis means.

The video translation apparatus according to claim 6 is characterized by further comprising output means for outputting video in which the characters translated by the translation means are superimposed on the original video as a telop.

The video translation apparatus according to claim 7 is characterized by further comprising output means for converting the characters translated by the translation means into speech based on the output language selected by the output language selection means, and outputting video in which that speech is added to the original video.

The video translation apparatus according to claim 8 is characterized in that the position information extracted by the extraction means is acquired by GPS.

According to the present invention, video can be translated after shooting without the photographer having to be aware, at shooting time, of the country or region in which the video is being shot. Even when video is shot across countries and regions where several languages are intermixed, the language can be determined from the continually changing position information and the video translated accordingly. This has the advantage that video data shot at home or abroad for travel, business, news reporting, and similar purposes can be translated into the viewer's own language or another language and then viewed.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

(Embodiment 1)
FIG. 1 is a block diagram showing the configuration of the video translation apparatus according to Embodiment 1 of the present invention. In FIG. 1, reference numeral 103 denotes video data in which video 102 shot with a shooting apparatus such as a digital video camera is recorded in association with position information 101 acquired by GPS when the video was shot. Here, the position information is information including the latitude and longitude, which change from moment to moment with the shooting location.

Reference numeral 104 denotes an audio information extraction unit that extracts audio from the video data 103. Reference numeral 107 denotes an audio analysis unit that analyzes and extracts speech, such as conversations and broadcasts, contained in the audio extracted by the audio information extraction unit 104. Reference numeral 105 denotes an image information extraction unit that extracts images from the video data 103. Reference numeral 108 denotes an image analysis unit that analyzes and extracts characters, such as those on signboards, contained in the images extracted by the image information extraction unit 105.

Reference numeral 106 denotes a position information extraction unit that extracts, from the video data 103, the position information used to identify the language type that serves as the translation input. Reference numeral 109 denotes an input language selection unit. It receives the position information extracted by the position information extraction unit 106 and the position language type DB 110, which accumulates the language types corresponding to position information, and it searches for and selects the language type that serves as the translation input. Reference numeral 111 denotes an output language selection unit, with which the user selects any language type as the translation output by means such as key input or voice input.

Reference numeral 112 denotes a speech translation unit that translates the speech extracted by the audio analysis unit 107 from the input language selected by the input language selection unit 109 into the output language selected by the output language selection unit 111. Reference numeral 114 denotes an audio information dubbing unit that dubs the audio extracted by the audio information extraction unit 104 with the speech translated by the speech translation unit 112. Reference numeral 113 denotes an image translation unit that translates the characters extracted by the image analysis unit 108 from the input language selected by the input language selection unit 109 into the output language selected by the output language selection unit 111. Reference numeral 115 denotes an image information superimposing unit that superimposes the characters translated by the image translation unit 113, as a telop (on-screen caption), on the image extracted by the image information extraction unit 105. Reference numeral 116 denotes a video output unit that combines the audio dubbed by the audio information dubbing unit 114 with the image superimposed by the image information superimposing unit 115 and outputs the result as translated video.

FIG. 2 is a diagram showing an example of the information stored in the video data according to Embodiment 1 of the present invention. Video data shot with a shooting apparatus that records GPS-acquired position information in association with the video consists of an image 201 and audio 202 for each shot scene, together with position information 203 that includes the latitude and longitude of the shooting location. When the shooting location moves into a country or region with a different language type, the language type of the audio 202 changes. For example, as shown in FIG. 2, the point at latitude 40 degrees north, longitude 74 degrees west is in the United States; the image 201 of scene 1 shot there contains English characters, and the audio 202 contains English speech. The point at latitude 43 degrees north, longitude 79 degrees west is in Canada; the image 201 of scene 2 shot there contains Canadian French characters, and the audio 202 contains Canadian French speech. Image information, audio information, and position information are each extracted from video data shot in this way.
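The record layout described for FIG. 2 can be pictured with a minimal sketch. The names `Scene` and `extract_streams` are hypothetical and not taken from the publication, byte strings stand in for real image and audio data, and writing west longitude as a negative number is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Scene:
    """One shot scene, mirroring the three parts described for FIG. 2."""
    image: bytes                    # stand-in for image 201
    audio: bytes                    # stand-in for audio 202
    position: Tuple[float, float]   # position information 203: (latitude, longitude)


def extract_streams(scenes: List[Scene]):
    """Split video data into image, audio, and position information."""
    images = [s.image for s in scenes]
    audio = [s.audio for s in scenes]
    positions = [s.position for s in scenes]
    return images, audio, positions


# West longitude is written as a negative number (an assumption of this sketch).
video_data = [
    Scene(image=b"scene1-image", audio=b"scene1-audio", position=(40.0, -74.0)),
    Scene(image=b"scene2-image", audio=b"scene2-audio", position=(43.0, -79.0)),
]
images, audio, positions = extract_streams(video_data)
```

Keeping the position next to each scene is what lets the language be re-evaluated scene by scene when the shooting location crosses into another region.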

For the video translation apparatus configured as described above, the operation of automatically determining the language used in the country or region indicated by the position information extracted from the video data, and of selecting the input language type, is described with reference to FIGS. 1 and 3. FIG. 3 is a diagram showing an example of the information stored in the position language type DB. First, the position language type DB 110 is searched based on the position information, expressed as a combination of latitude, longitude, and the like, extracted by the position information extraction unit 106. The position language type DB 110 records in advance the type of language used in the country or region indicated by each piece of position information. For example, as shown in FIG. 3, it accumulates data pairing position information with the type of language used there, such as the language type English for the position information latitude 40 degrees north, longitude 74 degrees west. The language type paired with the DB entry whose position information matches the position information extracted by the position information extraction unit 106 is then selected.
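The matching lookup described here can be sketched as a simple keyed table. This is only an illustration: the dictionary contents are limited to the example points mentioned in the text, the names `POSITION_LANGUAGE_DB` and `select_input_language` are hypothetical, and whole-degree keys with negative west longitude are assumptions of the sketch.

```python
from typing import Dict, Optional, Tuple

# Hypothetical contents of the position language type DB 110, using the
# examples given in the text. West longitude is negative (an assumption).
POSITION_LANGUAGE_DB: Dict[Tuple[int, int], str] = {
    (40, -74): "English",
    (43, -79): "Canadian French",
    (40, 135): "Japanese",
}


def select_input_language(latitude: int, longitude: int,
                          db: Dict[Tuple[int, int], str] = POSITION_LANGUAGE_DB
                          ) -> Optional[str]:
    """Return the language paired with a matching DB position, or None."""
    return db.get((latitude, longitude))
```

Returning `None` for an unknown position leaves the caller free to decide what to do when no language can be determined, a case the text does not address.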

Next, the operation of selecting the output language type is described with reference to FIG. 4. FIG. 4 is a diagram showing an example of the output language selection screen. The video translation apparatus of the present invention has an output language selection screen such as that shown in FIG. 4. The user arbitrarily selects the output language type either by choosing, through key input, voice input, or the like, from the language types displayed in the output language selection list 402, or by entering a language type directly in the output language input area 401.

Next, the operation when speech translation processing is executed is described with reference to FIGS. 1 and 5. FIG. 5 is a diagram showing an example of video output information translated using the video translation apparatus of Embodiment 1. The speech translation unit 112 translates the speech extracted by the audio analysis unit 107 from the input language selected by the input language selection unit 109 into the output language selected by the output language selection unit 111. For example, as shown in scene 1 of FIG. 5, when the input language type of the speech 505 contained in the audio 502 is English and the output language type is Japanese, the audio information dubbing unit 114 dubs the English speech "Welcome to Hollywood" extracted by the audio information extraction unit 104 with the Japanese speech "ようこそハリウッドへ" ("Welcome to Hollywood") translated by the speech translation unit 112. Similarly, in scene 2, the Canadian French speech "Bienvenue vers Canada" is dubbed with the Japanese speech "ようこそカナダへ" ("Welcome to Canada").

Next, the operation when the characters in the image are translated is described with reference to FIGS. 1 and 5. The image translation unit 113 translates the characters extracted by the image analysis unit 108 from the input language selected by the input language selection unit 109 into the output language selected by the output language selection unit 111. For example, as shown in scene 1 of FIG. 5, when the input language type of the characters 503 contained in the image 501 is English and the output language type is Japanese, the image information superimposing unit 115 superimposes the Japanese characters "ハリウッド" ("Hollywood") translated by the image translation unit 113, as a telop 504, on the image containing the English characters "HOLLYWOOD" extracted by the image information extraction unit 105. Similarly, in scene 2, the Japanese characters "カナダ" ("Canada") are superimposed as a telop 504 on the image containing the Canadian French characters "Canada".

Through the above processing, the video translation apparatus according to Embodiment 1 of the present invention automatically determines, from video data that carries position information corresponding to the video, the language used in the country or region indicated by the position information, and thereby selects the language that serves as the translation input. It can then translate the image and audio into the output language selected by the user and output the translation result as video information.
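The overall flow summarized above can be sketched as follows. This is a hedged illustration rather than the apparatus itself: speech recognition and character recognition are assumed to have already produced text, the translator is passed in as a plain function, and all names (`AnalyzedScene`, `TranslatedScene`, `translate_video`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Position = Tuple[int, int]
# (text, input_language, output_language) -> translated text
Translate = Callable[[str, str, str], str]


@dataclass
class AnalyzedScene:
    position: Position   # from the position information extraction unit 106
    speech: str          # speech recognized by the audio analysis unit 107
    characters: str      # characters recognized by the image analysis unit 108


@dataclass
class TranslatedScene:
    dubbed_speech: Optional[str]   # passed on to the dubbing unit 114
    telop: Optional[str]           # passed on to the superimposing unit 115


def translate_video(scenes: List[AnalyzedScene],
                    lookup_language: Callable[[Position], Optional[str]],
                    output_language: str,
                    translate: Translate) -> List[TranslatedScene]:
    result = []
    for scene in scenes:
        input_language = lookup_language(scene.position)   # unit 109 with DB 110
        if input_language is None:
            # No language known for this position: leave the scene untranslated.
            result.append(TranslatedScene(None, None))
            continue
        result.append(TranslatedScene(
            dubbed_speech=translate(scene.speech, input_language, output_language),
            telop=translate(scene.characters, input_language, output_language),
        ))
    return result


# Toy stand-ins so the sketch runs; a real system would plug in actual engines.
db = {(40, -74): "English"}
phrases = {("Welcome to Hollywood", "English", "Japanese"): "ようこそハリウッドへ",
           ("HOLLYWOOD", "English", "Japanese"): "ハリウッド"}
scenes = [AnalyzedScene((40, -74), "Welcome to Hollywood", "HOLLYWOOD")]
output = translate_video(scenes, db.get, "Japanese",
                         lambda text, src, dst: phrases[(text, src, dst)])
```

Because the input language is looked up per scene, the same loop handles footage that crosses from one language region into another.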

When audio information is translated, the speech of the user and of any companions is often in a language other than the input language type, so the apparatus has a function of not translating languages other than the input language type. Likewise, for characters contained in the image information, the apparatus has a function of not translating languages other than the input language type.
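A minimal sketch of this filtering follows. It assumes each speech or character segment already carries a detected language label; how that detection is done is not described in the text and is outside the sketch. The function name `translate_matching_segments` is hypothetical.

```python
from typing import Callable, List, Tuple


def translate_matching_segments(segments: List[Tuple[str, str]],
                                input_language: str,
                                translate: Callable[[str], str]) -> List[str]:
    """Translate only segments whose detected language is the input language.

    Each segment is (detected_language, text). How the language of a segment
    is detected is outside this sketch and is assumed to be given.
    """
    return [translate(text) if language == input_language else text
            for language, text in segments]


segments = [("English", "Welcome"), ("Japanese", "すごいね")]
# str.upper stands in for a real English-to-Japanese translator.
result = translate_matching_segments(segments, "English", str.upper)
```

In this example the companion's Japanese remark passes through unchanged while the English segment is handed to the translator.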

In the correspondence between position information and language type shown in FIG. 3, the latitude and longitude information is assumed to have numerical ranges, so that the correspondence between the position information and the shooting locations within each country or region can be maintained.
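One way to picture latitude and longitude entries that carry numerical ranges is a list of bounding boxes. The sketch below is an assumption-laden illustration: the range values are invented around the example points, rectangular ranges are a simplification, and the names `LanguageRegion` and `language_for` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class LanguageRegion(NamedTuple):
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    language: str


# Hypothetical ranges around the example points; real boundaries would differ.
REGIONS: List[LanguageRegion] = [
    LanguageRegion(39.5, 40.5, -74.5, -73.5, "English"),
    LanguageRegion(42.5, 43.5, -79.5, -78.5, "Canadian French"),
]


def language_for(latitude: float, longitude: float,
                 regions: List[LanguageRegion] = REGIONS) -> Optional[str]:
    """Return the language of the first region containing the point."""
    for r in regions:
        if r.lat_min <= latitude <= r.lat_max and r.lon_min <= longitude <= r.lon_max:
            return r.language
    return None
```

With ranges, a shooting location does not need to coincide exactly with a stored point to be mapped to a language.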

When the DB is configured to hold, for the position information of a single shooting location shown in FIG. 3, multiple candidate language types used in that country or region, the apparatus has a function of automatically determining, from among the candidates, the optimal language type that yields the highest translation accuracy.
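The text does not say how translation accuracy is measured. As a minimal sketch, assuming some per-language scoring function is available, the selection reduces to taking the highest-scoring candidate. The toy vocabulary-overlap score and the names `VOCABULARY`, `vocabulary_score`, and `pick_best_language` are purely illustrative.

```python
from typing import Callable, Dict, List, Set

# Toy vocabularies standing in for a real translation-accuracy estimate.
VOCABULARY: Dict[str, Set[str]] = {
    "English": {"welcome", "to", "canada"},
    "French": {"bienvenue", "au", "canada"},
}


def vocabulary_score(text: str, language: str) -> float:
    """Fraction of words in the text that appear in the language's vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in VOCABULARY[language] for word in words) / len(words)


def pick_best_language(candidates: List[str], text: str,
                       score: Callable[[str, str], float] = vocabulary_score) -> str:
    """Return the candidate with the highest score; ties go to the earliest."""
    return max(candidates, key=lambda language: score(text, language))
```

Passing the scoring function as a parameter keeps the selection logic independent of whichever accuracy measure an implementation actually uses.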

The video translation apparatus of the present invention is not limited to moving images. Images can be translated and output by the same means for still images shot with a digital still camera or the like.

By connecting the video translation apparatus of the present invention to, or incorporating it in, a shooting apparatus that records GPS-acquired position information in association with the video, and carrying the combination, the user can view the translated output video on the spot while shooting.

By implementing the video translation apparatus of the present invention as a personal computer program, the translated output video can be handed over to video editing software, for example Matsushita Electric Industrial's digital video editing software for personal computers "MotionDV STUDIO", or the program can be incorporated as a function of such software. This makes it possible to add and correct translated audio information and telop information in addition to conventional nonlinear editing.

The apparatus can also find wider use as a video communication tool between different countries, for example in translated video mail and videophone applications.

(Embodiment 2)
In Embodiment 1, the translation result was output as video information by dubbing the input-language speech contained in the audio with output-language speech. Embodiment 2 introduces a technique of superimposing the input-language speech contained in the audio on the image as text information such as subtitles.

FIG. 6 is a block diagram showing the configuration of the video translation apparatus according to Embodiment 2 of the present invention. In FIG. 6, 101 is position information, 102 is video, 103 is video data, 104 is an audio information extraction unit, 105 is an image information extraction unit, 106 is a position information extraction unit, 107 is an audio analysis unit, 108 is an image analysis unit, 109 is an input language selection unit, 110 is a position language type DB, 111 is an output language selection unit, 112 is a speech translation unit, 113 is an image translation unit, 615 is an audio/image information superimposing unit, and 116 is a video output unit. The position information, video, video data, audio information extraction unit, image information extraction unit, position information extraction unit, audio analysis unit, image analysis unit, input language selection unit, position language type DB, output language selection unit, speech translation unit, image translation unit, and video output unit operate in the same way as in Embodiment 1. The audio/image information superimposing unit 615 superimposes the characters translated by the image translation unit 113 as a telop, and also converts the speech translated by the speech translation unit 112 into text information such as subtitles and superimposes it on the image.

Next, the operation when audio and image translation processing is executed is described with reference to FIGS. 6 and 7. FIG. 7 is a diagram showing an example of video output information translated using the video translation apparatus of Embodiment 2. The speech translation unit 112 translates the speech extracted by the audio analysis unit 107 from the input language selected by the input language selection unit 109 into the output language selected by the output language selection unit 111. For example, as shown in scene 1 of FIG. 7, when the input language type is English and the output language type is Japanese, the audio/image information superimposing unit 615 converts the English speech "Welcome to Hollywood" extracted by the audio analysis unit 107 into the Japanese text "ようこそハリウッドへ" ("Welcome to Hollywood") translated by the speech translation unit 112, and superimposes it on the image as a subtitle 704. Similarly, in scene 2, the Canadian French speech "Bienvenue vers Canada" is converted into the Japanese text "ようこそカナダへ" ("Welcome to Canada") and superimposed on the image as a subtitle 704. The image is processed in the same way as in Embodiment 1, with the telop displayed at a position where it does not overlap the subtitle.
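The text requires only that the telop and the subtitle not overlap; it does not say where each is placed. As one possible sketch, assuming the telop goes in a strip at the top of the frame and the subtitle in a strip at the bottom, the placement and an overlap check might look like this. All names and the 40-pixel line height are hypothetical.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    x: int
    y: int
    width: int
    height: int


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area."""
    return (a.x < b.x + b.width and b.x < a.x + a.width and
            a.y < b.y + b.height and b.y < a.y + a.height)


def layout_text(frame_width: int, frame_height: int,
                line_height: int = 40) -> Tuple[Box, Box]:
    """Return (telop_box, subtitle_box) placed so they cannot overlap."""
    if frame_height < 2 * line_height:
        raise ValueError("frame too short for both a telop and a subtitle")
    telop = Box(0, 0, frame_width, line_height)                  # top strip
    subtitle = Box(0, frame_height - line_height,                # bottom strip
                   frame_width, line_height)
    return telop, subtitle
```

A fixed top and bottom split is the simplest choice; an implementation could instead place the telop near the original sign and move the subtitle only when the two collide, using the same `overlaps` test.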

In this way, in Embodiment 2, superimposing the input-language speech contained in the audio on the image as text information such as subtitles makes it possible to output video information whose translation result can be understood from the image alone.

(Embodiment 3)
In Embodiment 1, the translation result was output as video information by superimposing the input-language characters contained in the image on the image as a telop of output-language characters. Embodiment 3 introduces a technique of converting the input-language characters contained in the image into speech and mixing that speech into the audio.

FIG. 8 is a block diagram showing the configuration of the video translation apparatus according to Embodiment 3 of the present invention. In FIG. 8, 101 is position information, 102 is video, 103 is video data, 104 is an audio information extraction unit, 105 is an image information extraction unit, 106 is a position information extraction unit, 107 is an audio analysis unit, 108 is an image analysis unit, 109 is an input language selection unit, 110 is a position language type DB, 111 is an output language selection unit, 112 is a speech translation unit, 113 is an image translation unit, 814 is an audio/image information dubbing unit, and 116 is a video output unit. The position information, video, video data, audio information extraction unit, image information extraction unit, position information extraction unit, audio analysis unit, image analysis unit, input language selection unit, position language type DB, output language selection unit, speech translation unit, image translation unit, and video output unit operate in the same way as in Embodiment 1. The audio/image information dubbing unit 814 dubs the audio extracted by the audio information extraction unit 104 with the speech translated by the speech translation unit 112, and also converts the characters translated by the image translation unit 113 into speech and mixes it into the audio.

Next, the operation when audio and image translation processing is executed is described with reference to FIGS. 8 and 9. FIG. 9 is a diagram showing an example of video output information translated using the video translation apparatus of Embodiment 3. The image translation unit 113 translates the characters extracted by the image analysis unit 108 from the input language selected by the input language selection unit 109 into the output language selected by the output language selection unit 111. For example, as shown in scene 1 of FIG. 9, when the input language type is English and the output language type is Japanese, the audio/image information dubbing unit 814 takes the image containing the English characters "HOLLYWOOD" extracted by the image information extraction unit 105, converts the translation produced by the image translation unit 113 into Japanese voice guidance 903 such as "There is a sign reading 'ハリウッド' (Hollywood)", and mixes it into the audio. Similarly, in scene 2, the image containing the Canadian French characters "Canada" is converted into Japanese voice guidance 903 such as "There is a sign reading 'カナダ' (Canada)" and mixed into the audio. The audio is processed in the same way as in Embodiment 1, in such a manner that the voice guidance and the dubbed speech do not overlap.
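How the voice guidance is kept from overlapping the dubbed speech is not specified. One simple possibility, sketched below under the assumption that the dubbed speech is known as a list of (start, end) times in seconds, is to delay the guidance until the first gap long enough to hold it. The function name `schedule_guidance` is hypothetical.

```python
from typing import List, Tuple


def schedule_guidance(dub_intervals: List[Tuple[float, float]],
                      earliest: float, duration: float) -> float:
    """Return the first start time >= earliest at which guidance of the
    given duration fits without overlapping any dubbed-speech interval."""
    start = earliest
    for dub_start, dub_end in sorted(dub_intervals):
        if start + duration <= dub_start:
            break                       # fits in the gap before this interval
        start = max(start, dub_end)     # otherwise wait until it has ended
    return start
```

For dubbed speech at 1 to 3 seconds and 4 to 6 seconds, a 2-second guidance that becomes available at time 0 does not fit in either gap and is therefore scheduled at 6 seconds, whereas a 1-second guidance can start immediately.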

In this way, in Embodiment 3, mixing the input-language characters contained in the image into the audio as speech information such as voice guidance makes it possible to output video information whose translation result can be understood from the audio alone.

(Embodiment 4)
In Embodiments 1 to 3, the user arbitrarily selects the output language type. Embodiment 4 has output language selection means in which the current position information acquired by GPS is input to the output language selection unit, and the language used in the country or region where the user is currently using the video translation apparatus of the present invention is automatically selected as the output language.

FIG. 10 is a block diagram showing the configuration of the output language selection unit in Embodiment 4 of the present invention. In FIG. 10, 110 is the position language type DB, 111 is the output language selection unit, and 1001 is position information. Next, the operation when output language selection processing is executed is described with reference to FIGS. 3 and 10. For example, when the video translation apparatus of Embodiment 4 is operated at a point at 40 degrees north latitude and 135 degrees east longitude, the latitude and longitude are acquired by GPS as position information 1001, and, as shown in FIG. 3, Japanese is selected as the output language type from the data accumulated in the position language type DB 110.
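The lookup described above can be illustrated with a short sketch. The actual layout of the position language type DB 110 is given in FIG. 3, which is not reproduced here, so the table below is an assumption: each row is a hypothetical latitude/longitude bounding box paired with a language, and the box coordinates are illustrative values only.

```python
from typing import NamedTuple, Optional, Sequence


class Region(NamedTuple):
    """One assumed row of the position language type DB: a bounding box
    in degrees (north and east positive) and the language used there."""
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    language: str


# Illustrative rows only; these are not the contents of FIG. 3.
POSITION_LANGUAGE_DB = [
    Region(30.0, 46.0, 129.0, 146.0, "Japanese"),
    Region(32.0, 42.0, -125.0, -114.0, "English"),
]


def select_language(lat: float, lon: float,
                    db: Sequence[Region] = POSITION_LANGUAGE_DB) -> Optional[str]:
    """Return the language of the first region containing (lat, lon),
    or None when no row of the table covers the position."""
    for region in db:
        if (region.lat_min <= lat <= region.lat_max
                and region.lon_min <= lon <= region.lon_max):
            return region.language
    return None


if __name__ == "__main__":
    # Position information 1001 from the example: 40 degrees N, 135 degrees E.
    print(select_language(40.0, 135.0))
```

Returning `None` for an uncovered position leaves it to the caller to fall back to a manual choice, such as the output language selection screen used in Embodiments 1 to 3.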

Through the above processing, the output language selection unit in Embodiment 4 of the present invention can automatically determine the language used in the country or region indicated by the current position information and select it as the output language for translation.

The video translation apparatus of the present invention determines the language of the country or region where the video was shot from the continuously changing position information contained in the recorded video data, and can translate the video into any other language. Video data shot at home or abroad for travel, business, news reporting, and similar purposes can thus be translated into the viewer's own language or another language for viewing.
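Because the position information changes over the course of a recording, the input language can change from scene to scene. The sketch below shows one possible way to turn a time-ordered series of per-sample language decisions into contiguous segments that a translator could process one language at a time. The `(timestamp, language)` sample format and the function name are assumptions made for illustration and do not appear in the patent.

```python
from itertools import groupby
from typing import List, Tuple

Sample = Tuple[float, str]            # (timestamp in seconds, input language)
Segment = Tuple[float, float, str]    # (first timestamp, last timestamp, language)


def language_segments(samples: List[Sample]) -> List[Segment]:
    """Group consecutive samples that share a language into segments.
    The samples are expected to be sorted by timestamp already."""
    segments = []
    for language, run in groupby(samples, key=lambda s: s[1]):
        run = list(run)
        segments.append((run[0][0], run[-1][0], language))
    return segments


if __name__ == "__main__":
    track = [
        (0.0, "English"),
        (1.0, "English"),
        (2.0, "Canadian French"),
        (3.0, "Canadian French"),
        (4.0, "English"),
    ]
    for segment in language_segments(track):
        print(segment)
```

Only consecutive samples are merged, so a language that reappears later in the recording starts a new segment rather than being folded into the earlier one.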

FIG. 1 is a block diagram showing the configuration of the video translation apparatus in Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of the information stored in the video data.
FIG. 3 is a diagram showing an example of the information stored in the position language type DB.
FIG. 4 is a diagram showing an example of the output language selection screen.
FIG. 5 is a diagram showing an example of the effect of Embodiment 1.
FIG. 6 is a block diagram showing the configuration of the video translation apparatus in Embodiment 2 of the present invention.
FIG. 7 is a diagram showing an example of the effect of Embodiment 2.
FIG. 8 is a block diagram showing the configuration of the video translation apparatus in Embodiment 3 of the present invention.
FIG. 9 is a diagram showing an example of the effect of Embodiment 3.
FIG. 10 is a block diagram showing the configuration of the output language selection unit in Embodiment 4 of the present invention.

Explanation of symbols

101 Position information
102 Video
103 Video data
104 Audio information extraction unit
105 Image information extraction unit
106 Position information extraction unit
107 Audio analysis unit
108 Image analysis unit
109 Input language selection unit
110 Position language type DB
111 Output language selection unit
112 Speech translation unit
113 Image translation unit
114 Audio information dubbing unit
115 Image information superimposing unit
116 Video output unit
201 Image
202 Audio
203 Position information
401 Output language input area
402 Output language selection list
501 Image
502 Audio
503 Characters included in image
504 Telop
505 Voice
615 Audio/image information superimposing unit
814 Audio/image information dubbing unit
903 Audio guidance
1001 Position information

Claims

1. A video translation apparatus comprising: extraction means for extracting a position from video data including position information; input language selection means for selecting an input language corresponding to the position extracted by the extraction means; output language selection means by which a user arbitrarily selects another type of language as an output language; and translation means for performing video translation processing based on the input language selected by the input language selection means and the output language selected by the output language selection means.

2. The video translation apparatus according to claim 1, further comprising: voice extraction means for extracting audio from the video data; analysis means for analyzing the voice included in the audio extracted by the voice extraction means; and speech translation means for translating the voice analyzed by the analysis means based on the output language selected by the output language selection means.

3. The video translation apparatus according to claim 2, further comprising output means for outputting a video in which the voice included in the original video is dubbed with the voice translated by the speech translation means.

4. The video translation apparatus according to claim 2, further comprising output means for outputting a video in which the voice, translated by the speech translation means into the output language selected by the output language selection means, is superimposed as a telop on the original video.

5. The video translation apparatus according to claim 1, further comprising: image extraction means for extracting an image from the video data; and character analysis means for analyzing characters included in the image extracted by the image extraction means, wherein the translation means translates the characters analyzed by the character analysis means.

6. The video translation apparatus according to claim 5, further comprising output means for outputting a video in which characters translated by the translation means are superimposed as a telop on the original video.

7. The video translation apparatus according to claim 5, further comprising output means for converting the characters translated by the translation means into audio based on the output language selected by the output language selection means, and outputting a video in which that audio is added to the original video.

8. The video translation apparatus according to claim 1, wherein the position information extracted by the extraction means is acquired by GPS.