JP2012074961A

JP2012074961A - Video shooting device and method of adding captions to audio

Info

Publication number: JP2012074961A
Application number: JP2010218946A
Authority: JP
Inventors: Shinichi Itamoto; 真一板本
Original assignee: NEC Casio Mobile Communications Ltd
Current assignee: NEC Casio Mobile Communications Ltd
Priority date: 2010-09-29
Filing date: 2010-09-29
Publication date: 2012-04-12

Abstract

PROBLEM TO BE SOLVED: To add captions to audio in a user-desired language. SOLUTION: A video shooting device according to the present invention comprises: a camera module configured to shoot a video of a subject; a microphone configured to collect audio in the neighborhood of the subject while the camera module shoots a video; an audio-to-text conversion module configured to convert audio data of audio collected by the microphone into text data; a text translation module configured to translate text data included in the text data converted by the audio-to-text conversion module and written in a language other than a predetermined language into text data in the predetermined language; and a video text superimposition module configured to superimpose the text data in the predetermined language converted by the audio-to-text conversion module and the text data in the predetermined language translated by the text translation module on video data of the video shot by the camera module as captions.

Description

The present invention relates to a video shooting device and an audio captioning method.

On television and similar media, performers' comments are often displayed as subtitles.

However, for a general user to add subtitles to a video shot with a video shooting device that has a video shooting function, such as a digital camera or a mobile phone, the subtitles had to be added using a personal computer or the like, which was very troublesome work.

Recently, therefore, techniques have been proposed in which a video shooting device converts audio data collected during video shooting into text data and combines the converted text data with the video data as subtitles (for example, Patent Documents 1, 2, and 3).

JP 2004-120279 A
JP 2006-197115 A
JP 2007-027990 A

However, the techniques described in Patent Documents 1, 2, and 3 do not specifically mention whether the audio data is in Japanese or another language, so the audio data of a person speaking English would presumably be converted into English text data as is.

In that case, for a user who is not proficient in English, the subtitles remain unintelligible even though the audio data has been captioned, which is very inconvenient. It would therefore be convenient if audio could be captioned in the language the user desires.

Accordingly, an object of the present invention is to solve the above problem and to provide a video shooting device and an audio captioning method capable of captioning audio in a user-desired language.

The video shooting device of the present invention has:
a camera unit that shoots a video of a subject;
a microphone that collects audio around the subject while the camera unit shoots the video;
a speech-to-text unit that converts audio data of the audio collected by the microphone into text data;
a text translation unit that translates, among the text data converted by the speech-to-text unit, text data in a language other than a predetermined language into text data in the predetermined language; and
a video text composition unit that combines the text data in the predetermined language converted by the speech-to-text unit and the text data in the predetermined language translated by the text translation unit, as subtitles, with the video data of the video shot by the camera unit.

The audio captioning method of the present invention is an audio captioning method performed by a video shooting device, and has the steps of:
shooting a video of a subject;
collecting audio around the subject while the video is being shot;
converting audio data of the collected audio into text data;
translating, among the converted text data, text data in a language other than a predetermined language into text data in the predetermined language; and
combining the converted text data in the predetermined language and the translated text data in the predetermined language, as subtitles, with the video data of the shot video.

Because the present invention is configured as described above, the user can have audio captioned in a desired language.

FIG. 1 is a block diagram showing the configuration of the video shooting device of the first and second embodiments of the present invention.
FIG. 2 is a sequence diagram explaining the operation of the video shooting device of the first embodiment of the present invention during video shooting.
FIG. 3 is a sequence diagram explaining the operation of the video shooting device of the first embodiment of the present invention when text data is edited.
FIG. 4 is a sequence diagram explaining the operation of the video shooting device of the first embodiment of the present invention during video playback.
FIG. 5 is a sequence diagram explaining the operation of the video shooting device of the second embodiment of the present invention during video shooting.

Embodiments for carrying out the present invention are described below with reference to the drawings.
(1) First Embodiment
FIG. 1 is a block diagram showing the configuration of the video shooting device of the first embodiment of the present invention.

As shown in FIG. 1, the video shooting device of this embodiment has a control unit 101, an operation unit 102, a video shooting unit 103, a storage unit 106, a speech-to-text unit 107, a text translation unit 108, a text editing unit 109, a video text composition unit 110, a display 111, and a speaker 112.

The control unit 101 controls the components of the video shooting device and executes various kinds of processing.

The operation unit 102 receives various operations performed by the user.

The video shooting unit 103 includes a camera unit 104 and a microphone 105.

The camera unit 104 shoots a video of a subject.

The microphone 105 collects audio around the subject while the camera unit 104 shoots the video.

The storage unit 106 stores various kinds of data.

The speech-to-text unit 107 performs speech recognition on the audio data of the audio collected by the microphone 105 and converts it into text data.

The text translation unit 108 translates, among the text data converted by the speech-to-text unit 107, text data in another language (for example, English) other than a predetermined language (for example, Japanese) into text data in the predetermined language.

The text editing unit 109 edits the text data in the predetermined language converted by the speech-to-text unit 107 and the text data in the predetermined language translated by the text translation unit 108.

The video text composition unit 110 combines the text data in the predetermined language converted by the speech-to-text unit 107 and the text data in the predetermined language translated by the text translation unit 108, as subtitles, with the video data of the video shot by the camera unit 104.

If the text editing unit 109 has edited the text data, the video text composition unit 110 combines the edited text data with the video data.

The display 111 is a display unit that displays the video data shot by the camera unit 104 and the text data in the predetermined language combined with that video data as subtitles.

The speaker 112 outputs, as sound, the audio data of the audio collected by the microphone 105.

The operation of the video shooting device of this embodiment is described below. The description assumes that the user has set the predetermined language to Japanese via the operation unit 102.

First, the operation during video shooting is described with reference to FIG. 2.

As shown in FIG. 2, when the video shooting unit 103 starts video shooting (step A101), it requests the storage unit 106 to write “video data” (data of the video shot by the camera unit 104; the same applies hereinafter) and “audio data” (data of the audio collected by the microphone 105 during video shooting; the same applies hereinafter), each with “synchronization data” (data needed for synchronization, such as the elapsed time from the start of shooting; the same applies hereinafter) attached (step A102), and the storage unit 106 writes the “video data” and “audio data”, each with “synchronization data” attached (step A103).
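As an illustration only, the attachment of synchronization data in steps A102 and A103 might be sketched in Python as follows. The names `SyncedChunk` and `Storage`, the millisecond timestamp, and the byte payloads are assumptions made for this sketch; the specification only says that synchronization data such as the elapsed time from the start of shooting is attached.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SyncedChunk:
    """A piece of recorded data tagged with synchronization data."""
    elapsed_ms: int  # synchronization data: time since shooting started (assumed unit)
    kind: str        # "video" or "audio"
    payload: bytes   # placeholder for the actual frame or sample data


@dataclass
class Storage:
    """Hypothetical stand-in for the storage unit 106."""
    chunks: list = field(default_factory=list)

    def write(self, chunk: SyncedChunk) -> None:
        self.chunks.append(chunk)

    def read(self, kind: str) -> list:
        # Return chunks of one kind in timestamp order.
        selected = [c for c in self.chunks if c.kind == kind]
        return sorted(selected, key=lambda c: c.elapsed_ms)


storage = Storage()
storage.write(SyncedChunk(40, "video", b"frame-1"))
storage.write(SyncedChunk(0, "video", b"frame-0"))
storage.write(SyncedChunk(0, "audio", b"pcm-0"))
video_chunks = storage.read("video")
audio_chunks = storage.read("audio")
```

Because both streams carry the same kind of timestamp, later stages can line up text, video, and audio without relying on the order in which they were written.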

When the video shooting unit 103 finishes video shooting (step A104), it notifies the speech-to-text unit 107 to that effect (step A105).

On receiving the end-of-shooting notification from the video shooting unit 103, the speech-to-text unit 107 requests the storage unit 106 to read the “audio data” with “synchronization data” attached (step A106), and the storage unit 106 reads the “audio data” with “synchronization data” attached and passes it to the speech-to-text unit 107 (step A107).

When the “audio data” with “synchronization data” attached is passed from the storage unit 106, the speech-to-text unit 107 performs speech recognition on the “audio data” to convert it into “text data” (step A108), and passes the “text data” with “synchronization data” attached to the text translation unit 108 (step A109). Several techniques for converting audio data into text data by speech recognition are already known, and any of them can be used in this embodiment.

When the “text data” with “synchronization data” attached is passed from the speech-to-text unit 107, the text translation unit 108 translates into Japanese the portions of the “text data” written in a language other than Japanese (step A110). Several translation techniques are already known, and any of them can be used in this embodiment.

At this point, it is not necessary to translate every other-language word in the “text data” into Japanese. For example, the “text data” may be divided into sentences, and a sentence that contains no Japanese at all (that is, a sentence composed entirely of a language other than Japanese) may be treated as a “portion written in a language other than Japanese”, with only such sentences translated into Japanese. Several techniques for dividing text data into sentences are already known, and any of them can be used in this embodiment.
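The sentence-level rule above might be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the specification: the Unicode-range test for “contains Japanese” is a rough heuristic (the ideograph range also matches Chinese text), the regular-expression splitter is naive (it would also split on a decimal point), and `fake_translate` is a hypothetical placeholder for whatever translation technique is actually used.

```python
import re

# Hiragana, katakana, and CJK unified ideographs (rough assumption for "Japanese").
_JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")
# Split right after a Japanese or Western sentence terminator, dropping trailing spaces.
_SENTENCE_END = re.compile(r"(?<=[。．！？.!?])\s*")


def split_sentences(text):
    """Divide text into sentences, keeping each terminator with its sentence."""
    return [s for s in _SENTENCE_END.split(text) if s]


def contains_japanese(sentence):
    return _JAPANESE.search(sentence) is not None


def translate_non_japanese(text, translate):
    """Translate only sentences that contain no Japanese at all."""
    result = []
    for sentence in split_sentences(text):
        if contains_japanese(sentence):
            result.append(sentence)             # left as is
        else:
            result.append(translate(sentence))  # wholly non-Japanese sentence
    return result


def fake_translate(sentence):
    # Placeholder: marks the sentence instead of really translating it.
    return "<ja:" + sentence + ">"


captions = translate_non_japanese("こんにちは。Hello there. 元気ですか？", fake_translate)
mixed = translate_non_japanese("これはpenです。", fake_translate)
```

A sentence that mixes Japanese with a foreign word, such as the second input, is left untouched because it contains at least some Japanese.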

The text translation unit 108 then requests the storage unit 106 to write the translated “text data” with “synchronization data” attached (step A111), and the storage unit 106 writes the translated “text data” with “synchronization data” attached (step A112), which ends the processing. For the portions of the “text data” that it did not translate, the text translation unit 108 passes them to the storage unit 106 exactly as received from the speech-to-text unit 107.
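Steps A106 to A112 form a batch pipeline that runs once shooting has ended. A hedged sketch of that flow is given below; the dictionary used as storage, the `Synced` record, and the stub `recognize`, `needs_translation`, and `translate` callables are all assumptions for illustration, since the specification leaves the recognition and translation techniques open.

```python
from dataclasses import dataclass


@dataclass
class Synced:
    elapsed_ms: int  # synchronization data (assumed to be milliseconds)
    data: str        # stands in for audio data or text data


def caption_after_shooting(storage, recognize, needs_translation, translate):
    """Read stored audio, convert it to text, translate where needed, store the text."""
    audio = storage["audio"]                                          # steps A106-A107
    texts = [Synced(a.elapsed_ms, recognize(a.data)) for a in audio]  # steps A108-A109
    result = []
    for t in texts:                                                   # step A110
        if needs_translation(t.data):
            result.append(Synced(t.elapsed_ms, translate(t.data)))
        else:
            result.append(t)  # untranslated portions are passed through unchanged
    storage["text"] = result                                          # steps A111-A112
    return result


# Stub components, for illustration only.
recognize = {"a0": "こんにちは。", "a1": "Hello."}.get
needs_translation = lambda s: s.isascii()
translate = lambda s: "[ja] " + s

storage = {"audio": [Synced(0, "a0"), Synced(1500, "a1")]}
stored_text = caption_after_shooting(storage, recognize, needs_translation, translate)
```

The synchronization data is copied from each audio record to its text record, so the stored text can later be aligned with the video.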

Next, the operation when text data is edited is described with reference to FIG. 3.

As shown in FIG. 3, when editing the “text data” stored in the storage unit 106, the text editing unit 109 requests the storage unit 106 to read the “text data” with “synchronization data” attached (step B101), and the storage unit 106 reads the “text data” with “synchronization data” attached and passes it to the text editing unit 109 (step B102).

When the “text data” with “synchronization data” attached is passed from the storage unit 106, the text editing unit 109 edits the “text data” (step B103) and requests the storage unit 106 to write the edited “text data” with “synchronization data” attached (step B104), and the storage unit 106 writes the edited “text data” with “synchronization data” attached (step B105), which ends the processing.
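The point of the editing sequence is that the text changes while its synchronization data stays attached. One way to sketch this, assuming a hypothetical immutable `Caption` record with a millisecond timestamp, is:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Caption:
    elapsed_ms: int  # synchronization data, left untouched by editing
    text: str


def edit_caption(captions, index, new_text):
    """Return a new caption list with one text replaced and its timestamp preserved."""
    edited = list(captions)
    edited[index] = replace(edited[index], text=new_text)
    return edited


original = [Caption(0, "こんにちわ"), Caption(1500, "元気ですか")]
edited = edit_caption(original, 0, "こんにちは")
```

Returning a new list rather than modifying the stored one mirrors the read, edit, and write-back steps B101 to B105, although the specification does not prescribe any particular data structure.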

Next, the operation during video playback is described with reference to FIG. 4.

As shown in FIG. 4, when playing back a video, the video text composition unit 110 requests the storage unit 106 to read the “audio data”, “video data”, and “text data”, each with “synchronization data” attached (step C101), and the storage unit 106 reads the “audio data”, “video data”, and “text data”, each with “synchronization data” attached, and passes them to the video text composition unit 110 (step C102).

When the “audio data”, “video data”, and “text data”, each with “synchronization data” attached, are passed from the storage unit 106, the video text composition unit 110 uses the “synchronization data” to combine the “text data” with the “video data” as subtitles and thereby generates “subtitled video data”. It displays the “subtitled video data” on the display 111 and, again using the “synchronization data”, outputs the “audio data” from the speaker 112 in synchronization with the “subtitled video data” (step C103).
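How the synchronization data selects the subtitle for a given moment of playback is not detailed in the specification. A minimal sketch, assuming each caption carries a start time in milliseconds and remains on screen until the next caption starts, could look like this (`caption_at` is a hypothetical helper):

```python
import bisect


def caption_at(captions, playback_ms):
    """Return the text of the latest caption that has started by playback_ms.

    captions: list of (start_ms, text) tuples sorted by start_ms.
    Returns None if no caption has started yet.
    """
    starts = [start for start, _ in captions]
    index = bisect.bisect_right(starts, playback_ms) - 1
    return captions[index][1] if index >= 0 else None


captions = [(0, "こんにちは。"), (1500, "元気ですか？")]
first = caption_at(captions, 1499)
second = caption_at(captions, 1500)
```

The same lookup, applied to the timestamp of each video frame, would yield the text to draw on that frame.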

When the “subtitled video data” is to be stored in the storage unit 106, the video text composition unit 110 requests the storage unit 106 to write the “subtitled video data” with “synchronization data” attached (step C104), and the storage unit 106 writes the “subtitled video data” with “synchronization data” attached (step C105), which ends the processing.
(2) Second Embodiment
The video shooting device of this embodiment has the same configuration as that of the first embodiment but operates differently.

That is, whereas the first embodiment displays the subtitles obtained by converting the audio data into text on the display 111 during video playback, this embodiment displays them on the preview screen of the display 111 during video shooting.

The operation of the video shooting device of this embodiment during video shooting is described below with reference to FIG. 5. The description assumes that the user has set the predetermined language to Japanese via the operation unit 102.

As shown in FIG. 5, when the video shooting unit 103 starts video shooting (step D101), it requests the storage unit 106 to write the “video data” and “audio data”, each with “synchronization data” attached (step D102), and the storage unit 106 writes the “video data” and “audio data”, each with “synchronization data” attached (step D103).

At the same time as it makes the above write request to the storage unit 106, the video shooting unit 103 passes the “audio data” with “synchronization data” attached to the speech-to-text unit 107 (step D104).

When the “audio data” with “synchronization data” attached is passed from the video shooting unit 103, the speech-to-text unit 107 converts the “audio data” into “text data” (step D105) and passes the “text data” with “synchronization data” attached to the text translation unit 108 (step D106).

When the “text data” with “synchronization data” attached is passed from the speech-to-text unit 107, the text translation unit 108 translates into Japanese the portions of the “text data” written in a language other than Japanese (step D107) and passes the translated “text data” to the video text composition unit 110 (step D108). For the portions of the “text data” that it did not translate, the text translation unit 108 passes them to the video text composition unit 110 exactly as received from the speech-to-text unit 107.

When the “text data” is passed from the text translation unit 108, the video text composition unit 110 combines the “text data” with the “video data” as subtitles and displays the result on the preview screen of the display 111 (step D109). At this time, the “audio data” is output from the speaker 112 without passing through the video text composition unit 110.

At the same time as it passes the translated “text data” to the video text composition unit 110, the text translation unit 108 requests the storage unit 106 to write the translated “text data” with “synchronization data” attached (step D110), and the storage unit 106 writes the translated “text data” with “synchronization data” attached (step D111). For the portions of the “text data” that it did not translate, the text translation unit 108 passes them to the storage unit 106 exactly as received from the speech-to-text unit 107.
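In contrast to the first embodiment, the second embodiment processes audio as it arrives and sends each translated text both to the preview and to storage. The following sketch illustrates that fan-out under simplifying assumptions: chunks are handled one at a time in a loop, plain lists stand in for the storage unit 106 and the preview screen, and `recognize`, `needs_translation`, and `translate` are hypothetical stubs.

```python
def caption_while_shooting(audio_chunks, recognize, needs_translation, translate):
    """Process (elapsed_ms, audio) pairs one by one while shooting is in progress."""
    stored_audio, stored_text, preview = [], [], []
    for elapsed_ms, audio in audio_chunks:
        stored_audio.append((elapsed_ms, audio))  # steps D102-D103: store the audio
        text = recognize(audio)                   # steps D104-D105: speech to text
        if needs_translation(text):               # step D107: translate if not Japanese
            text = translate(text)
        preview.append((elapsed_ms, text))        # steps D108-D109: show on the preview
        stored_text.append((elapsed_ms, text))    # steps D110-D111: store the text
    return stored_audio, stored_text, preview


# Stub components, for illustration only.
recognize = {"a0": "Good morning.", "a1": "おはよう。"}.get
needs_translation = lambda s: s.isascii()
translate = lambda s: "[ja] " + s

stored_audio, stored_text, preview = caption_while_shooting(
    [(0, "a0"), (2000, "a1")], recognize, needs_translation, translate
)
```

Storing the text during shooting is what allows the editing and playback operations of the first embodiment to be reused unchanged afterwards.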

Thereafter, when the video shooting unit 103 finishes video shooting (step D112), the processing ends.

In this embodiment, the operations for editing text data and for playing back video are the same as in the first embodiment, so their description is omitted (see FIG. 3 and FIG. 4, respectively).
(3) Other Embodiments
The present invention has been described above with reference to the embodiments, but the present invention is not limited to the above embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

For example, in the first and second embodiments, when the video data and the text data are displayed on the display 111, the video data may be displayed on the entire display screen of the display 111 with the text data superimposed on it. Alternatively, the display screen of the display 111 may be divided into two screens, with the video data displayed on one and the text data displayed on the other.
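The two display options might be expressed as a small layout function. This is only a sketch: the specification does not say where superimposed text is placed or how the screen is divided, so the bottom-quarter text band and the top and bottom halves used here are assumptions, as are the function name `layout` and the mode names.

```python
def layout(screen_w, screen_h, mode):
    """Return (x, y, width, height) rectangles for the video and text regions."""
    if mode == "overlay":
        # Video fills the screen; text is drawn over the bottom quarter (assumed).
        band_top = screen_h * 3 // 4
        video = (0, 0, screen_w, screen_h)
        text = (0, band_top, screen_w, screen_h - band_top)
    elif mode == "split":
        # Screen divided into two regions, here top and bottom halves (assumed).
        half = screen_h // 2
        video = (0, 0, screen_w, half)
        text = (0, half, screen_w, screen_h - half)
    else:
        raise ValueError("unknown mode: " + mode)
    return {"video": video, "text": text}


overlay = layout(640, 480, "overlay")
split = layout(640, 480, "split")
```

In the overlay case the two rectangles intersect, while in the split case they tile the screen without overlapping.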

In the first and second embodiments, when the text data is displayed on the display 111, text data that was already in Japanese when converted by the speech-to-text unit 107 and text data that was translated into Japanese by the text translation unit 108 may be displayed in different colors or fonts.
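Distinguishing the two kinds of text requires remembering, for each caption, whether it passed through translation. A brief sketch of that idea follows; the `StyledCaption` record, the `STYLES` table, and the specific colors and font names are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyledCaption:
    text: str
    color: str
    font: str


# Style per origin: False = already in the predetermined language, True = translated.
STYLES = {
    False: ("white", "regular"),
    True: ("yellow", "italic"),
}


def style_caption(text, was_translated):
    """Pick a color and font according to whether the text was translated."""
    color, font = STYLES[was_translated]
    return StyledCaption(text, color, font)


native = style_caption("こんにちは。", was_translated=False)
translated = style_caption("おはようございます。", was_translated=True)
```

A viewer can then tell at a glance which subtitles reflect the speaker's own words and which are machine translations.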

The first and second embodiments describe an example in which the predetermined language is Japanese and the audio data of a person speaking a language other than Japanese is converted into Japanese text data, but the predetermined language can of course be another language. For example, if the predetermined language is English, the audio data of a person speaking a language other than English can be converted into English text data.

As described above, the video shooting device of the present invention converts the audio data collected during video shooting into text data, translates the text data in a language other than the predetermined language into text data in the predetermined language, and then combines the text data with the video data as subtitles.

As a result, audio data in a language other than the predetermined language is translated into text data in the predetermined language before being combined with the video data.

Therefore, by setting the desired language as the predetermined language, the user can have audio captioned in the desired language.

The present invention is applicable to video shooting devices equipped with a video shooting function, such as video cameras, digital cameras, mobile phones, PHS (Personal Handyphone System) terminals, and PDAs (Personal Digital Assistants).

DESCRIPTION OF SYMBOLS
101 Control unit
102 Operation unit
103 Video shooting unit
104 Camera unit
105 Microphone
106 Storage unit
107 Speech-to-text unit
108 Text translation unit
109 Text editing unit
110 Video text composition unit
111 Display
112 Speaker

Claims

1. A video shooting device comprising:
a camera unit that shoots a video of a subject;
a microphone that collects audio around the subject while the camera unit shoots the video;
a speech-to-text unit that converts audio data of the audio collected by the microphone into text data;
a text translation unit that translates, among the text data converted by the speech-to-text unit, text data in a language other than a predetermined language into text data in the predetermined language; and
a video text composition unit that combines the text data in the predetermined language converted by the speech-to-text unit and the text data in the predetermined language translated by the text translation unit, as subtitles, with the video data of the video shot by the camera unit.

2. The video shooting device, wherein the video text composition unit combines the text data in the predetermined language with the video data as subtitles when the video shot by the camera unit is played back, and displays the video data and the text data in the predetermined language.

3. The video shooting device, wherein the video text composition unit combines the text data in the predetermined language with the video data as subtitles when the camera unit shoots a video, and displays the video data and the text data in the predetermined language on a preview screen.

4. The video shooting device according to claim 2 or 3, wherein the video text composition unit, when displaying the video data and the text data in the predetermined language, divides the display screen into two screens, displays the video data on one screen, and displays the text data in the predetermined language on the other screen.

5. The video shooting device according to the item, wherein the video text composition unit, when displaying the video data and the text data in the predetermined language, displays the text data in the predetermined language converted by the speech-to-text unit and the text data in the predetermined language translated by the text translation unit in different colors.

6. The video shooting device according to the item, wherein the video text composition unit, when displaying the video data and the text data in the predetermined language, displays the text data in the predetermined language converted by the speech-to-text unit and the text data in the predetermined language translated by the text translation unit in different fonts.

7. An audio captioning method performed by a video shooting device, comprising the steps of:
shooting a video of a subject;
collecting audio around the subject while the video is being shot;
converting audio data of the collected audio into text data;
translating, among the converted text data, text data in a language other than a predetermined language into text data in the predetermined language; and
combining the converted text data in the predetermined language and the translated text data in the predetermined language, as subtitles, with the video data of the shot video.