JP4599244B2

JP4599244B2 - Apparatus and method for creating subtitles from moving image data, program, and storage medium

Info

Publication number: JP4599244B2
Application number: JP2005204736A
Authority: JP
Inventors: 恵弘倉片
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-07-13
Filing date: 2005-07-13
Publication date: 2010-12-15
Anticipated expiration: 2025-07-13
Also published as: JP2007027990A

Description

The present invention relates to a technique for automatically generating and editing captions from moving image data.

In recent years, devices capable of shooting moving images as digital data, such as digital video cameras and digital cameras, have become widespread. As a result, more and more people edit captured images on the digital video camera or digital camera itself, or process captured video to enjoy original footage. An increasing number of people also import captured video into a personal computer, edit it there, and create original videos by compositing titles or adding superimposed captions.

Meanwhile, in public television broadcasting, a speaker's lines are often displayed as superimposed captions for dramatic effect, and a growing number of broadcasters also provide a service that displays a speaker's lines as superimposed captions for people who are hard of hearing.

Thus, there is a growing need to display a speaker's lines as superimposed captions overlaid on captured video data.

When a speaker's lines are added to captured video as captions, the usual approach is to enter the lines as text data in editing software and have them displayed either as a speech balloon near the speaker or, as in films, as superimposed captions at the bottom of the screen.

This editing work is time consuming. The editor must play the video, type what he or she hears into the editing software as text data, and adjust each phrase so that it is displayed only for the time that phrase takes to play. Furthermore, to change the display position or color of the captions depending on who is speaking, the editor must determine the speaker and specify the position, color, and so on individually, which makes editing very difficult and time consuming.

As techniques for performing this work easily and efficiently, the following published techniques could be applied. One proposed method detects a face area in a captured image and displays text data created in advance as a speech balloon near the mouth of the detected face (for example, claim 10 of Patent Document 1). Another associates each speaker with a voice input terminal such as a microphone, automatically converts speech to text, detects the speaker's face, and displays the converted text data as a speech balloon near the speaker's face (for example, claim 2 of Patent Document 2). As supporting techniques, the following have also been proposed: a method for identifying a specific face from the feature amounts of a face area (for example, Patent Document 3); a method that extracts feature amounts contained in input speech and performs pattern matching against pre-registered voice feature amounts (for example, Patent Document 4); and a method that converts input speech to text to produce meeting minutes (for example, Patent Document 5).
Patent Document 1: JP 2002-176619 A
Patent Document 2: JP 2003-339034 A
Patent Document 3: JP H08-063597 A
Patent Document 4: JP H06-083382 A
Patent Document 5: JP H08-194492 A

However, although the above techniques make it possible to create text data from a speaker's voice data and display it as a speech balloon near the speaker's face in order to create captions easily, they do not associate the speaker's voice with the speaker's face, so the speaker must be specified in advance. Because editing to display captions near the speaker's face took place only after the speaker had been specified in advance, it was not possible to automatically determine the speaker and attach a speech balloon to that speaker. In other words, the editing work always required the speaker to be specified first and the editing to be done afterward. An object of the present invention is to solve this problem.

To solve the above problem, the present invention provides an apparatus for creating captions from original moving image data including images and audio, comprising: face detection means for detecting a facial feature amount from the image portion of the original moving image data; voice identification means for detecting a voice feature amount from the audio portion of the original moving image data; speaker specifying means for specifying a speaker by comparing the facial feature amount detected by the face detection means and the voice feature amount detected by the voice identification means with a voice feature amount for identifying a speaker's voice and a facial feature amount for identifying that speaker's face, both prepared in advance; position specifying means for specifying the face position of the specified speaker; voice recognition means for recognizing a character string from the voice of the specified speaker and generating text data of the character string; balloon creation means for creating, based on the face position obtained by the position specifying means and the text data generated by the voice recognition means, balloon data for displaying within a display screen the text data of the character string uttered by the specified speaker; and moving image creation means for creating new moving image data by adding the balloon data to the original moving image data. The balloon creation means includes balloon editing means for displaying a balloon editing screen for editing, for the balloon data created by the balloon creation means, at least one of the shape, color, pattern, and size of the balloon and the color, size, and typeface of the characters corresponding to the specified speaker. The balloon editing screen includes an image display area for displaying the new moving image data, a text display area for editing the balloon data, a voice recognition operation section for executing voice recognition by the voice recognition means, and a playback operation section for playing back the audio. When the speaker specifying means has recognized the speaker's voice but cannot recognize the face, or when the speaker has left the display screen, the balloon creation means creates, instead of balloon data corresponding to the speaker's face position, data for displaying only the character string as a superimposed caption in an area at the bottom of the display screen.

The present invention also provides a method for creating captions from original moving image data including images and audio, comprising: a face detection step of detecting a facial feature amount from the image portion of the original moving image data; a voice identification step of detecting a voice feature amount from the audio portion of the original moving image data; a speaker specifying step of specifying a speaker by comparing the facial feature amount detected in the face detection step and the voice feature amount detected in the voice identification step with a voice feature amount for identifying a speaker's voice and a facial feature amount for identifying that speaker's face, both prepared in advance; a position specifying step of specifying the face position of the specified speaker; a voice recognition step of recognizing a character string from the voice of the specified speaker and generating text data of the character string; a balloon creation step of creating, based on the face position obtained in the position specifying step and the text data generated in the voice recognition step, balloon data for displaying within a display screen the text data of the character string uttered by the specified speaker; and a moving image creation step of creating new moving image data by adding the balloon data to the original moving image data. The balloon creation step includes a balloon editing step of displaying a balloon editing screen for editing, for the balloon data created in the balloon creation step, at least one of the shape, color, pattern, and size of the balloon and the color, size, and typeface of the characters corresponding to the specified speaker. The balloon editing screen includes an image display area for displaying the new moving image data, a text display area for editing the balloon data, a voice recognition operation section for executing voice recognition by the voice recognition step, and a playback operation section for playing back the audio. When the speaker's voice has been recognized in the speaker specifying step but the face cannot be recognized, or when the speaker has left the display screen, the balloon creation step creates, instead of balloon data corresponding to the speaker's face position, data for displaying only the character string as a superimposed caption in an area at the bottom of the display screen.

The present invention can also be realized as a program that causes a computer to execute the above method for creating captions from moving image data including images and audio, or as a computer-readable storage medium storing that program.

According to the present invention, the speaker is specified from the faces and voices in the input moving image data, and balloon data is created from the speaker's position and that speaker's voice. A speech balloon can therefore be displayed near the image of the corresponding speaker, which makes it easy to create and edit speech balloons and superimposed captions.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

The embodiments described below are examples of means for realizing the present invention and should be modified or changed as appropriate according to the configuration of the apparatus to which the invention is applied and various conditions. The present invention is not limited to the following embodiments.

[First Embodiment]
FIG. 1 is a block diagram for realizing the automatic balloon creation/editing processing function of an embodiment of the present invention.

Reference numeral 101 denotes a moving image input unit, which receives the video signal of a moving image as digital data. The input video signal is sent to the face detection unit 103 and the image data unit 111. The face detection unit 103 extracts human faces from the input video signal and calculates their feature amounts. Face detection uses known techniques such as skin color detection, detection of the eyes, nose, and mouth, and face contour detection. The feature amounts are calculated by a known method that combines, for each detected face, the positional relationship and sizes of the eyes, nose, and mouth, their ratios to the face contour, and the like. The face detection unit 103 also determines the size of the face, the position of the mouth, and the orientation of the face, and sends this information to the speaker specifying unit 107 as part of the facial feature amounts.

Reference numeral 102 denotes an audio input unit, which receives the audio signal of the moving image as digital data. The input audio signal is sent to the voice identification unit 104, the voice recognition unit 105, and the audio data unit 113. The voice identification unit 104 calculates voice feature amounts from the input audio signal. The algorithm uses a known feature amount calculation that combines voice frequency characteristics, loudness characteristics, and the like.

The feature amounts calculated by the face detection unit 103 and those calculated by the voice identification unit 104 are sent to the speaker specifying unit 107. The speaker specifying unit 107 specifies the speaker by comparing the feature amounts received from the face detection unit 103 and the voice identification unit 104 with the voice feature amounts and facial feature amounts of individuals registered in the voice/face correspondence unit 106. When the face detection unit 103 recognizes several people, several facial feature amounts are sent; when the voice identification unit 104 recognizes several people, several voice feature amounts are sent. In that case the feature amounts are combined and compared with the voice and facial feature amounts registered in the voice/face correspondence unit 106, so that several speakers can be specified. Once the speaker specifying unit 107 has specified a speaker, information on the position and size of each speaker's face is sent to the position specifying unit 109, and the speaker's identification information is sent to the voice recognition unit 105.
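The comparison performed by the speaker specifying unit 107 can be illustrated with a minimal Python sketch. The description does not specify the matching method, so the Euclidean nearest-neighbor match, the `threshold` value, and the names `RegisteredPerson`, `nearest`, and `identify_speakers` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class RegisteredPerson:
    """One entry of the voice/face correspondence registry (hypothetical layout)."""
    name: str
    face: tuple[float, ...]   # registered facial feature vector
    voice: tuple[float, ...]  # registered voice feature vector


def nearest(features, registry, key, threshold):
    """Return the name of the closest registered person, or None if too far away."""
    best = min(registry, key=lambda p: dist(getattr(p, key), features), default=None)
    if best is None or dist(getattr(best, key), features) > threshold:
        return None
    return best.name


def identify_speakers(face_feats, voice_feats, registry, threshold=0.5):
    """Pair each recognized voice with the index of the matching detected face.

    Returns a list of (speaker name, face index or None). The face index is
    None when the voice is recognized but no matching face is in the frame.
    """
    faces = {}
    for i, feats in enumerate(face_feats):
        name = nearest(feats, registry, "face", threshold)
        if name is not None:
            faces[name] = i
    speakers = []
    for feats in voice_feats:
        name = nearest(feats, registry, "voice", threshold)
        if name is not None:
            speakers.append((name, faces.get(name)))
    return speakers
```

A speaker whose voice matches the registry but whose face index is `None` corresponds to the case in which the voice is recognized while the face is not.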

Reference numeral 105 denotes a voice recognition unit. Based on the speaker identification information notified by the speaker specifying unit 107, it performs voice recognition on the audio signal input from the audio input unit 102 when that signal belongs to a specified speaker. The voice recognition algorithm uses known techniques such as ambient noise removal, feature extraction, and phoneme detection using an acoustic model.

The information decomposed into phonemes by the voice recognition unit 105 is sent to the text conversion unit 108, where lexical analysis and grammatical analysis are performed to generate text data of the uttered character string. The speech-to-text algorithm in the text conversion unit 108 uses known techniques such as lexical analysis, grammatical analysis, and dictionary lookup. The resulting text is sent to the position specifying unit 109 together with the speaker information.

The voice recognition unit 105 and the text conversion unit 108 can perform voice recognition and text conversion separately for each of several speakers. As a result, even when several people appear on screen and speak at the same time, text is generated for each speaker.

The position specifying unit 109 receives the information on the speaker's face position and size from the speaker specifying unit 107, and the text and speaker information from the text conversion unit 108. It combines the speaker's face position (the display position of the balloon or of the superimposed caption) with the text of the speech to generate position specifying information, which it sends to the balloon creation unit 112. The position specifying unit 109 also examines the orientation and size of the speaker's face and the duration of the utterance to decide whether to generate a balloon at the mouth or to display a superimposed caption, and includes that decision in the position specifying information. For example, if a zoom operation causes the size of the speaker's face to change greatly over the duration of the utterance, a balloon at the mouth would make the screen hard to view, so a superimposed caption is displayed at the bottom of the screen instead. If the speaker turns away during the utterance, the unit judges the continuity of the speaker's image to track the speaker and moves the balloon display position from the mouth to the top of the head.

Similarly, if the speaker moves a long way across the screen during the utterance, the balloon would have to move a long way as well and the screen would become hard to view, so a superimposed caption is displayed at the bottom of the screen. If the speaker leaves the screen during the utterance, or disappears from it by, for example, moving behind an object, a balloon is displayed at the mouth while the speaker is on screen and a superimposed caption is displayed at the bottom of the screen once the speaker has disappeared. These relationships between the speaker and the balloon or superimposed caption are only examples, and other combinations are possible.
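The placement rules described above can be summarized in a short Python sketch. The per-frame observation structure `FaceObs`, the function `placements`, and the threshold values are hypothetical; the description gives the rules only qualitatively and does not say how "greatly" is measured.

```python
from dataclasses import dataclass


@dataclass
class FaceObs:
    """Per-frame observation of the speaker's face during one utterance (hypothetical)."""
    visible: bool
    facing_camera: bool = True
    center: tuple[float, float] = (0.0, 0.0)  # normalized screen coordinates
    size: float = 0.0                          # face height / screen height


def placements(obs, size_ratio_limit=1.5, travel_limit=0.3):
    """Return one placement per frame: 'mouth', 'head_top', or 'subtitle'."""
    seen = [o for o in obs if o.visible]
    if seen:
        sizes = [o.size for o in seen]
        xs = [o.center[0] for o in seen]
        ys = [o.center[1] for o in seen]
        # Large size change (e.g. zooming) or large travel: a balloon would be
        # hard to follow, so use a bottom caption for the whole utterance.
        zooming = min(sizes) > 0 and max(sizes) / min(sizes) > size_ratio_limit
        moving = max(max(xs) - min(xs), max(ys) - min(ys)) > travel_limit
        if zooming or moving:
            return ["subtitle"] * len(obs)
    result = []
    for o in obs:
        if not o.visible:
            result.append("subtitle")   # speaker off screen or hidden
        elif not o.facing_camera:
            result.append("head_top")   # speaker turned away: balloon above head
        else:
            result.append("mouth")
    return result
```

Deciding zoom and travel over the whole utterance, but visibility and orientation per frame, mirrors the distinction the text draws between the two kinds of rule.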

When the position specifying information determined by the position specifying unit 109 is sent to the balloon creation unit 112, balloon data for displaying a balloon or superimposed caption is created from the display position in the position specifying information and the text of the speech. The balloon data is described as metadata. The metadata tags specify the start and end frames, duration, effect, font, attributes (font color, background color, and transparency), and balloon shape. These tags are only examples and do not limit the tag types of this embodiment. Because the balloons and superimposed captions are described as metadata, they can be edited as text data rather than as image data, which makes editing easier.
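As an illustration of balloon data held as text-editable metadata, the following Python sketch models the listed tags as a dataclass and serializes it to JSON. The description does not name a metadata format, so JSON, the field names, and the default values are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass
class BalloonAttributes:
    font_color: str = "#000000"
    background_color: str = "#FFFFFF"
    transparency: float = 0.0


@dataclass
class BalloonData:
    """Balloon metadata with the tags listed in the description (names are hypothetical)."""
    speaker: str
    text: str
    start_frame: int
    end_frame: int
    position: tuple[int, int]   # display position in pixels
    shape: str = "rounded"
    effect: str = "none"
    font: str = "sans-serif"
    attributes: BalloonAttributes = field(default_factory=BalloonAttributes)

    @property
    def duration_frames(self) -> int:
        # Both the start and the end frame are displayed, hence the +1.
        return self.end_frame - self.start_frame + 1

    def to_json(self) -> str:
        data = asdict(self)
        data["duration_frames"] = self.duration_frames
        return json.dumps(data, ensure_ascii=False)


balloon = BalloonData("A", "Good morning", 30, 89, (120, 80))
# Editing is a text-level change; no image data is touched.
edited = replace(balloon, text="Good morning!", shape="cloud")
```

Because the balloon is stored as data rather than rendered pixels, an edit such as changing the text or shape only produces a new metadata record.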

The balloon data created by the balloon creation unit 112 is sent to the moving image creation unit 114 in synchronization with the data from the image data unit 111 and the audio data unit 113, and is assembled into a moving image format. Typical moving image format standards include Motion JPEG and MPEG.

Reference numeral 110 denotes a synchronization unit, which synchronizes the video signal and audio signal of the moving image and supplies a synchronization signal to the face detection unit 103, the voice identification unit 104, and the voice recognition unit 105. From the synchronization signal, the face detection unit 103 calculates the time and frame number (hereinafter, the time code) at which it began to recognize a face, and generates information such as the amount of face movement (movement per unit time) and the time the face remains on screen (the time code at which the face could no longer be recognized). The voice identification unit 104 recognizes and identifies the speaker's voice and calculates the start and end time codes of the utterance from the synchronization signal. The voice recognition unit 105 recognizes the speaker's voice and calculates from the synchronization signal the time code at which it began to recognize the speech as words and the end time code of the utterance. These synchronization signals make it possible to determine the speaker's face image and the balloon's display start time code, display duration, and display position. Even if speaker specification, text conversion, position specification, and balloon creation take different amounts of time because of differences in processing speed, the speaker's face, voice, and balloon can be kept from drifting apart. The synchronization unit 110 also sends the synchronization signal to the image data unit 111 and the audio data unit 113 so that the image and audio are synchronized when the moving image file is created.
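The way a balloon's display start time code and display duration could be derived from the utterance's start and end frames is sketched below in Python. The `HH:MM:SS:FF` non-drop-frame representation, the 30 fps default, and the function names are assumptions; the description defines a time code only as a time plus a frame number.

```python
def frame_to_timecode(frame: int, fps: int = 30) -> str:
    """Convert an absolute frame number to a non-drop-frame HH:MM:SS:FF string."""
    seconds, ff = divmod(frame, fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


def balloon_timing(utterance_start: int, utterance_end: int, fps: int = 30):
    """Return (display start time code, display duration in seconds).

    Both arguments are frame numbers; the end frame is inclusive.
    """
    duration = (utterance_end - utterance_start + 1) / fps
    return frame_to_timecode(utterance_start, fps), duration
```

Keying the balloon to frame numbers taken from the shared synchronization signal, rather than to the moment each processing stage finishes, is what keeps the balloon aligned with the face and voice regardless of processing delay.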

FIG. 2 shows the configuration of a video recording/editing apparatus 200 having the automatic balloon creation/editing processing function shown in FIG. 1.

Reference numeral 201 denotes a camera unit. The captured image data is output as an analog signal, converted into dot-sequential digital data by the A/D conversion unit 202, and sent to the image processing unit 203. The image processing unit 203 applies color processing, luminance processing, and the like to the dot-sequential video signal and sends the result to the automatic balloon creation/editing processing unit 100 described above.

Reference numeral 204 denotes a microphone unit, which acquires an audio signal at the same time as shooting and sends it to the A/D conversion unit 205 as analog audio data. The A/D conversion unit 205 converts the analog audio data into digital data in step with the sampling period and sends it to the audio signal processing unit 206. The data processed by the audio signal processing unit 206 is sent to the automatic balloon creation/editing processing unit 100 as time-series digital data.

Reference numeral 207 denotes a control device that governs the entire apparatus 200. It contains a control microcomputer (CPU), program storage memory (ROM, flash memory, RAM, and so on), data storage memory (RAM), and the like, and controls each block in the apparatus 200 as well as the apparatus as a whole.

Reference numeral 208 denotes the operation members of the apparatus 200, which consist of various switches, levers, buttons, and the like, and include the user interface members of the apparatus 200 and detection members such as sensors inside the apparatus. By operating these members, the user can start and stop shooting and playback, make various settings, and perform editing operations.

Reference numeral 210 denotes a recording device, which records the moving image data with balloons created by the automatic balloon creation/editing processing unit 100. The recording device 210 consists of built-in or removable recording means such as a hard disk, a memory card, or a magneto-optical storage medium. Depending on the storage medium, the moving image data may be stored as raw moving image data or recorded in a file format.

Reference numeral 211 denotes a moving image/balloon composition processing unit, which receives moving image data read from the recording device 210 or output from the automatic balloon creation/editing processing unit 100. While keeping the video data and audio data of the input moving image synchronized, it follows the start time code recorded in the balloon data and, from the moment the corresponding frame is displayed, creates the actual balloon from the position information, effect, attributes, font, color, balloon shape, and so on recorded in the balloon data and composites it onto the image. Besides balloons of fixed shape with fixed text, the moving image/balloon composition processing unit 211 can, depending on the effect, change the balloon shape over time, display the characters one after another over time, or gradually change the font color and balloon background color over time.
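The frame-driven behavior of the composition step, including the effect that displays characters one after another, can be sketched in Python as follows. The effect name `"typewriter"`, the linear reveal schedule, and the function `visible_text` are assumptions for illustration; the description does not specify how the effect is timed.

```python
from typing import Optional


def visible_text(text: str, start_frame: int, end_frame: int,
                 frame: int, effect: str = "static") -> Optional[str]:
    """Return the text to draw at `frame`, or None if the balloon is not shown.

    `start_frame` and `end_frame` are inclusive. With the hypothetical
    "typewriter" effect the characters appear one after another and the
    full text is visible on the last frame.
    """
    if not start_frame <= frame <= end_frame:
        return None
    if effect != "typewriter":
        return text
    span = end_frame - start_frame + 1
    elapsed = frame - start_frame + 1
    shown = (len(text) * elapsed + span - 1) // span  # ceiling division
    return text[:shown]
```

A renderer would call such a function once per displayed frame and draw the returned string inside the balloon shape described by the balloon data.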

Reference numeral 209 denotes a synchronization unit. It provides a synchronization signal to the camera unit 201, the A/D conversion unit 202, and the image processing unit 203, where it is used as the sampling rate of the video signal. It also supplies a synchronization signal to the audio A/D conversion unit 205 and the audio signal processing unit 206, where it is used as the sampling rate of the audio signal. The synchronization signal supplied to the automatic balloon creation/editing processing unit 100 is used to synchronize the video and audio signals and is passed on to the synchronization unit 110. The synchronization signal supplied to the moving image/balloon composition processing unit 211 is used to synchronize the video and audio signals during playback and, following the start time code recorded in the balloon data, serves as the timing signal for displaying or erasing the balloon or applying an effect from the moment the corresponding frame is displayed.

The moving image signal composed by the moving image/balloon composition processing unit 211 is supplied to the display device 213 as a video signal and to the speaker 212 as an audio signal. As a result, while the voice of a registered person is being output from the speaker 212, a moving image with a balloon or superimposed caption is displayed on the display device 213.

FIG. 3 is an external view of the video recording/editing apparatus of FIG. 2.

Reference numeral 300 denotes the main body of the video recording/editing apparatus. Reference numeral 301 denotes a shooting button; pressing it starts or stops shooting. When the automatic speech balloon creation/editing function described above is enabled, speech balloons and superimposed captions are automatically created and recorded once shooting starts. Reference numeral 302 denotes an eyepiece (viewfinder) through which the photographer can check the image being shot. Reference numeral 303 denotes a photographing lens, through which shooting is performed. Reference numeral 304 denotes a liquid crystal finder/playback screen on which the image being shot, played-back images, and various settings can be checked. If the automatic speech balloon creation/editing function is enabled, the shot image is displayed with speech balloons or superimposed captions automatically added. Played-back images with speech balloons or superimposed captions added are also displayed. Reference numeral 305 denotes operation switches used for various setting operations and for operations such as playback, fast-forward, and rewind.

Next, the flow of the automatic speech balloon creation/editing process will be described with reference to FIGS. 4 to 9.

FIG. 4 shows the processing along a horizontal time axis, with time advancing to the right.

As the video signal, a video showing only child A (FIG. 5) and a video showing both child A and child B (FIG. 6) are input to the moving image input unit 101. The video showing only child A spans period 401, and the video showing both child A and child B spans period 402. As the audio signal, "Good morning, I'm child A." in child A's voice, "Good morning, I'm child B." in child B's voice, and "I'm child C. Good morning, child A and child B." in child C's voice are input to the voice input unit 102. Child A speaks during period 403, child B during period 404, and child C during period 405. In all other periods, background sound is input to the voice input unit 102.

In the video period 401, in which only child A appears, the face detection unit 103 calculates child A's face feature amount, face orientation, mouth position, and so on, and sends them to the speaker specifying unit 107. The voice/face correspondence unit 106 holds pre-registered pairs of face feature amount and voice feature amount for child A, child B, and child C. In the video period 401, the speaker specifying unit 107 collates child A's face feature amount against the face feature amounts registered in the voice/face correspondence unit 106 and recognizes that child A is present in the screen.

Meanwhile, for the audio signal input to the voice input unit 102, the voice identification unit 104 calculates a voice feature amount at a fixed interval and sends it to the speaker specifying unit 107. The speaker specifying unit 107 collates this voice feature amount against the voice feature amounts registered in the voice/face correspondence unit 106. In period 403, during which child A is speaking, the voice identification unit 104 calculates child A's voice feature amount, and the speaker specifying unit 107 collates it against the registered voice feature amounts and recognizes that child A is the speaker. Thus the speaker specifying unit 107 recognizes that child A is present in the screen in period 401, and that child A is both present in the screen and the speaker in period 403. Throughout period 403, the speaker specifying unit 107 sends identification information indicating that child A is the speaker to the voice recognition unit 105. This identification information consists of the period during which child A is the speaker (period 403) and data including child A's pre-registered voice feature amount. Using the identification information, the voice recognition unit 105 extracts child A's voice in period 403 from the audio signal sent from the voice input unit 102 and performs voice recognition to extract the phonemes of child A's speech. From the phoneme data extracted by the voice recognition unit 105, the text conversion unit 108 converts what child A said in period 403, "Good morning, I'm child A.", into text. Because speaker identification by the speaker specifying unit 107 takes time, the voice identification unit 104 and the voice recognition unit 105 accumulate (store) a fixed duration of audio so that voice identification and voice recognition can be performed again, going back to the utterance start time of the speaker identified by the speaker specifying unit 107.
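The audio accumulation described above can be pictured as a bounded history of recent audio chunks from which a span starting at the utterance start time is re-read later. The following is only a minimal sketch under assumptions: the class name `AudioHistory`, the chunk layout, and the chunk-level time resolution are hypothetical and are not taken from the specification.

```python
from collections import deque


class AudioHistory:
    """Keeps only the most recent audio chunks so a past span can be re-read."""

    def __init__(self, max_chunks):
        # Each entry is (start_time_sec, samples); the oldest entry is dropped
        # automatically once max_chunks entries are held.
        self._chunks = deque(maxlen=max_chunks)

    def push(self, start_time, samples):
        self._chunks.append((start_time, samples))

    def since(self, utterance_start):
        """Return buffered samples from chunks starting at or after utterance_start."""
        out = []
        for start_time, samples in self._chunks:
            if start_time >= utterance_start:
                out.extend(samples)
        return out


# Six half-second chunks arrive, but only the last four are retained.
history = AudioHistory(max_chunks=4)
for i in range(6):
    history.push(start_time=i * 0.5, samples=[i, i])

# Once the speaker is identified, go back to the utterance start (1.5 s here).
replay = history.since(1.5)
```

The buffer length bounds how late speaker identification may complete: audio older than the retained window can no longer be revisited.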

The speaker specifying unit 107 has recognized that child A is present in the screen in period 403 and has identified the positions of the face and mouth, and the text conversion unit 108 has finished converting what child A said into text. Based on this information, the position specifying unit 109 decides that the balloon is to be placed at child A's mouth and sends position specifying information to the balloon creation unit 112.

Based on the received position specifying information, the balloon creation unit 112 creates metadata for displaying a speech balloon containing child A's utterance, "Good morning, I'm child A.", at child A's mouth. Reference numeral 406 shows this metadata rendered in display form. When the balloon is created, a holding time can be set so that the balloon does not vanish the instant child A stops speaking; keeping it on screen for a while afterward makes the content easier to read. It is also possible for the voice recognition unit 105 to perform syllable segmentation and for the text conversion unit 108 to display the text progressively, word by word or sound by sound. Furthermore, a duration can be allotted to each word according to the utterance time, and the text displayed progressively, character by character, in step with that timing. In addition, by associating the shape, color, effects, and so on of the text and balloon with a speaker when the speaker is registered in the voice/face correspondence unit 106, balloons with characteristics specific to each speaker can be created.
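The holding time and the progressive display of text can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `Balloon` record, the default holding time of 1.5 seconds, and the rule of splitting the utterance duration across words in proportion to their length do not appear in the specification, which leaves the allotment rule open.

```python
from dataclasses import dataclass


@dataclass
class Balloon:
    text: str
    start: float   # utterance start, in seconds
    end: float     # display end = utterance end + holding time
    anchor: tuple  # (x, y) position of the speaker's mouth


def make_balloon(text, utter_start, utter_end, mouth_xy, hold=1.5):
    """Keep the balloon visible for `hold` seconds after the utterance ends."""
    return Balloon(text, utter_start, utter_end + hold, mouth_xy)


def reveal_times(words, utter_start, utter_end):
    """Assign each word a reveal time, splitting the duration by word length."""
    total = sum(len(w) for w in words)
    duration = utter_end - utter_start
    times, t = [], utter_start
    for w in words:
        times.append((w, t))
        t += duration * len(w) / total
    return times


balloon = make_balloon("Good morning", 2.0, 4.0, (120, 80), hold=1.0)
schedule = reveal_times(["ab", "cdef"], 0.0, 3.0)
```

A renderer would show each word once playback time reaches its reveal time and remove the balloon at `balloon.end`.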

Next, the video period 402, in which child A and child B appear, will be described.

The face detection unit 103 calculates the face feature amount, face orientation, mouth position, and so on for both child A and child B, and sends the data for the two persons to the speaker specifying unit 107. In the video period 402, the speaker specifying unit 107 collates the face feature amounts of child A and child B against those registered in the voice/face correspondence unit 106 and recognizes that both child A and child B are present in the screen.

In period 404, during which child B is speaking, the voice identification unit 104 calculates child B's voice feature amount, and the speaker specifying unit 107 collates it against the voice feature amounts registered in the voice/face correspondence unit 106 and recognizes that child B is the speaker. Likewise, in period 405, during which child C is speaking, the voice identification unit 104 calculates child C's voice feature amount, and the speaker specifying unit 107 collates it against the registered voice feature amounts and recognizes that child C is the speaker. In this way, the speaker specifying unit 107 recognizes that in period 404 child B is present in the screen and is the speaker, and that in period 405 child C is the speaker although child C is not present in the screen.
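The collation of a computed voice feature amount against the registered ones, combined with the face detection result, can be sketched as a nearest-match lookup. This is a hedged illustration only: the specification does not say how feature amounts are compared, so the Euclidean distance, the threshold, and the names `closest_speaker` and `identify` are all assumptions.

```python
import math


def closest_speaker(features, registered, threshold):
    """Return the registered name whose features are nearest, or None if none
    lies within `threshold` (Euclidean distance is an assumed metric)."""
    best_name, best_dist = None, threshold
    for name, ref in registered.items():
        dist = math.dist(features, ref)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


def identify(voice_features, visible_names, registered_voices, threshold=0.5):
    """Combine the voice match with the set of faces recognized on screen."""
    name = closest_speaker(voice_features, registered_voices, threshold)
    if name is None:
        return None  # background sound or an unregistered voice
    return {"speaker": name, "on_screen": name in visible_names}


registered_voices = {"A": [0.0, 0.0], "B": [1.0, 1.0], "C": [3.0, 0.0]}
visible = {"A", "B"}  # faces recognized in video period 402

result_b = identify([0.9, 1.1], visible, registered_voices)  # period 404
result_c = identify([3.1, 0.0], visible, registered_voices)  # period 405
result_bg = identify([9.0, 9.0], visible, registered_voices)
```

The `on_screen` flag is what later distinguishes a balloon at the speaker's mouth from a caption at the bottom of the screen.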

Throughout period 404, the speaker specifying unit 107 sends identification information indicating that child B is the speaker to the voice recognition unit 105. Using this identification information, the voice recognition unit 105 extracts child B's voice in period 404 from the audio signal sent from the voice input unit 102 and performs voice recognition to extract the phonemes of child B's speech. From the phoneme data extracted by the voice recognition unit 105, the text conversion unit 108 converts what child B said in period 404, "Good morning, I'm child B.", into text. In the same way, what child C said in period 405, "I'm child C. Good morning, child A and child B.", is converted into text.

The speaker specifying unit 107 has recognized that child B is present in the screen in period 404 and has identified the positions of the face and mouth, and the text conversion unit 108 has finished converting what child B said into text. Based on this information, the position specifying unit 109 decides that the balloon is to be placed at child B's mouth and sends position specifying information to the balloon creation unit 112.

On the other hand, the speaker specifying unit 107 has recognized that child C is not present in the screen in period 405, and the text conversion unit 108 has finished converting what child C said into text. Based on this information, the position specifying unit 109 decides to use a superimposed caption at the bottom of the screen instead of a speech balloon, and sends position specifying information to the balloon creation unit 112. The video in the present embodiment does not include cases where the speaker leaves the screen while speaking, turns away from the camera, or changes greatly in size or position within the screen, but the processing described earlier may be applied in such cases.
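The placement rule described for periods 403 to 405 can be summarized in a few lines: a balloon at the mouth when the speaker's face is on screen, otherwise a caption at the bottom. The sketch below assumes a hypothetical function name, a dictionary of detected mouth positions, and an arbitrary caption anchor at 90% of the frame height; none of these details come from the specification.

```python
def decide_placement(speaker, faces_on_screen, frame_size):
    """Choose balloon or caption placement for one utterance.

    faces_on_screen maps a recognized person's name to the (x, y) position
    of that person's mouth; frame_size is (width, height) in pixels.
    """
    width, height = frame_size
    mouth = faces_on_screen.get(speaker)
    if mouth is not None:
        # Speaker is visible: anchor a speech balloon at the mouth.
        return {"kind": "balloon", "anchor": mouth}
    # Speaker is not visible: fall back to a caption near the bottom edge.
    return {"kind": "caption", "anchor": (width // 2, height * 9 // 10)}


faces = {"A": (100, 120), "B": (300, 130)}
placement_b = decide_placement("B", faces, (640, 480))
placement_c = decide_placement("C", faces, (640, 480))
```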

Based on the position specifying information sent for period 404, in which child B spoke, the balloon creation unit 112 creates metadata for displaying a speech balloon containing child B's utterance, "Good morning, I'm child B.", at child B's mouth. Reference numeral 407 shows this metadata rendered in display form. Likewise, based on the position specifying information sent for period 405, in which child C spoke, it creates metadata for displaying a superimposed caption containing child C's utterance, "I'm child C. Good morning, child A and child B.", at the bottom of the screen. Reference numeral 408 shows this metadata rendered in display form.

The balloon data created in time series by the balloon creation unit 112 in this way is assembled into moving image data by the moving image creation unit 114, together with the data of the image data unit 111 and the audio data unit 113, based on the synchronization signal provided by the synchronization unit 110.

When the moving image data assembled in this way is sent to the moving image/balloon synthesis processing unit 211, the video and audio signals are played back in synchronization; the video signal is sent to the display device 213 and the audio signal to the speaker 212. In the video period 401, in which child A appears, no speech balloon is generated while no registered person's voice is detected (section 1).

In section 1, an image showing child A is displayed, as in FIG. 5. Within the video period 401, in which child A appears, the balloon 406 is generated, composited onto the video signal, and displayed during the period that includes period 403, in which child A speaks (the display period defined by the balloon creation unit 112) (section 2).

In section 2, child A appears and a speech balloon extends from child A's mouth, as in FIG. 6. In the video period 402, in which child A and child B appear, no speech balloon is generated while no registered person's voice is detected (section 3).

In section 3, an image showing child A and child B is displayed, as in FIG. 7. Within the video period 402, in which child A and child B appear, the balloon 407 is generated, composited onto the video signal, and displayed during the period that includes period 404, in which child B speaks (the display period defined by the balloon creation unit 112) (section 4).

In section 4, child A and child B appear and a speech balloon extends from child B's mouth, as in FIG. 8. Within the video period 402, once child B has finished speaking and the display period defined by the balloon creation unit 112 has elapsed, no registered person's voice is detected and no speech balloon is generated (section 5).

In section 5, an image showing child A and child B is displayed, as in FIG. 7. Within the video period 402, in which child A and child B appear, the superimposed caption 408 is generated, composited onto the video signal, and displayed during the period that includes period 405, in which child C speaks (the display period defined by the balloon creation unit 112) (section 6).

In section 6, child A and child B appear and a superimposed caption is displayed at the bottom of the screen, as in FIG. 9.
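Sections 1 to 6 amount to asking, at each playback time, which overlay's display period contains that time. The sketch below models this lookup. The numeric times are invented for illustration because the specification gives none; only the reference numerals 406 to 408 and the balloon/caption distinction come from the text.

```python
def active_overlays(overlays, t):
    """Return the reference numerals of the overlays displayed at time t (seconds)."""
    return [o["ref"] for o in overlays if o["start"] <= t < o["end"]]


# Hypothetical display periods (utterance period plus holding time).
overlays = [
    {"ref": 406, "kind": "balloon", "start": 2.0, "end": 5.0},    # section 2
    {"ref": 407, "kind": "balloon", "start": 9.0, "end": 12.0},   # section 4
    {"ref": 408, "kind": "caption", "start": 14.0, "end": 18.0},  # section 6
]

# One sample time inside each of sections 1 to 6.
timeline = {t: active_overlays(overlays, t) for t in (1.0, 3.0, 7.0, 10.0, 13.0, 15.0)}
```

Sections 1, 3, and 5 yield an empty list, which corresponds to the plain video being displayed without any balloon or caption.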

As described above, according to the present embodiment, face detection and voice identification are performed against pre-registered voice/face correspondence data to identify the speaker, and the speaker's lines are automatically converted into text by voice recognition, so that speech balloons and superimposed captions can be created easily.

[Second Embodiment]

FIG. 17 illustrates a moving image data editing apparatus provided with software that implements the automatic speech balloon creation/editing function of an embodiment according to the present invention.

In the present embodiment, the moving image data editing apparatus is realized by a personal computer 600 having a display device 601, a keyboard 602, and a mouse 603, but it may instead be a video recording/editing apparatus (of the magnetic tape, magneto-optical disk, optical disk, or magnetic disk recording type, for example) or a dedicated editing apparatus.

FIG. 12 illustrates a display screen shown on the display device 601 by the software that implements the automatic speech balloon creation/editing process of the present embodiment.

Reference numeral 501 denotes an area that displays the video of the moving image being edited. Reference numeral 502 denotes an area that displays a list of the speakers registered in the voice/face correspondence unit 106. Reference numerals 503, 504, and 505 denote areas in which the information for each registered speaker is displayed. Reference numeral 506 denotes the speaker's in-image presence indicator; for the moving image displayed in the video area 501, it marks the speaker whose face has been recognized by the speaker specifying unit 107 in the video signal input through the moving image input unit 101. In other words, it indicates that the person registered in speaker information 503 currently appears in the video area 501. Reference numeral 508 denotes the speaker's utterance state indicator; for the moving image displayed in the video area 501, it marks the speaker whose voice has been recognized by the speaker specifying unit 107 in the audio signal input through the voice input unit 102. In other words, it indicates that the person registered in speaker information 503 is currently speaking. Reference numeral 507 denotes a slider for scrolling the speakers in the speaker list display area 502. Reference numeral 509 denotes a slider indicating the position within the moving image data displayed in the video area 501; moving the slider lever moves to an arbitrary position within the moving image data. Reference numeral 510 indicates the level of the audio signal of the moving image data input to the voice input unit 102. Reference numeral 511 indicates the time code within the moving image data at which the speaker specifying unit 107 detected a registered speaker (detection start point). Reference numeral 512 indicates the time code within the moving image data at which the utterance of the registered speaker ended, as determined by the speaker specifying unit 107 (detection end point). Reference numeral 513 indicates the time code within the moving image data of the video currently displayed in the video area 501. Reference numeral 514 indicates the operating state of the application. The application states include "voice/face identification in progress" and "voice recognition/text conversion (dictation) in progress". Reference numeral 515 denotes the voice/face identification start button. Pressing this button starts voice and face identification, and speaker identification is performed. Reference numeral 516 denotes a preview button, with which speech balloons and superimposed captions that were created automatically or edited by the user can be composited with the moving image data and played back. The screen shown here is only an example for explaining the present embodiment and does not limit its functions.

Next, the operation of the software of the present embodiment will be described with reference to flowcharts and display screen examples.

FIG. 10 is a flowchart showing the voice/face correspondence data registration process performed by the software that implements the automatic speech balloon creation/editing function. FIG. 13 shows an example of the display screen during the voice/face correspondence registration process.

Note that before processing by this software is executed, voice feature amounts and face feature amounts must be associated with each other.

First, when voice/face correspondence registration is started (S100), the voice/face correspondence registration screen 520 is displayed and the process enters the person name input step (S101). In the person name input step (S101), a person's name is entered in the person name input field 521 of the registration screen 520. Next, the face image capture step (S102) is performed to register the person's face feature amount. In the face image capture step (S102), pressing the face image capture button 526 captures a face image; the captured image is displayed in the face display area 522 and the face feature amount calculation step (S103) is executed. Next, the voice capture step (S104) is performed to register the person's voice feature amount. In the voice capture step (S104), pressing the voice capture button 527 captures audio; the level of the captured audio is displayed in the audio level display area 525 and the voice feature amount calculation step (S105) is executed. In the present embodiment, face feature registration for face identification and voice feature registration for voice identification are each executed only once, but they may be executed multiple times. For example, calculating face feature amounts for the frontal, left and right oblique, and up and down oblique directions improves the identification rate even when the speaker is not facing the front. Similarly, calculating voice feature amounts over several words and at varying loudness improves the identification rate under a variety of conditions.

When calculation of the face feature amount and the voice feature amount is complete, the balloon setting step (S106) and the superimposed caption setting step (S107) are performed. In the balloon setting step (S106), the balloon property setting items S110 are set. Pressing the balloon property setting button 523 displays the balloon property setting screen 530. The setting screen provides tabs 531, 532, and 533, one for each of the balloon property setting items S110; the user selects the tab of the item to be set and configures it. FIG. 13 shows the setting screen of the tab 531 for selecting the balloon shape. The selection list 535 shows several available shapes, from which the preferred shape is selected. Similarly, in the superimposed caption setting step (S107), the caption property setting items S111 are set. Pressing the caption property setting button 524 displays the caption property setting screen, on which the caption property setting items S111 are configured. The balloon property setting items S110 and the caption property setting items S111 of the present embodiment are only examples; other setting items may be provided, and they do not limit the content of this proposal.

When the voice and face feature amount calculation, the balloon settings, and the superimposed caption settings are complete, the recording confirmation step (S108) is performed. If recording is approved, the voice/face correspondence recording step (S109) is executed and the data is registered in the voice/face correspondence unit 106.
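The outcome of steps S101 to S109 can be thought of as one record per person that ties the two feature amounts to that person's display settings. The following is a sketch under assumptions: the class and field names, the default values, and the reuse of one style class for both balloons and captions are hypothetical simplifications, and the feature amounts are shown as plain lists of numbers.

```python
from dataclasses import dataclass, field


@dataclass
class OverlayStyle:
    # Property names follow the items the text lists for balloons; the
    # caption reuses the same class here purely for brevity.
    shape: str = "rounded"
    background: str = "white"
    font: str = "sans"
    text_color: str = "black"
    opacity: float = 1.0
    effect: str = "none"
    hold_seconds: float = 1.5


@dataclass
class SpeakerRecord:
    name: str                 # S101: person name
    face_features: list       # S103: calculated face feature amount
    voice_features: list      # S105: calculated voice feature amount
    balloon: OverlayStyle = field(default_factory=OverlayStyle)  # S106
    caption: OverlayStyle = field(default_factory=OverlayStyle)  # S107


def register(registry, record, confirmed):
    """Store the record only when the user confirms recording (S108 then S109)."""
    if confirmed:
        registry[record.name] = record
    return confirmed


registry = {}
record_a = SpeakerRecord("A", face_features=[0.1, 0.2], voice_features=[0.3, 0.4])
record_a.balloon.shape = "cloud"
stored = register(registry, record_a, confirmed=True)

discarded_registry = {}
discarded = register(discarded_registry, record_a, confirmed=False)
```

Keeping the styles inside the record lets balloon creation look up the settings by the identified speaker's name alone.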

FIG. 11 is a flowchart showing balloon creation (S120) and balloon editing (S140).

When balloon creation (S120) is started, the moving image input step (S121) is executed first. For example, File (F) is selected to read an existing moving image file, or File (F) is selected to read a moving image from an external input (an external moving image playback device, video camera, video deck, DVD player, or the like).

Once the moving image input has been decided, the speaker detection start step (S122) is executed. When the speaker detection start button 515 is pressed on the screen of FIG. 12, the video data of the moving image data specified in the moving image input step (S121) is input to the moving image input unit 101, and the audio data is input to the voice input unit 102. The input video signal is sent to the face detection unit 103 and the speaker specifying unit 107. The input audio signal is sent to the voice identification unit 104 and the speaker specifying unit 107.

When speaker detection is started by the speaker detection start step (S122), the process enters the speaker identifying step (S123). In this step the status display 514 shows "Identifying speaker". The face and voice feature amounts of the moving image input to the speaker specifying unit 107 are collated against the face and voice feature amounts of the speakers registered in the voice/face correspondence unit 106, and the speaker is identified. When the speaker specifying unit 107 detects a speaker whose voice feature amount matches, the utterance start time code, the utterance end time code, the speaker's name, and the face recognition state are reported to the application, and the speaker detection end step (S124) is executed.

In the speaker detection end step (S124), input from the moving image data to the moving image input unit 101 and the voice input unit 102 stops, and the speaker identification processing of the speaker specifying unit 107 also stops. The utterance start time code 511 and the utterance end time code 512 for the identified speaker are displayed. In addition, the speaker's in-image presence state 506 and the speaker's utterance state 508 are displayed according to the speaker's name and the face recognition state. FIG. 12 shows the state in which child A appears on the screen and has said "Good morning, I'm child A." in child A's voice.

When speaker detection ends (S124), the voice recognition step (S125) starts. In the voice recognition step (S125), the status display 514 shows "Recognizing voice". Using child A's utterance start time code and utterance end time code, the audio signal for that time interval is input again from the moving image data to the voice input unit 102, the voice recognition unit 105 performs voice recognition, and the text conversion unit 108 performs the text conversion step (S126). In this embodiment the time interval is read again from the moving image data; alternatively, the voice input unit 102 or the voice identification unit 105 may hold the audio data for a fixed past period, and that audio data may be used for voice recognition and text conversion.
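
The alternative mentioned above, in which audio for a fixed past period is held instead of being re-read from the moving image data, could look roughly like the following. This is a sketch under assumptions: the class name `AudioHistory`, the use of absolute sample indices in place of time codes, and the fixed-length ring buffer are illustrative choices that the source does not specify.

```python
from collections import deque


class AudioHistory:
    """Keeps only the most recent `seconds` of audio samples (hypothetical helper)."""

    def __init__(self, sample_rate: int, seconds: int):
        self.sample_rate = sample_rate
        self._buf = deque(maxlen=sample_rate * seconds)  # old samples fall off the front
        self._total = 0  # number of samples pushed so far

    def push(self, samples) -> None:
        """Append newly captured samples."""
        self._buf.extend(samples)
        self._total += len(samples)

    def segment(self, start: int, end: int):
        """Return samples in [start, end) by absolute index, or None if no longer held."""
        oldest = self._total - len(self._buf)
        if start < oldest or end > self._total or start > end:
            return None
        buf = list(self._buf)
        return buf[start - oldest:end - oldest]
```

If `segment` returns `None`, the utterance is older than the retained period, and the interval would have to be read again from the moving image data as in the embodiment.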

When the voice recognition step (S125) and the text conversion step (S126) are complete, the automatic balloon creation step (S127) is executed. In this step, based on the speaker's name and the face recognition state identified in the speaker identifying step (S123), a speech balloon is created automatically if the speaker is present in the screen, and a caption is created automatically if the speaker is not. The position specifying unit 109 determines the display position of the balloon or caption from the speaker's name and the face recognition state detected by the speaker identification unit 107. The display position determined by the position specifying unit 109 and the voice information converted to text by the text conversion unit 108 in the text conversion step (S126) are input to the balloon creation unit 112, which creates the balloon or caption.

When the balloon creation unit 112 creates a balloon, it creates the balloon data based on the balloon property settings registered in the voice/face correspondence unit 106 for that speaker's name: balloon shape, balloon background color, character font, character color, balloon transparency, effect, and display hold time. When it creates a caption, it creates balloon data as a caption based on the caption property settings registered in the voice/face correspondence unit 106 for that speaker's name: caption background color, character font, character color, caption transparency, effect, and display hold time.
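
The choice between a balloon and a caption in step S127 can be illustrated with a short sketch. It assumes that a missing face position (`None`) stands for "speaker not present in the screen" and that per-speaker settings are stored as plain dictionaries; the names `BalloonData` and `create_balloon_data` and the dictionary keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BalloonData:
    kind: str                             # "balloon" or "caption"
    text: str                             # recognized utterance text
    position: Optional[Tuple[int, int]]   # face position for a balloon, None for a caption
    properties: dict                      # display settings taken from the speaker's registration


def create_balloon_data(text: str,
                        face_position: Optional[Tuple[int, int]],
                        speaker_settings: dict) -> BalloonData:
    """Create a balloon if the speaker is on screen, otherwise a caption."""
    if face_position is not None:
        return BalloonData("balloon", text, face_position,
                           dict(speaker_settings["balloon"]))
    return BalloonData("caption", text, None,
                       dict(speaker_settings["caption"]))
```

Both branches produce the same record type, which mirrors the description's statement that a caption is also created as balloon data.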

The balloon display step (S128) is executed using the balloon data created in the automatic balloon creation step (S127). When a balloon is to be displayed, the balloon 540 is created from the balloon data and displayed in the video area 501 of FIG. 14. When a caption is to be displayed, the caption 560 is created from the balloon data and displayed in the video area 501 of FIG. 15.

After the balloon display step (S128), the balloon editing step (S129) is executed. In the balloon editing step (S129, S140), the balloon data created in the automatic balloon creation step (S127) is edited through the character confirmation step (S141), the character correction step (S144), and the balloon setting change step (S145).

The balloon editing process and the caption editing process are described below.

FIG. 14 shows the balloon edit screen. When the balloon editing step S140 is executed and the balloon data is a balloon, the balloon edit screen 541 is displayed. The balloon edit screen 541 comprises an image confirmation area 542, a text display/edit area 543, a slider 544 for displaying and moving the display position within the balloon display period, balloon property settings 545, the speaker's name 546, the utterance start time code 547 and utterance end time code 548, a voice re-recognition button 549, a voice playback button 550, and a confirmation button 551. FIG. 14 shows the balloon edit screen in a state where child A is saying "Good morning, this is child A."

In the character confirmation step (S141), the text is acquired from the balloon data and displayed in the text display/edit area 543. In the correction confirmation step (S142), the user decides whether correction is needed. If it is, the user can press the voice playback button 550 as necessary to play back the audio from the moving image data, from the utterance start time code 547 to the utterance end time code 548, in the voice playback step (S143). Moving the slider 544 plays back the audio from any position within the utterance period. While listening to the audio, the user can edit and correct the text displayed in the text display/edit area 543 in the character correction step (S144). The voice re-recognition button 549 can also be used to perform voice recognition (S125) and text conversion (S126) again.

Once the balloon's display text has been confirmed, the balloon setting change step (S145) is executed as necessary. The balloon data holds a copy of child A's balloon property settings registered in the voice/face correspondence unit 106. Changing the contents of the balloon property settings 545 changes the balloon property settings of the individual balloon data. The settings changed here apply only to the "Good morning, this is child A." balloon and do not affect child A's balloon property settings registered in the voice/face correspondence unit 106. When editing is finished, pressing the confirmation button 551 completes the balloon editing step S129.
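
The copy behavior described above, where each balloon carries its own copy of the speaker's registered settings, can be shown in a few lines. This is a sketch assuming the settings are nested dictionaries; `new_balloon_properties` and the keys used are hypothetical.

```python
import copy


def new_balloon_properties(registry: dict, speaker: str) -> dict:
    """Return an independent copy of the speaker's registered balloon settings."""
    return copy.deepcopy(registry[speaker]["balloon"])


# Registered defaults for one speaker (illustrative values only).
registry = {"A": {"balloon": {"shape": "rounded", "background": "white"}}}

props = new_balloon_properties(registry, "A")
props["background"] = "yellow"  # edit this one balloon only
```

Because the copy is deep, changing `props` leaves `registry` untouched, which corresponds to an edit that applies to a single balloon but not to the registered settings.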

FIG. 15 shows the caption edit screen. When the balloon editing step S140 is executed and the balloon data is a caption, the caption edit screen 561 is displayed. The caption edit screen 561 comprises an image confirmation area 562, a text display/edit area 563, a slider 564 for displaying and moving the display position within the balloon display period, caption property settings 565, the speaker's name 566, the utterance start time code 567 and utterance end time code 568, a voice re-recognition button 569, a voice playback button 570, and a confirmation button 571. FIG. 15 shows the caption edit screen in a state where child C is saying "This is child C. Good morning, child A and child B."

In the character confirmation step (S141), the text is acquired from the balloon data and displayed in the text display/edit area 563. In the correction confirmation step (S142), the user decides whether correction is needed. If it is, the user can press the voice playback button 570 as necessary to play back the audio from the moving image data, from the utterance start time code 567 to the utterance end time code 568, in the voice playback step (S143). Moving the slider 564 plays back the audio from any position within the utterance period. While listening to the audio, the user can edit and correct the text displayed in the text display/edit area 563 in the character correction step (S144). The voice re-recognition button 569 can also be used to perform voice recognition (S125) and text conversion (S126) again.

Once the caption's display text has been confirmed, the balloon setting change step (S145) is executed as necessary. The balloon data holds a copy of child C's caption property settings registered in the voice/face correspondence unit 106. Changing the contents of the caption property settings 565 changes the caption property settings of the individual balloon data. The settings changed here apply only to the "This is child C. Good morning, child A and child B." caption and do not affect child C's caption property settings registered in the voice/face correspondence unit 106. When editing is finished, pressing the confirmation button 571 completes the balloon editing step (S129).

In the balloon editing step (S129), when the speaker is present in the screen, the balloon 540 of FIG. 14 is displayed in the video area 501; selecting the balloon 540 allows its position, orientation, and size to be adjusted. Selecting the balloon 540 also allows it to be changed to a caption. The balloon editing procedure and screens described here are only an example, and the present invention is not limited to them. For example, the processing from speaker detection to automatic balloon creation may be executed on the entire moving image data, after which the individual balloons and captions are edited.

When the balloon editing step (S129) is complete, the edited balloon can be checked in the preview display step (S130). Pressing the preview button 516 in FIG. 12 displays the preview screen.

FIG. 16 shows the preview screen.

Reference numeral 580 denotes a video area that displays the image obtained by combining the video and the balloon. 581 is the time code of the video displayed in the video area 580. 582 to 586 are operation buttons for playback: 582 is the button for moving to the immediately preceding utterance start time code, 583 is the rewind button, 584 is the play button, 585 is the fast-forward button, and 586 is the button for moving to the immediately following utterance start time code. 587 is the balloon information window, in which the slider 592 can be used to display balloon information for any range within the moving image data. 588 is the time code scale, which displays each registered speaker's in-image presence start time code, presence end time code, utterance start time code, and utterance end time code. In the illustrated example, 01:12:20 14 is the time code at which child A begins to appear, 01:12:21 05 is child A's utterance start time code, 01:12:24 12 is child A's utterance end time code, 01:12:26 02 is the time code at which child A and child B begin to appear, and 01:22:27 15 is child B's utterance start time code.

589 denotes index images at the registered speakers' in-image presence start time codes and utterance start time codes. 590 is the balloon information display, which shows the text displayed in the balloon and the balloon's display time. The balloon display time is the utterance time plus the display hold time. 591 is the caption information display, which shows the text displayed in the caption and the caption's display time. The caption display time is likewise the utterance time plus the display hold time.
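
The rule that the display time equals the utterance time plus the display hold time can be worked through with time code arithmetic. The sketch below assumes time codes of the form `HH:MM:SS FF` and a frame rate of 30 frames per second; the source gives neither the frame rate nor a hold-time value, and the function names are hypothetical.

```python
FPS = 30  # assumed frame rate; not stated in the source


def timecode_to_frames(tc: str, fps: int = FPS) -> int:
    """Convert 'HH:MM:SS FF' to a total frame count."""
    hms, frames = tc.split(" ")
    h, m, s = (int(p) for p in hms.split(":"))
    return ((h * 60 + m) * 60 + s) * fps + int(frames)


def frames_to_timecode(total: int, fps: int = FPS) -> str:
    """Convert a total frame count back to 'HH:MM:SS FF'."""
    seconds, frames = divmod(total, fps)
    minutes, s = divmod(seconds, 60)
    h, m = divmod(minutes, 60)
    return f"{h:02d}:{m:02d}:{s:02d} {frames:02d}"


def display_interval(start_tc: str, end_tc: str, hold_frames: int, fps: int = FPS):
    """Balloon is shown from utterance start until utterance end plus the hold time."""
    start = timecode_to_frames(start_tc, fps)
    end = timecode_to_frames(end_tc, fps) + hold_frames
    return frames_to_timecode(start, fps), frames_to_timecode(end, fps)
```

With child A's utterance from 01:12:21 05 to 01:12:24 12 and an assumed hold time of 30 frames (one second at 30 fps), `display_interval` returns `("01:12:21 05", "01:12:25 12")`.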

On the preview screen, pressing the play button 584 starts playback with balloons from the current time code, so that the content and effects of the balloons can be checked. Pressing the rewind button 583 plays in the reverse direction; pressing it two or more times increases the rewind speed. Pressing the fast-forward button 585 plays in the forward direction; pressing it two or more times increases the fast-forward speed. The button 582 for moving to the immediately preceding utterance start time code returns to the time code at which a speaker started speaking immediately before the current time code. The button 586 for moving to the immediately following utterance start time code advances to the time code at which a speaker started speaking immediately after the current time code. These buttons are assigned to moving to a speaker's utterance start time code, but they may also be made assignable as buttons for moving to a speaker's in-image presence start time code, presence end time code, utterance start time code, or utterance end time code.
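
The behavior of buttons 582 and 586 amounts to a lookup in a sorted list of utterance start positions. A minimal sketch, assuming the start time codes are available as a sorted list of frame counts (the function names are hypothetical, and the source does not describe how the lookup is implemented):

```python
from bisect import bisect_left, bisect_right


def previous_start(starts, current):
    """Latest utterance start strictly before `current`, or None if there is none."""
    i = bisect_left(starts, current)
    return starts[i - 1] if i > 0 else None


def next_start(starts, current):
    """Earliest utterance start strictly after `current`, or None if there is none."""
    i = bisect_right(starts, current)
    return starts[i] if i < len(starts) else None
```

Using strict comparisons means that pressing a button while already positioned on an utterance start moves to the neighboring utterance rather than staying in place.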

On the preview screen, specifying any time code on the time code scale 588, or an index image 589, calls up the image at the specified time code together with its balloon or caption.

On the preview screen, selecting the text portion of the balloon information display 590 calls up the balloon edit screen 541. Moving the left end of the text portion of the balloon information display 590 adjusts the balloon's display start time code earlier or later, and moving the right end adjusts its display end time code earlier or later. Likewise, selecting the text portion of the caption information display 591 calls up the caption edit screen 561. Moving the left end of the text portion of the caption information display 591 adjusts the caption's display start time code earlier or later, and moving the right end adjusts its display end time code earlier or later. Furthermore, moving the text portion of the balloon information display 590 to the caption information display 591 switches the display from a balloon to a caption.
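
Adjusting the display start and end by moving the left and right ends of the text portion could be modeled as below. Positions are frame counts. The clamping that keeps the start before the end is an assumption added for the sketch; the source does not say how conflicting positions are handled, and `drag_edge` is a hypothetical name.

```python
def drag_edge(start: int, end: int, edge: str, new_frame: int):
    """Return (start, end) after moving the 'left' or 'right' edge to new_frame.

    The moved edge is clamped so the interval always keeps at least one frame.
    """
    if edge == "left":
        start = min(new_frame, end - 1)
    elif edge == "right":
        end = max(new_frame, start + 1)
    else:
        raise ValueError("edge must be 'left' or 'right'")
    return start, end
```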

The balloon edited in the preview display step (S130) is checked (S131). If correction is needed, the process returns to the balloon editing step (S129); if not, the editing end confirmation step (S132) is performed. If there are further balloons to create, the process returns to the speaker detection start step (S122) and the next speaker is detected. When the editing end confirmation step (S132) is complete, balloon creation is finished, and in the moving image creation step (S133) the moving image creation unit 114 combines the image data 111, the audio data 113, and the balloon data into moving image data. The created moving image data is saved in the moving image output step (S134). For example, File (F) is selected to write the data as a new moving image file, or File (F) is selected to write the moving image to an external output (an external moving image recording device, video camera, video deck, DVD recorder, or the like).

According to the above embodiment, the speaker is identified from the faces and voices in the input moving image data, and balloon data is created from the speaker's position and the speaker's voice. A balloon can therefore be displayed near the image of the speaker, and balloons and captions become easier to create and edit.

In addition, because speakers can be identified and balloons and captions created while the moving image is being shot, editing the balloons and captions after shooting becomes easier. Likewise, because speakers can be identified and balloons and captions created while a moving image is being input from an external source, editing the balloons and captions after the input becomes easier.

[Other Embodiments]
Embodiments of the present invention have been described in detail above using specific examples. The present invention can be embodied as, for example, a system, an apparatus, a method, a program, or a storage medium (recording medium). Specifically, it may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.

The present invention also includes the case where a software program that implements the functions of the embodiments described above (in the embodiments, a program corresponding to the illustrated functional blocks and flowcharts) is supplied directly or remotely to a system or apparatus, and a computer of that system or apparatus achieves the invention by reading out and executing the supplied program code.

Accordingly, the program code itself that is installed in a computer in order to implement the functional processing of the present invention on that computer also implements the present invention. In other words, the present invention includes the computer program itself for implementing its functional processing.

In that case, as long as it has the functions of the program, it may take the form of object code, a program executed by an interpreter, script data supplied to an OS, or the like.

Recording media (storage media) for supplying the program include, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, an MO, a CD-ROM, a CD-R, a CD-RW, magnetic tape, a nonvolatile memory card, a ROM, and a DVD (DVD-ROM, DVD-R).

As another method of supplying the program, a browser on a client computer can be used to connect to a homepage on the Internet, and the computer program of the present invention itself, or a compressed file that includes an automatic installation function, can be downloaded from that homepage to a recording medium such as a hard disk. The program can also be supplied by dividing the program code that constitutes the program of the present invention into a plurality of files and downloading each file from a different homepage. In other words, a WWW server that allows a plurality of users to download the program files for implementing the functional processing of the present invention on a computer is also included in the present invention.

It is also possible to encrypt the program of the present invention, store it on a storage medium such as a CD-ROM, and distribute it to users; users who satisfy predetermined conditions are then allowed to download key information for decryption from a homepage via the Internet, and by using that key information they can execute the encrypted program and install it on a computer.

The functions of the embodiments described above are implemented when the computer executes the program it has read out. In addition, an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiments described above can also be implemented by that processing.

Furthermore, after the program read out from the recording medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided on that function expansion board or function expansion unit may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiments described above are also implemented by that processing.

FIG. 1 is a block diagram for implementing the automatic balloon creation and editing function of an embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of a video recording/editing apparatus having the automatic balloon creation and editing function shown in FIG. 1.
FIG. 3 is an external view of the video recording/editing apparatus of FIG. 2.
FIG. 4 is a diagram showing, in time series, how video, audio, balloons, captions, and composite images are generated in the automatic balloon creation and editing process.
FIG. 5 is a diagram showing an image (child A) created in the automatic balloon creation and editing process.
FIG. 6 is a diagram showing an image (child A and a balloon) created in the automatic balloon creation and editing process.
FIG. 7 is a diagram showing an image (child A and child B) created in the automatic balloon creation and editing process.
FIG. 8 is a diagram showing an image (child A, child B, and a balloon) created in the automatic balloon creation and editing process.
FIG. 9 is a diagram showing an image (child A, child B, and a caption) created in the automatic balloon creation and editing process.
FIG. 10 is a flowchart showing the voice/face correspondence data registration processing of the automatic balloon creation and editing function.
FIG. 11 is a flowchart showing balloon creation and balloon editing.
FIG. 12 is a diagram showing an example of the display screen used when performing the automatic balloon creation and editing process.
FIG. 13 is a diagram showing an example of the display screen used when performing the voice/face correspondence registration processing of FIG. 10.
FIG. 14 is a diagram showing an example of the edit screen used when performing the balloon editing process.
FIG. 15 is a diagram showing an example of the edit screen used when performing the caption editing process.
FIG. 16 is a diagram showing an example of the display screen used when previewing the result of the automatic balloon creation and editing process.
FIG. 17 is a diagram showing a moving image data editing apparatus provided with software that implements the automatic balloon creation and editing function of an embodiment of the present invention.

Explanation of symbols

100 Automatic balloon creation/editing processing unit
101 Video input section
102 Audio input section
103 Face detector
104 Voice identification part
105 Voice recognition unit
106 Voice / face support
107 Speaker Identification Department
108 Texting Department
109 Positioning part
110 Synchronization part
111 Image data section
112 Callout generator
113 Audio data section
114 Moving image generator
200 Video recording / editing equipment
201 Camera section
202 Video A / D converter
203 Image processing unit
204 Microphone input section
205 Voice A / D converter
206 Audio signal processor
207 Controller
208 Control members
209 Synchronization part
210 Recording device
211 Moving image / balloon composition processing part
212 Speaker
213 display
300 Video recording / editing equipment
301 Shooting button
302 Eyepiece (viewfinder)
303 Photo lens
304 LCD viewfinder, playback screen
305 Operation buttons
406 Speech balloon (child A)
407 Speech balloon (child B)
408 Subtitle Supermarket (C Child)
501 video area
503,504,505 Speaker information
510 audio signal level
511 Voice start time code
512 utterance end time code
513 Current display video time code
514 Operating status
515 Start button
516 Preview button
520 Voice / face registration screen
521 Person name input field
522 Face display area
525 Audio level display area
526 Face image import button
527 Audio capture button
530 Speech balloon property setting screen
541 Speech balloon editing screen
542 Image confirmation area
543 Text display / edit area
544 Slider
549 Voice re-recognition button
550 Audio playback button
551 Confirm button
561 Superimposed subtitle setting screen
562 Image confirmation area
563 Text display / edit area
564 Slider
569 Voice re-recognition button
570 Audio playback button
571 Confirm button
580 Video area
581 Video time code
582 Button to move to the preceding utterance start time code
583 Rewind button
584 Play button
585 Fast-forward button
586 Button to move to the next utterance start time code
588 Time code scale
589 Index image
590 Speech balloon information display
591 Superimposed subtitle information display
592 Slider
600 Personal computer
601 Display device (display)
602 Keyboard
603 Mouse

Claims

An apparatus for creating subtitles from original moving image data including images and sound, comprising:
Face detection means for detecting a facial feature amount from an image portion of the original moving image data;
A voice identifying means for detecting a voice feature amount from a voice portion of the original moving image data;
Speaker identification means for identifying a speaker by comparing the facial feature amount detected by the face detection means and the voice feature amount detected by the voice identifying means with a voice feature amount for identifying the speaker's voice and a facial feature amount for identifying the speaker's face, both prepared in advance;
Position specifying means for specifying the face position of the specified speaker;
Voice recognition means for recognizing a character string from the identified voice of the speaker and generating text data of the character string;
Balloon creating means for creating balloon data for displaying, in the display screen, the text data of the character string uttered by the specified speaker, based on the face position obtained by the position specifying means and the text data generated by the voice recognition means;
A moving image creating means for creating new moving image data by adding the balloon data to the original moving image data;
Wherein the balloon creating means has balloon editing means for displaying a balloon editing screen for editing, for the balloon data created by the balloon creating means, at least one of the balloon shape, color, pattern, and size and the character color, size, and font type corresponding to the specified speaker;
The balloon editing screen includes an image display area for displaying the new moving image data, a text display area for editing the balloon data, a voice recognition operation unit for executing voice recognition by the voice recognition means, and a reproduction operation unit for reproducing the sound; and
When the speaker's voice is recognized by the speaker identification means but the face cannot be recognized, or when the speaker disappears from the display screen, the balloon creating means creates, instead of balloon data corresponding to the speaker's face position, data for displaying only the character string as a superimposed subtitle in the lower area of the display screen.
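
As a non-limiting illustration, the choice between balloon data and a superimposed subtitle described in the final clause above can be sketched as follows. The names Caption and make_caption, the coordinate convention, and the use of None for an unrecognized face are assumptions made for this sketch and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Caption:
    kind: str                          # "balloon" or "subtitle" (hypothetical labels)
    text: str
    anchor: Optional[Tuple[int, int]]  # face position for a balloon, None for a subtitle


def make_caption(text: str,
                 face_pos: Optional[Tuple[int, int]],
                 screen_size: Tuple[int, int]) -> Caption:
    """Choose a balloon or a bottom-of-screen subtitle for one utterance.

    face_pos is None when the voice was identified but no face was recognized.
    A position outside the screen is treated as the speaker having left the frame.
    """
    width, height = screen_size
    if face_pos is None:
        return Caption("subtitle", text, None)
    x, y = face_pos
    if not (0 <= x < width and 0 <= y < height):
        return Caption("subtitle", text, None)
    return Caption("balloon", text, face_pos)
```

In this sketch, a recognized face inside the screen yields balloon data anchored at the face position, and every other case falls back to the subtitle form.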

The apparatus according to claim 1, further comprising synchronization means for synchronizing the image and the sound to manage the period from the start of an utterance to its end,
Wherein the balloon creating means creates the balloon data based on the face position obtained by the position specifying means, the text data generated by the voice recognition means, and the time from the start to the end of the utterance obtained by the synchronization means.

The apparatus according to claim 1, wherein the face detection means detects a face direction and a mouth position from the image portion, and
The position specifying means specifies the face direction and mouth position of the speaker so that the balloon data can be displayed in accordance with the face direction.
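
As a non-limiting illustration, one way of placing a balloon in accordance with the face direction and mouth position is sketched below. The claim does not fix a placement rule; the function name balloon_origin, the direction labels, the offset value, and the rule of placing the balloon on the side the face is turned toward are all assumptions made for this sketch.

```python
from typing import Tuple


def balloon_origin(mouth: Tuple[int, int],
                   face_direction: str,
                   balloon_size: Tuple[int, int],
                   offset: int = 20) -> Tuple[int, int]:
    """Return the top-left corner of a balloon placed relative to the mouth.

    face_direction is one of "left", "right", or "front" (hypothetical labels).
    For a turned face the balloon sits beside the mouth on the side the face
    points toward; for a frontal face it is centered below the mouth.
    """
    mx, my = mouth
    bw, bh = balloon_size
    if face_direction == "left":
        return mx - offset - bw, my - bh // 2
    if face_direction == "right":
        return mx + offset, my - bh // 2
    return mx - bw // 2, my + offset
```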

The apparatus according to claim 1, wherein the balloon creating means changes the balloon size and the character size in accordance with the face position and size specified by the position specifying means.
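
As a non-limiting illustration, the balloon size and character size can be scaled in proportion to the detected face size. The linear scaling rule, the function name scale_balloon, and all default values below are assumptions made for this sketch; the dependence on face position mentioned above is not modeled.

```python
from typing import Tuple


def scale_balloon(face_height: float,
                  base_face_height: float = 100.0,
                  base_balloon: Tuple[int, int] = (200, 100),
                  base_font_px: int = 24,
                  min_font_px: int = 10) -> Tuple[Tuple[int, int], int]:
    """Scale balloon and font size linearly with the detected face height.

    The font size is clamped to min_font_px so that text for small, distant
    faces stays legible.
    """
    ratio = face_height / base_face_height
    width = round(base_balloon[0] * ratio)
    height = round(base_balloon[1] * ratio)
    font_px = max(min_font_px, round(base_font_px * ratio))
    return (width, height), font_px
```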

The apparatus according to claim 2, wherein, when the speaker's face cannot be recognized between the start and the end of the utterance, the balloon creating means creates data for displaying only the character string as a superimposed subtitle in the lower area of the display screen from the point at which recognition becomes impossible.

The apparatus according to claim 2, wherein the balloon creating means tracks the speaker's face with the face detection means between the start and the end of the utterance and, when the speaker's face cannot be recognized, tracks the head and creates the balloon data up to the end of the utterance.

The apparatus according to claim 2, wherein, when the speaker moves within the display screen by more than a preset movement amount between the start and the end of the utterance, the balloon creating means creates data for displaying only the character string as a superimposed subtitle in the lower area of the display screen.
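
As a non-limiting illustration, the movement check can be sketched as below. The claim does not define how the movement amount is measured; this sketch assumes it is the straight-line displacement of the face position from its position at the start of the utterance, and the function name exceeds_movement is hypothetical.

```python
import math
from typing import Sequence, Tuple


def exceeds_movement(positions: Sequence[Tuple[float, float]],
                     max_movement: float) -> bool:
    """Return True if the face moved farther than max_movement from its
    position at the start of the utterance.

    positions holds the tracked face position for each frame of the
    utterance, in order, starting with the utterance-start frame.
    """
    if not positions:
        return False
    x0, y0 = positions[0]
    return any(math.hypot(x - x0, y - y0) > max_movement
               for x, y in positions[1:])
```

When the function returns True, the utterance would be rendered as a subtitle in the lower area of the screen rather than as a balloon that follows the speaker.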

The apparatus according to claim 1, wherein the character string of the balloon data is described in a language including metadata described by text data.
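
As a non-limiting illustration, a character string carried together with text metadata can be serialized in an XML-like form. The claim does not name a concrete language; the element name balloon, its attributes, and the function balloon_to_xml are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def balloon_to_xml(speaker: str, text: str,
                   start_tc: str, end_tc: str,
                   position: Tuple[int, int]) -> str:
    """Serialize one balloon as an XML element whose attributes hold the
    metadata (speaker, utterance time codes, position) and whose body holds
    the recognized character string."""
    elem = ET.Element("balloon", {
        "speaker": speaker,
        "start": start_tc,
        "end": end_tc,
        "x": str(position[0]),
        "y": str(position[1]),
    })
    elem.text = text
    return ET.tostring(elem, encoding="unicode")
```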

The apparatus according to any one of claims 1 to 8, further comprising moving image input means capable of capturing the moving image data or inputting the moving image data from outside.

The apparatus according to claim 9, further comprising recording means for recording moving image data,
Wherein the moving image creating means generates the balloon data from the original moving image data input by the moving image input means and sequentially records, in the recording means, the new moving image data created by adding the balloon data.

A method for creating subtitles from original moving image data including images and sound, comprising:
A face detection step of detecting a feature amount of a face from an image portion of the original moving image data;
A voice identification step of detecting a voice feature amount from a voice portion of the original moving image data;
A speaker identification step of identifying a speaker by comparing the facial feature amount detected in the face detection step and the voice feature amount detected in the voice identification step with a voice feature amount for identifying the speaker's voice and a facial feature amount for identifying the speaker's face, both prepared in advance;
A position specifying step for specifying the face position of the specified speaker;
A speech recognition step of recognizing a character string from the voice of the identified speaker and generating text data of the character string;
A balloon creating step of creating balloon data for displaying, in the display screen, the text data of the character string uttered by the specified speaker, based on the face position obtained in the position specifying step and the text data generated in the voice recognition step;
A moving image creation step of creating new movie data by adding the balloon data to the original movie data,
Wherein the balloon creating step includes a balloon editing step of displaying a balloon editing screen for editing, for the balloon data created in the balloon creating step, at least one of the balloon shape, color, pattern, and size and the character color, size, and font type corresponding to the specified speaker;
The balloon editing screen includes an image display area for displaying the new moving image data, a text display area for editing the balloon data, a voice recognition operation unit for executing voice recognition by the voice recognition step, and a reproduction operation unit for reproducing the sound; and
In the balloon creating step, when the speaker's voice is recognized in the speaker identification step but the face cannot be recognized, or when the speaker disappears from the display screen, data for displaying only the character string as a superimposed subtitle in the lower area of the display screen is created instead of balloon data corresponding to the speaker's face position.

The method according to claim 11, further comprising a synchronization step of synchronizing the image and the sound to manage the period from the start of an utterance to its end,
Wherein, in the balloon creating step, the balloon data is created based on the face position obtained in the position specifying step, the text data generated in the voice recognition step, and the time from the start to the end of the utterance obtained in the synchronization step.

The method according to claim 11, wherein, in the face detection step, a face direction and a mouth position in the image data are detected, and
In the position specifying step, the face direction and mouth position of the speaker are specified so that the balloon data can be displayed in accordance with the face direction.

The method according to claim 11, wherein, in the balloon creating step, the balloon size and the character size are changed in accordance with the face position and size specified in the position specifying step.

The method according to claim 12, wherein, in the balloon creating step, when the speaker's face cannot be recognized between the start and the end of the utterance, data for displaying only the character string as a superimposed subtitle in the lower area of the display screen is created from the point at which recognition becomes impossible.

The method according to claim 12, wherein, in the balloon creating step, the speaker's face is tracked by the face detection step between the start and the end of the utterance and, when the speaker's face cannot be recognized, the head is tracked and the balloon data up to the end of the utterance is created.

The method according to claim 12, wherein, in the balloon creating step, when the speaker moves within the display screen by more than a preset movement amount between the start and the end of the utterance, data for displaying only the character string as a superimposed subtitle in the lower area of the display screen is created.

The method according to claim 11, wherein the character string of the balloon data is described in a language including metadata described by text data.

The method according to any one of claims 11 to 18, further comprising a moving image input step of capturing the moving image data or inputting the moving image data from outside.

The method according to claim 19, further comprising a recording step of recording the moving image data in recording means,
Wherein, in the moving image creation step, the balloon data is generated from the original moving image data input in the moving image input step, and
In the recording step, the new moving image data created by adding the balloon data is sequentially recorded in the recording means.

A program for causing a computer to execute the method according to any one of claims 11 to 20.

A computer-readable storage medium storing the program according to claim 21.