JP2006166407A

JP2006166407A - Imaging device and its control method

Info

Publication number: JP2006166407A
Application number: JP2005249954A
Authority: JP
Inventors: Yasushi Oowa; 寧司大輪
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2004-11-09
Filing date: 2005-08-30
Publication date: 2006-06-22

Abstract

PROBLEM TO BE SOLVED: To provide an imaging device and its control method, wherein a speech in a moving image can be recognized and a frame image suitable for the speech can be selected for display and output. SOLUTION: The imaging device comprises a speech recognizer 105 for recognizing a speech included in the moving image, and an image selector 107 for selecting the frame image of a print candidate based on the end of the speech recognized by the speech recognizer 105. A frame image 108 selected by the image selector 107 and a text image 110 showing the speech recognized by the speech recognizer 105 are composed by an image composition part 111 for output. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to an imaging apparatus capable of reproducing moving images, and to a control method thereof.

Digital cameras capable of shooting and playing back moving images are commercially available. With these functions, a digital camera can record and play back moving images in the same way as a video camera.

Digital cameras also provide a function for transferring captured images to a printer or similar device for printing. For an ordinary still image, the user simply selects the desired image and outputs it to the printer. In a moving image, however, the picture changes continuously, and it is not easy to determine which frame image is best suited for printing.

Patent Document 1 describes detecting discontinuous points (cut points) in a moving image, detecting silent portions of the audio as audio breaks, treating a point that is both an audio break and a cut point as an overall break, and selecting a representative frame for each of the resulting segments for display in a list. Patent Document 2 describes that, when audio information is input together with image information and printing of the image is instructed, the audio portion is subjected to speech recognition, and the recognition result is converted into character codes and expanded into a print image.
Patent Document 1: JP-A-9-214879
Patent Document 2: JP 2000-301806 A

In the conventional techniques described above, however, the choice of which frame image to print from a moving image depends on the user's operation. A frame image selected in this way does not necessarily correspond to the character codes obtained by speech recognition, and it has not been easy for the user to select the desired frame image from the moving image.

An object of the present invention is to solve the above-described problems of the prior art.

A feature of the present invention is to provide an imaging apparatus, and a control method thereof, capable of recognizing speech in a moving image and selecting a frame image suited to that speech for display and output.

Another feature of the present invention is to provide an imaging apparatus, and a control method thereof, capable of selecting a frame image using speech input by the user as a search keyword, combining the selected frame image with an image representing the corresponding speech, and printing the result.

An imaging apparatus according to one aspect of the present invention has the following configuration. That is,
an imaging apparatus capable of reproducing a moving image comprises:
recognition means for recognizing speech contained in the moving image;
selection means for selecting a frame image as a print candidate based on a break in the speech recognized by the recognition means; and
output control means for combining the frame image selected by the selection means with text information representing the speech recognized by the recognition means, and outputting the result.

An imaging apparatus according to one aspect of the present invention has the following configuration. That is,
an imaging apparatus capable of reproducing a moving image comprises:
speech recognition means for performing speech recognition on input speech information;
recognition means for recognizing speech contained in the moving image;
comparison means for comparing the speech information recognized by the speech recognition means with the speech information recognized by the recognition means;
extraction means for extracting a frame image for which the comparison means determines a match; and
output control means for combining the frame image extracted by the extraction means with text information representing the speech information, and outputting the result.

An imaging apparatus according to one aspect of the present invention has the following configuration. That is,
an imaging apparatus capable of reproducing a moving image comprises:
recognition means for recognizing speech contained in the moving image;
selection means for selecting a frame image as a print candidate based on a break in the speech recognized by the recognition means;
image selection means for determining whether to divide the text information representing the speech recognized by the recognition means and, if it is to be divided, selecting a frame image corresponding to the speech that corresponds to the divided text; and
output control means for, when the text information is not divided, combining the frame image selected by the selection means with the text information representing the speech recognized by the recognition means and outputting the result, and, when the text information is divided, combining the frame image selected by the image selection means with the text information corresponding to the divided text and outputting the result.

A control method for an imaging apparatus according to one aspect of the present invention has the following steps. That is,
a method for controlling an imaging apparatus capable of reproducing a moving image comprises:
a recognition step of recognizing speech contained in the moving image;
a selection step of selecting a frame image as a print candidate based on a break in the speech recognized in the recognition step; and
an output control step of combining the frame image selected in the selection step with text information representing the speech recognized in the recognition step, and outputting the result.

A control method for an imaging apparatus according to one aspect of the present invention has the following steps. That is,
a method for controlling an imaging apparatus capable of reproducing a moving image comprises:
a speech recognition step of performing speech recognition on input speech information;
a recognition step of recognizing speech contained in the moving image;
a comparison step of comparing the speech information recognized in the speech recognition step with the speech information recognized in the recognition step;
an extraction step of extracting a frame image for which a match is determined in the comparison step; and
an output control step of combining the frame image extracted in the extraction step with text information representing the speech information, and outputting the result.

A control method for an imaging apparatus according to one aspect of the present invention has the following steps. That is,
a method for controlling an imaging apparatus capable of reproducing a moving image comprises:
a recognition step of recognizing speech contained in the moving image;
a selection step of selecting a frame image as a print candidate based on a break in the speech recognized in the recognition step;
an image selection step of determining whether to divide the text information representing the speech recognized in the recognition step and, if it is to be divided, selecting a frame image corresponding to the speech that corresponds to the divided text; and
an output control step of, when the text information is not divided, combining the frame image selected in the selection step with the text information representing the speech recognized in the recognition step and outputting the result, and, when the text information is divided, combining the frame image selected in the image selection step with the text information corresponding to the divided text and outputting the result.

According to the present invention, speech in a moving image can be recognized, and a frame image suited to that speech can be selected for display and output.

The present invention also has the effect that a frame image can be selected using speech input by the user as a search keyword, and the selected frame image can be combined with an image representing the corresponding speech and printed.

Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings. The following embodiments do not limit the invention as defined in the claims, and not all combinations of the features described in the embodiments are essential to the solution provided by the invention.

FIG. 1 is a functional block diagram showing the functional configuration of an imaging apparatus according to an embodiment of the present invention. In this embodiment, the imaging apparatus is described as being, for example, a digital camera or a digital video camera.

In the figure, a moving image file 101 is stored on a storage medium such as a memory card or tape. A moving image file demultiplexer 102 separates the audio signal and the image signal from the stored moving image file, for example an MPEG file. An image decoder 103 receives and decodes the image signal separated by the demultiplexer 102. An audio decoder 104 decodes the audio signal separated by the demultiplexer 102 and outputs it to a speech recognition unit 105, which performs speech recognition on the decoded audio signal. The recognition result is sent to a sentence/phrase recognition unit 106, which detects sentence and phrase breaks. Based on the detected breaks, an image selection unit 107 selects, from the frame images decoded by the image decoder 103, the frame image corresponding to each sentence or phrase break. The selection criteria used by the image selection unit 107 are described in detail later. The frame image selected in this way becomes a selected image 108.

Meanwhile, each sentence extracted by the sentence/phrase recognition unit 106 is converted into text data by a text conversion unit 109. The text data is then rendered as a pattern using the character font corresponding to each character code, yielding a text image 110. An image composition unit 111 combines the text image 110 with the selected image 108. The composite image is displayed, or sent to a printer 112 and printed. A microphone 113 is used to input acoustic signals such as speech.

FIG. 2 shows an example of an image printed by the printer 112 in this way.

Here, the text image 110 is combined with the selected image 108 in the form of a speech balloon 200 and printed.

FIG. 3 is a block diagram showing the configuration of the imaging apparatus according to this embodiment.

A control unit 301 supervises the overall operation of the imaging apparatus. It includes a CPU 330, a ROM 331 that stores the control program executed by the CPU 330, and a RAM 332 that temporarily holds various data while the CPU 330 executes control processing. The processing controlled by the control unit 301 includes: capturing an image with an imaging unit 303 in response to a trigger such as a shutter press from an input unit 302, and storing the image data; transmitting captured image data to an external device via a wired communication unit 304; displaying images captured by the imaging unit 303 and stored on a storage medium 313; and displaying the image currently being captured by the imaging unit 303 on a display unit 306.

The imaging unit 303 includes an image sensor (CCD), an image buffer memory, and the like. A captured image is first held in the buffer memory and then, under the control of the control unit 301, stored on the storage medium 313 such as a memory card, or transmitted externally via the wired communication unit 304 or a wireless communication unit 308. The input unit 302 includes switches and buttons operated by the user, and is used for UI operations and for operating the power supply, shutter, camera zoom, and so on. A battery 309 supplies power to the entire imaging apparatus. Depending on the situation, power may be supplied to an external device via a connector 305, or received from an external source. A power supply monitoring unit 310 measures the capacity of the battery 309 and reports it to the control unit 301.

The wired communication unit 304 acts as the controller for the connector 305; it also monitors the connection status of the connector 305 and reports the result to the control unit 301. The connector 305 consists of a plurality of connection terminals and is used to exchange control signals and transmit image data with other terminals (for example, a PC or a printer). The connector 305 also includes power lines. The wireless communication unit 308 provides wireless communication functions: it modulates signals received from the control unit 301 into RF signals and transmits them, and it demodulates received RF signals and passes them to the control unit 301. A speech processing unit 311 processes the audio signal input from the microphone 113, performs speech recognition, and outputs the result. The microphone 113 is also used to input acoustic signals during moving image recording. A storage/reproduction control unit 312 controls writing data to and reading data from the storage medium 313. An audio signal reproduced from the storage medium is sent to the speech processing unit 311 for speech recognition.

[Embodiment 1]
FIG. 4 is a flowchart explaining processing by the imaging apparatus according to Embodiment 1 of the present invention. The program that executes this processing is stored in the ROM 331 and is executed under the control of the CPU 330.

First, in step S1, a selection time stamp is determined. This sets the criterion that decides which image is chosen from among the candidate images.

FIG. 5 shows an example of the menu screen displayed on the display unit 306 when choosing the selection time stamp. In each item below, the image period is the period corresponding to a sentence or phrase whose break has been detected by speech recognition.

(1) Select the middle image of the target period.
The frame image at approximately the center of the image period is selected.

(2) Select the image a fixed time after the start of the target period.
The image a fixed time after the start of the image period is selected.

(3) Select the image at the first voiced point in the target period.
The image at the first point in the image period where sound is present is selected.

(4) Select the image with the smallest motion vector.
Within the image period, the image with the smallest motion vector is selected. In this case, the selection may, for example, be restricted to intra frames.

(5) Select the optimal image based on shooting parameters.
Within the image period, an image whose shooting parameters are appropriate is selected. Whether the shooting parameters are appropriate can be judged, for example, by analyzing the luminance distribution in the image to determine whether the exposure is good, whether the white balance is correct, or whether the focusing is optimal (that is, whether the image is out of focus).

Any one of these five items can be selected using the input unit 302.
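The five menu items can be read as five interchangeable selection functions applied to the frames of one image period. The following is a minimal Python sketch under stated assumptions, not the apparatus's actual implementation: the `Frame` fields (`voiced`, `motion`, `quality`), the 0.5-second default offset, and every function name are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Frame:
    t: float        # presentation time in seconds
    voiced: bool    # assumed flag: sound is present at this frame
    motion: float   # assumed metric: magnitude of the frame's motion vector
    quality: float  # assumed 0..1 score for exposure, white balance, focus


def pick_middle(frames: List[Frame]) -> Frame:
    # Item (1): frame at roughly the center of the period.
    return frames[len(frames) // 2]


def pick_after_offset(frames: List[Frame], offset: float = 0.5) -> Frame:
    # Item (2): first frame at least `offset` seconds after the period start;
    # falls back to the last frame if the period is shorter than the offset.
    target = frames[0].t + offset
    return next((f for f in frames if f.t >= target), frames[-1])


def pick_first_voiced(frames: List[Frame]) -> Frame:
    # Item (3): first frame where sound is present; falls back to the first frame.
    return next((f for f in frames if f.voiced), frames[0])


def pick_min_motion(frames: List[Frame]) -> Frame:
    # Item (4): frame with the smallest motion vector.
    return min(frames, key=lambda f: f.motion)


def pick_best_quality(frames: List[Frame]) -> Frame:
    # Item (5): frame whose shooting parameters score best.
    return max(frames, key=lambda f: f.quality)


SELECTORS: Dict[int, Callable[[List[Frame]], Frame]] = {
    1: pick_middle,
    2: pick_after_offset,
    3: pick_first_voiced,
    4: pick_min_motion,
    5: pick_best_quality,
}


def select_frame(frames: List[Frame], menu_item: int) -> Frame:
    """Apply the criterion chosen on the menu to one image period."""
    if not frames:
        raise ValueError("the image period contains no frames")
    return SELECTORS[menu_item](frames)
```

Keeping the criteria in a lookup table mirrors the menu: the user's choice is stored once and then applied to every sentence or phrase period.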

Next, in step S2, the chosen selection time stamp is stored in the RAM 332. In step S3, a moving image is input, and in step S4 the speech contained in the input moving image is recognized. In step S5, phrase and sentence breaks are recognized based on the speech recognition result, and in step S6 the recognized phrase or sentence is converted to text. In step S7, the frame image closest to the selection time stamp chosen in step S1 and stored in step S2 is selected. In step S8, the character image of the text produced in step S6 is combined with the frame image selected in step S7, and in step S9 the composite image is stored in the RAM 332. In step S10, it is checked whether the end of the moving image has been reached. If not, the process proceeds to step S11, where it is determined whether the selection time stamp is to be changed. If it is not to be changed, the process returns to step S3 and the next moving image is input; if it is to be changed, the process returns to step S1 and the processing described above is repeated.
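The loop of FIG. 4 can be sketched as follows. This is an illustrative Python outline only, assuming that speech recognition has already produced a list of sentence or phrase periods; `Segment`, `build_print_candidates`, and `midpoint` are hypothetical names, and a `(frame_time, text)` pair stands in for the composite image.

```python
from typing import Callable, List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: float  # start of the sentence/phrase period, in seconds
    end: float    # end of the period, in seconds
    text: str     # recognized speech converted to text


def midpoint(seg: Segment) -> float:
    # One possible selection time stamp: the center of the period.
    return (seg.start + seg.end) / 2


def build_print_candidates(
    frame_times: List[float],
    segments: List[Segment],
    time_stamp: Callable[[Segment], float] = midpoint,
) -> List[Tuple[float, str]]:
    """Pair each recognized sentence/phrase with the frame nearest the time stamp."""
    candidates = []
    for seg in segments:
        in_period = [t for t in frame_times if seg.start <= t < seg.end]
        if not in_period:
            continue  # no frame falls inside this period
        target = time_stamp(seg)
        # Corresponds to step S7: frame closest to the selection time stamp.
        frame_t = min(in_period, key=lambda t: abs(t - target))
        # Corresponds to steps S8 and S9: combine and store (here, just pair them).
        candidates.append((frame_t, seg.text))
    return candidates
```

For example, calling `build_print_candidates` with frame times sampled every 0.4 seconds and two recognized sentences yields one candidate per sentence, which the user would then review on the display.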

If step S10 finds that the end of the moving image has been reached, the process proceeds to step S12, and the images composed in step S8 are displayed on the display unit 306. If there are multiple composite images, multiple images are displayed on the display unit 306. Then, in step S13, the image of the frame selected by the user is output to the printer 112 and printed.

As described above, Embodiment 1 makes it possible to select an optimal image from the moving image playback period and to print or display it.

Furthermore, the content of the speech associated with the selected image can be combined with that image as a text balloon, as shown in FIG. 2, and then output or printed.

[Embodiment 2]
Embodiment 2 of the present invention is described next. In Embodiment 2, the user speaks words contained in the moving image into the microphone 113 as the selection item for choosing a desired image. The image at the moment when speech matching the input speech is uttered is then determined as the selected image.

FIG. 6 is a flowchart explaining processing in the imaging apparatus according to Embodiment 2. The program that executes this processing is stored in the ROM 331 of the control unit 301 and is executed under the control of the CPU 330.

First, playback of a moving image starts in step S21, and in step S22 the user's speech is input from the microphone 113. Next, in step S23, the input speech is recognized. In step S24 the audio of the moving image is recognized, and in step S25 the two recognition results are compared. If the comparison finds no matching speech, the process returns to step S21 and the same processing is repeated for the next moving image.

When matching speech is found in step S25, the process proceeds to step S26, and the frame image at that point in time is selected. Once the selected image has been determined, the speech is converted to text, and in step S27 the text data is combined with the selected frame image. This produces an image in which characters are added to the picture in a balloon, as shown for example in FIG. 2. In step S28, the composite image is displayed on the display unit 306 and the user is asked whether it is acceptable. If the user enters OK, the process proceeds to step S29 and the composite image is fixed as the image to be printed or output; if OK is not entered, the process returns to step S21 and the same operation is performed again.
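The comparison at the heart of Embodiment 2 can be illustrated with a small Python sketch. It assumes that both the user's speech and the audio of the moving image have already been converted to text, and it treats a match as a normalized substring match; the source does not specify how the comparison is performed, so this matching rule and the names `normalize` and `find_matching_frame` are assumptions.

```python
from typing import List, Optional, Tuple


def normalize(text: str) -> str:
    # Ignore case, spaces, and punctuation when comparing recognized speech.
    return "".join(ch for ch in text.casefold() if ch.isalnum())


def find_matching_frame(
    spoken: str,
    video_speech: List[Tuple[float, str]],
) -> Optional[Tuple[float, str]]:
    """Return (time, text) of the first recognized utterance matching the keyword.

    `video_speech` holds (time in seconds, recognized text) pairs for the movie.
    Returns None when nothing matches, in which case the search would continue.
    """
    query = normalize(spoken)
    if not query:
        return None
    for time_s, text in video_speech:
        if query in normalize(text):
            return time_s, text  # the frame at this time becomes the selected image
    return None
```

The returned text is what would be rendered in the balloon, and the returned time identifies the frame image to combine it with.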

[Modification of Embodiment 2]
FIG. 7 is a flowchart explaining a modification of the processing in the imaging apparatus according to Embodiment 2. The program that executes this processing is stored in the ROM 331 of the control unit 301 and is executed under the control of the CPU 330. As in Embodiment 2, this modification uses speech input by the user as the search keyword when selecting a frame image for a photograph. It differs from Embodiment 2 in two respects. First, the frame image is selected using the selection time stamp of Embodiment 1. Second, when the moving image is played back to select the desired frame image, either the entire moving image is played back, or playback is performed for each image period (corresponding to a phrase or sentence) that contains a composite image extracted in step S12 of Embodiment 1.

First, in step S31, the moving image to be played back and the selection time stamp are chosen. The selection time stamp is the same as in Embodiment 1. When choosing the moving image, the user can specify either that the entire moving image be played back, or that the image periods containing the frame images extracted in step S12 described above be played back. Once the moving image to be searched has been chosen, the process proceeds to step S32, where the user inputs speech. To do so, the user sets the digital camera to a predetermined mode and then speaks into the microphone 113 while holding down a specific button (for example, a function key). The user's speech is recognized in step S33, and the result is stored in the RAM 332. Next, in step S34, playback of the moving image designated in step S31 starts. In step S35, the audio information contained in the moving image is recognized in the same way as in step S4 of FIG. 4. Next, in step S36, it is determined whether the speech information recognized and stored in step S33 matches the result recognized in step S35, and whether a frame image with matching speech information can be selected based on the selection time stamp specified in step S31. If a frame image satisfying these conditions can be selected, the process proceeds to step S37; otherwise, the process returns to step S34 and the processing described above is repeated.

When a frame image has been selected in step S36, the speech information is converted to text in step S37, and the text data is rendered as a character image using a character font. In step S38, the character image is combined with the frame image, and the composite image is displayed on the display unit 306. In step S39, the user judges whether the composite image is satisfactory. If the user is satisfied and enters "OK", the process proceeds to step S40, and the composite image is output to the printer 112 and printed. If the user is not satisfied, the process returns to step S31, and the moving image and the selection time stamp are specified again.

In step S38, a plurality of candidate frame images close to the specified selection time stamp may instead be extracted and displayed, so that the user can choose the desired frame image from among them.
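Extracting several candidate frames near the selection time stamp amounts to a nearest-neighbor lookup over frame times. The sketch below is one possible reading in Python; the candidate count of three and the function name `nearest_candidates` are assumptions, since the source does not say how many candidates are shown.

```python
import heapq
from typing import List


def nearest_candidates(frame_times: List[float], target: float, count: int = 3) -> List[float]:
    """Return up to `count` frame times closest to `target`, in playback order."""
    closest = heapq.nsmallest(count, frame_times, key=lambda t: abs(t - target))
    return sorted(closest)
```

Returning the candidates in playback order lets them be laid out left to right on the display so the user can compare adjacent moments.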

[Embodiment 3]
Embodiment 3 of the present invention is described next. In Embodiment 3, sentences or phrases are divided as a means of selecting and printing the best combination of speech and image. This allows the recognized text to be printed or displayed together with an image as a sentence or phrase with a moderate number of characters.

図８は、動画フレームと音声との関係の一例を説明する図である。 FIG. 8 is a diagram for explaining an example of the relationship between a moving image frame and audio.

８０１〜８０６は、動画を構成している一連の画像フレームを示している。８０７，８０８は、これら一連の画像フレーム（８０１〜８０６）に同期した音声を示している。前述の実施の形態の場合、これら一連の音声は、文章或は文節として、８０７，８０８で示すように音声認識される。更に、それぞれの区間から画像フレーム８０９，８１０が、指定された選択タイムスタンプに基づいて選択される。 Reference numerals 801 to 806 denote a series of image frames constituting a moving image. Reference numerals 807 and 808 denote sounds synchronized with the series of image frames (801 to 806). In the case of the above-described embodiment, these series of voices are recognized as sentences or phrases as shown by 807 and 808. Furthermore, image frames 809 and 810 are selected from the respective sections based on the designated selection time stamp.

図９は、音声「みててね」に対応するフレームとして画像フレーム８０１が選択され、その画像に音声「みててね！」を示すテキストを含む吹き出し９０１が付された画像が印刷・出力されている。このように音声をテキスト化し、その音声を示す吹き出しと、選択された画像フレームとが合成された画像が印刷されている。 FIG. 9 shows a printed output in which image frame 801 has been selected as the frame corresponding to the voice "Mitete ne" ("Watch me") and a balloon 901 containing text for the voice "Mitete ne!" has been added to that image. In this way, the voice is converted into text, and an image combining a balloon showing the voice with the selected image frame is printed.

特に本実施の形態３では、図１０に示す印刷画像のテキスト部分を再分割する場合で説明する。 In particular, the third embodiment will be described in the case where the text portion of the print image shown in FIG. 10 is subdivided.

図１１は、本実施の形態３において、音声８０８を表すテキストを分割する例を説明する図で、前述の図８及び図９と共通する部分は同じ記号で示している。 FIG. 11 is a diagram for explaining an example of dividing the text representing the voice 808 in the third embodiment, and the same parts as those in FIGS. 8 and 9 are denoted by the same symbols.

９１０，９１１は、音声８０８を表すテキストを２分割したテキストを示している。このようなテキストの分割は、操作者の指示によって行われても良く、或は自動的に行っても良い。 Reference numerals 910 and 911 denote texts obtained by dividing the text representing the voice 808 into two parts. Such text division may be performed according to an instruction from an operator or may be performed automatically.

図１２及び図１３は、本発明の実施の形態３に係る撮像装置による処理を説明するフローチャートである。 FIGS. 12 and 13 are flowcharts explaining processing by the imaging apparatus according to the third embodiment of the present invention.

図１２において、まずステップＳ４１で動画を入力し、ステップＳ４２で、その動画と共に録音されている音声を音声処理部３１１で認識する。次にステップＳ４３で、その認識した音声をテキスト（文章／文節）で表し、ステップＳ４４で、その認識結果をテキストデータに変換する。次にステップＳ４５では、予め設定されたタイムスタンプの選択方法に基づいて、その認識した音声に対応している動画の中から、所定の画像フレームを選択する。次にステップＳ４６で、ステップＳ４４でテキスト化されたテキスト画像と、ステップＳ４５で選択した画像フレームの画像とを合成する。次にステップＳ４７で、ステップＳ４６で合成した画像と、テキスト化された文章／文節データとを記憶する。そしてステップＳ４８で、動画の最後のフレームかどうかを判断し、最後のフレームでない場合はステップＳ４１に戻り、前述の処理を繰り返す。 In FIG. 12, a moving image is first input in step S41, and in step S42 the audio recorded with the moving image is recognized by the audio processing unit 311. Next, in step S43, the recognized voice is expressed as text (sentences/phrases), and in step S44 the recognition result is converted into text data. In step S45, a predetermined image frame is selected from the portion of the moving image corresponding to the recognized voice, based on a preset time stamp selection method. In step S46, the text image generated from the text of step S44 is combined with the image of the frame selected in step S45. In step S47, the image combined in step S46 and the sentence/phrase text data are stored. In step S48, it is determined whether the last frame of the moving image has been reached; if not, the process returns to step S41 and the above processing is repeated.
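The frame selection in step S45 depends on the configured time stamp selection method. A minimal sketch is given below, assuming four methods that mirror the options named in the claims (first sound, approximate center, a predetermined time after the beginning, and smallest motion vector). The function name, the method labels, and the `fps`, `offset_s`, and `motion` parameters are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def select_frame(first: int, last: int, method: str,
                 fps: float = 30.0, offset_s: float = 0.5,
                 motion: Sequence[float] = ()) -> int:
    """Pick one frame index from the period [first, last] of a recognized phrase."""
    if last < first:
        raise ValueError("empty frame period")
    if method == "start":
        # Frame at the time of the first sound of the phrase.
        return first
    if method == "center":
        # Frame at (approximately) the center of the period.
        return first + (last - first) // 2
    if method == "offset":
        # Frame a fixed time after the beginning, clamped to the period.
        return min(first + round(offset_s * fps), last)
    if method == "min_motion":
        # Frame with the smallest motion-vector magnitude; `motion` must hold
        # one magnitude per frame in the period (assumed to be precomputed).
        if len(motion) != last - first + 1:
            raise ValueError("motion must have one value per frame")
        best = min(range(len(motion)), key=lambda i: motion[i])
        return first + best
    raise ValueError(f"unknown method: {method}")
```

Keeping the method as a parameter means that changing it later only requires calling the function again for the same period, while the recognized text is left unchanged.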

ステップＳ４８で、動画の最後のフレームと判断した場合はステップＳ４９に進み、ステップＳ４７で作成されて記憶された合成画像を読み出し、その一覧を表示部３０６に表示する。次にステップＳ５０で、操作者に各合成画像はこのままで良いか、或は編集をしたいかを問合せて処理を選択させる。このままで良い場合はステップＳ５１に進み、その選択された合成画像を印刷する。一方、ステップＳ５０で、その合成画像を編集するように選択した場合はステップＳ６１（図１３）に進む。 If it is determined in step S48 that the frame is the last frame of the moving image, the process proceeds to step S49, the composite image created and stored in step S47 is read, and the list is displayed on the display unit 306. Next, in step S50, the operator is inquired whether each composite image can be left as it is, or whether to edit the composite image. If this is acceptable, the process proceeds to step S51, and the selected composite image is printed. On the other hand, if it is selected in step S50 to edit the composite image, the process proceeds to step S61 (FIG. 13).

図１３は、編集処理を説明するフローで、まずステップＳ６１で、入力部３０２を使用して、表示部３０６に表示された合成画像の一覧から編集対象の画像が選択される。次にステップＳ６２で、その選択した合成画像の内の一つの合成画像を表示してステップＳ６３に進む。ステップＳ６３では、編集を終了するか（ＯＫ）、或はテキストである文章／文節を分割するか、また或はタイムスタンプ選択方法を変更するかが、入力部３０２からの指示により選択される。ここで編集を終了するように指示されるとステップＳ４９（図１２）に進み、合成画像の一覧表示を行う。 FIG. 13 is a flowchart explaining the editing process. First, in step S61, an image to be edited is selected via the input unit 302 from the list of composite images displayed on the display unit 306. Next, in step S62, one of the selected composite images is displayed, and the process proceeds to step S63. In step S63, an instruction from the input unit 302 selects whether to end editing (OK), to divide the sentence/phrase text, or to change the time stamp selection method. If ending the edit is instructed, the process returns to step S49 (FIG. 12), and the list of composite images is displayed.

一方、文章／文節の再分割を行う場合はステップＳ６４に進み、選択された合成画像のテキストに対応する文章／文節を分割する。次にステップＳ６５で、その分割された各々の対象区間の動画より、設定されているタイムスタンプ選択方法に基づいて、合成画像の基になる画像フレームを再び選択する。次にステップＳ６６で、ステップＳ６４で分割されたテキストと、ステップＳ６５で再選択された画像フレームとを合成する。そしてステップＳ６７で、その合成した画像と、分割されたテキストを記憶する。そしてステップＳ６２に進み、その分割したテキストと再合成した画像を表示する。 If the sentence/phrase is to be re-divided, the process proceeds to step S64, where the sentence/phrase corresponding to the text of the selected composite image is divided. Next, in step S65, the image frame on which each composite image is based is selected again from the moving image of each divided target section, according to the currently set time stamp selection method. In step S66, the text divided in step S64 is combined with the image frames reselected in step S65. In step S67, the combined images and the divided text are stored. The process then returns to step S62, where the divided text and the recombined images are displayed.

またステップＳ６３で、タイムスタンプの選択方法の変更が選択された場合はステップＳ６８に進み、タイムスタンプの選択方法の選択画面を表示する。次にステップＳ６９で選択方法が変更されると、それを記憶する。次にステップＳ７０で、その音声の対象となる動画区間のフレームより、新たに設定されたタイムスタンプ選択方法に基づいて、合成対象の画像フレームを選択する。そしてステップＳ７１で、その選択されたフレームと、音声を示すテキストとを再合成する。このときテキストは、画像フレームを再度選択する前と同じテキストである。そしてステップＳ７２で、その合成画像とテキストとを記憶する。そしてステップＳ６２に進み、その再合成した画像を表示部３０６に表示する。 If changing the time stamp selection method is selected in step S63, the process proceeds to step S68, where a selection screen for the time stamp selection method is displayed. When the selection method is changed in step S69, the new method is stored. Next, in step S70, the image frame to be combined is selected from the frames of the moving image section corresponding to the voice, according to the newly set time stamp selection method. In step S71, the selected frame is recombined with the text representing the voice. The text here is the same as before the image frame was reselected. In step S72, the composite image and the text are stored. The process then returns to step S62, where the recombined image is displayed on the display unit 306.

図１４は、本実施の形態３に係る一連の処理（図１２，図１３）によって分割、選択された画像フレームの一例を示す図で、前述の図８及び図１１と共通する部分は同じ記号で示している。 FIG. 14 is a diagram showing an example of image frames divided and selected by the series of processes (FIGS. 12 and 13) according to the third embodiment; parts common to FIGS. 8 and 11 are denoted by the same symbols.

ここで９１０，９１１は、音声を示す文章及び文節（「お父さんお父さん乗れたよ」）を再分割をした結果を示している。ここでは音声８０８を表すテキストが、テキスト９１０と９１１に２分割されている。更に、これら分割されたテキスト９１０，９１１の各々に対応する一連の画像フレーム８０３，８０４、及び８０５，８０６より、各テキスト９１０，９１１に対応するフレームとして画像フレーム８０４および８０５がそれぞれ選択されている。 Here, 910 and 911 show the result of re-dividing the sentence/phrase representing the voice ("Otosan otosan noreta yo", roughly "Dad, Dad, I got on!"). The text representing the voice 808 is divided into two texts, 910 and 911. Further, from the series of image frames 803, 804 and 805, 806 corresponding to the divided texts 910 and 911, image frames 804 and 805 are selected as the frames corresponding to the texts 910 and 911, respectively.
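The re-division of one phrase into two texts, each with its own frame period, can be sketched as below. The text does not state how the frame period of each divided text is determined, so this sketch assumes the period is shared out in proportion to character counts. The function name `split_segment` and the placeholder text are hypothetical.

```python
from typing import List, Tuple


def split_segment(text: str, first: int, last: int,
                  cut: int) -> List[Tuple[str, int, int]]:
    """Split `text` at character index `cut` and divide frames [first, last]."""
    if not 0 < cut < len(text):
        raise ValueError("cut must fall strictly inside the text")
    n_frames = last - first + 1
    if n_frames < 2:
        raise ValueError("need at least two frames to split")
    # Assumption: the first part gets a share of frames proportional to its
    # character count, with at least one frame left for each part.
    head = max(1, min(n_frames - 1, round(n_frames * cut / len(text))))
    return [
        (text[:cut], first, first + head - 1),
        (text[cut:], first + head, last),
    ]
```

With a placeholder eight-character text over frames 803 to 806, a cut in the middle yields the periods 803 to 804 and 805 to 806, which matches the grouping shown in FIG. 14. A frame would then be reselected within each period according to the configured selection method.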

図１５及び図１６は、画像フレーム８０４，８０５の夫々に対応して、各分割されたテキスト９１０，９１１を合成して印刷した画像例を示す図である。この画像データは、各音声に対応するテキストを含む吹き出し１５０１，１５０２により、音声８０８が表示されている。 FIGS. 15 and 16 are diagrams showing examples of images printed by combining the divided texts 910 and 911 with the image frames 804 and 805, respectively. In these images, the voice 808 is shown by balloons 1501 and 1502 containing the text corresponding to each part of the voice.

以上は文章／文節の再分割を編集フローにて操作者に判断させる例について説明したが、一枚の画像フレーム当たりの文章／文節の文字数、テキスト合成画像上の文字画像のレイアウト上の領域のサイズなどに基づいて、文章／文節の再分割、該当区間の画像フレームの再選択、画像の再合成処理を合成画像が最適化されるまで自動的に行うようにしても良い。 The above describes an example in which the operator decides in the editing flow whether to re-divide a sentence/phrase. Alternatively, re-division of the sentence/phrase, reselection of the image frame for the corresponding section, and recombination of the image may be performed automatically until the composite image is optimized, based on factors such as the number of sentence/phrase characters per image frame and the size of the area available for laying out the character image on the composite image.

この場合は、文字をレイアウトできる領域のサイズに応じて、分割する文節の大きさを変えるようにするとよい。このためには、主要被写体を認識して、それ以外の領域の大きさに応じて、文字をレイアウトする吹き出し領域を設定すれば良い。この吹き出しにレイアウトする文字数によって、文節の分割位置が設定できる。これにより、吹き出し内の文字が認識しやすくなり、読みやすいテキストを含む合成画像を作成できる。 In this case, it is preferable to change the size of the segment to be divided according to the size of the area where the character can be laid out. For this purpose, it is only necessary to recognize a main subject and set a balloon area for laying out characters according to the size of the other area. The segment division position can be set according to the number of characters to be laid out in the balloon. This makes it easier to recognize the characters in the balloon and create a composite image that includes easy-to-read text.
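One way to picture the automatic division is a greedy grouping of phrases so that each balloon text stays within the number of characters the balloon can hold. The sketch below is an assumption-laden illustration: `balloon_capacity`, `split_to_fit`, the fixed square character cell, and the greedy strategy are not taken from the text, and main-subject detection is left out entirely.

```python
from typing import List


def balloon_capacity(width_px: int, height_px: int, char_px: int) -> int:
    """Characters that fit in a balloon, assuming square character cells."""
    return (width_px // char_px) * (height_px // char_px)


def split_to_fit(phrases: List[str], max_chars: int) -> List[str]:
    """Greedily group consecutive phrases into texts of at most max_chars.

    A single phrase longer than max_chars is kept whole rather than cut
    in the middle of a phrase.
    """
    chunks: List[str] = []
    current = ""
    for phrase in phrases:
        if current and len(current) + len(phrase) > max_chars:
            chunks.append(current)   # close the current balloon text
            current = phrase
        else:
            current += phrase
    if current:
        chunks.append(current)
    return chunks
```

Each resulting text would then get its own frame period and reselected frame, as in the manual editing flow.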

図１７は、本実施の形態に係る文章／文節の分割を自動で行うか否かを選択する際に表示部３０６に表示される設定画面例を示す図である。 FIG. 17 is a diagram showing an example of the setting screen displayed on the display unit 306 when selecting whether sentences/phrases are to be divided automatically according to the present embodiment.

図では、自動分割モードが設定された（「ＯＮ」が選択された）状態を示している。 The figure shows a state in which the automatic division mode is set (“ON” is selected).

以上、本発明の実施の形態を詳述したが、本発明は、複数の機器から構成されるシステムに適用しても良いし、または一つの機器からなる装置に適用しても良い。 Although the embodiments of the present invention have been described in detail above, the present invention may be applied to a system composed of a plurality of devices or an apparatus composed of a single device.

なお本発明は、前述した実施の形態の機能を実現するソフトウェアのプログラムを、システム或いは装置に直接或いは遠隔から供給し、そのシステム或いは装置のコンピュータが、その供給されたプログラムコードを読み出して実行することによっても達成される場合を含む。その場合、プログラムの機能を有していれば、その形態はプログラムである必要はない。従って、本発明の機能処理をコンピュータで実現するために、該コンピュータにインストールされるプログラムコード自体も本発明を実現するものである。つまり、本発明には、本発明の機能処理を実現するためのコンピュータプログラム自体も含まれる。その場合、プログラムの機能を有していれば、オブジェクトコード、インタプリタにより実行されるプログラム、ＯＳに供給するスクリプトデータ等、プログラムの形態を問わない。 The present invention also includes the case where it is achieved by supplying a software program that realizes the functions of the above-described embodiments to a system or apparatus, directly or remotely, and having the computer of that system or apparatus read and execute the supplied program code. In that case, the form need not be a program as long as it has the functions of a program. Accordingly, the program code itself that is installed on a computer in order to realize the functional processing of the present invention on that computer also realizes the present invention. That is, the present invention includes the computer program itself for realizing the functional processing of the present invention. In that case, the program may take any form, such as object code, a program executed by an interpreter, or script data supplied to an OS, as long as it has the functions of a program.

プログラムを供給するための記憶媒体としては、例えば、フロッピー（登録商標）ディスク、ハードディスク、光ディスク、光磁気ディスク、ＭＯ、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＣＤ−ＲＷ、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＤＶＤ（ＤＶＤ−ＲＯＭ，ＤＶＤ−Ｒ）などがある。その他のプログラムの供給方法としては、クライアントコンピュータのブラウザを用いてインターネットのホームページに接続し、該ホームページから本発明のコンピュータプログラムそのもの、もしくは圧縮され自動インストール機能を含むファイルをハードディスク等の記憶媒体にダウンロードすることによっても供給できる。また本発明のプログラムを構成するプログラムコードを複数のファイルに分割し、それぞれのファイルを異なるホームページからダウンロードすることによっても実現可能である。つまり本発明の機能処理をコンピュータで実現するためのプログラムファイルを複数のユーザに対してダウンロードさせるＷＷＷサーバも、本発明のクレームに含まれるものである。 Storage media for supplying the program include, for example, a floppy (registered trademark) disk, hard disk, optical disk, magneto-optical disk, MO, CD-ROM, CD-R, CD-RW, magnetic tape, nonvolatile memory card, ROM, and DVD (DVD-ROM, DVD-R). As another supply method, the program can be supplied by connecting to a home page on the Internet using the browser of a client computer and downloading, from that home page to a storage medium such as a hard disk, either the computer program of the present invention itself or a compressed file that includes an automatic installation function. The invention can also be realized by dividing the program code constituting the program of the present invention into a plurality of files and downloading each file from a different home page. That is, a WWW server that allows a plurality of users to download the program files for realizing the functional processing of the present invention on a computer is also covered by the claims of the present invention.

また、本発明のプログラムを暗号化してＣＤ−ＲＯＭ等の記憶媒体に格納してユーザに配布し、所定の条件を満足するユーザに対してインターネットを介してホームページから暗号化を解く鍵情報をダウンロードさせ、その鍵情報を使用することにより暗号化されたプログラムを実行してコンピュータにインストールさせて実現することも可能である。 In addition, the program of the present invention is encrypted, stored in a storage medium such as a CD-ROM, distributed to users, and key information for decryption is downloaded from a homepage via the Internet to users who satisfy predetermined conditions. It is also possible to execute the encrypted program by using the key information and install the program on a computer.

またコンピュータが、読み出したプログラムを実行することによって、前述した実施形態の機能が実現される他、そのプログラムの指示に基づき、コンピュータ上で稼動しているＯＳなどが、実際の処理の一部または全部を行ない、その処理によっても前述した実施形態の機能が実現され得る。 The functions of the above-described embodiments are realized when the computer executes the read program. In addition, an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the above-described embodiments can also be realized by that processing.

さらに、記録媒体から読み出されたプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのプログラムの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行ない、その処理によっても前述した実施形態の機能が実現される。 Furthermore, after the program read from the recording medium is written in a memory provided in a function expansion board inserted into the computer or a function expansion unit connected to the computer, the function expansion board or The CPU or the like provided in the function expansion unit performs part or all of the actual processing, and the functions of the above-described embodiments are realized by the processing.

本発明の実施の形態に係る撮像装置の機能構成を示す機能ブロック図である。 FIG. 1 is a functional block diagram showing the functional configuration of the imaging apparatus according to an embodiment of the present invention.
本実施の形態に係る撮像装置からの合成をプリンタで印刷した印刷例を示す図である。 FIG. 2 is a diagram showing an example of a composite image from the imaging apparatus according to the present embodiment printed by a printer.
本実施の形態に係る撮像装置の構成を示すブロック図である。 FIG. 3 is a block diagram showing the configuration of the imaging apparatus according to the present embodiment.
本発明の実施の形態１に係る撮像装置による処理を説明するフローチャートである。 FIG. 4 is a flowchart explaining processing by the imaging apparatus according to the first embodiment of the present invention.
本実施の形態１に係る選択タイムスタンプを選択する際に表示部に表示されるメニュ画面例を示す図である。 FIG. 5 is a diagram showing an example of the menu screen displayed on the display unit when selecting a selection time stamp according to the first embodiment.
本実施の形態２に係る撮像装置における処理を説明するフローチャートである。 FIG. 6 is a flowchart explaining processing in the imaging apparatus according to the second embodiment.
本実施の形態２に係る撮像装置における処理の変形例を説明するフローチャートである。 FIG. 7 is a flowchart explaining a modification of the processing in the imaging apparatus according to the second embodiment.
動画フレームと音声との関係の一例を説明する図である。 FIG. 8 is a diagram explaining an example of the relationship between moving image frames and audio.
音声を示すテキストと画像フレームとを合成して印刷した印刷例を示す図である。 FIG. 9 is a diagram showing a print example in which text representing a voice is combined with an image frame and printed.
音声を示すテキストと画像フレームとを合成して印刷した他の印刷例を示す図である。 FIG. 10 is a diagram showing another print example in which text representing a voice is combined with an image frame and printed.
本実施の形態３において、音声を表すテキストを分割する例を説明する図である。 FIG. 11 is a diagram explaining an example of dividing text representing a voice in the third embodiment.
本発明の実施の形態３に係る撮像装置による処理を説明するフローチャートである。 FIGS. 12 and 13 are flowcharts explaining processing by the imaging apparatus according to the third embodiment of the present invention.
本実施の形態３に係る一連の処理（図１２，図１３）によって分割、選択された画像フレームの一例を示す図である。 FIG. 14 is a diagram showing an example of image frames divided and selected by the series of processes (FIGS. 12 and 13) according to the third embodiment.
本実施の形態３において、各画像フレームに対応して、各分割されたテキストを合成して印刷した画像例を示す図である。 FIGS. 15 and 16 are diagrams showing examples of images in which each divided text is combined with the corresponding image frame and printed in the third embodiment.
本実施の形態に係る文章／文節の分割を自動で行うか否かを選択する際に表示部に表示される設定画面例を示す図である。 FIG. 17 is a diagram showing an example of the setting screen displayed on the display unit when selecting whether sentences/phrases are to be divided automatically according to the present embodiment.

Claims

1. An imaging apparatus capable of reproducing a moving image, comprising:
recognition means for recognizing voice contained in the moving image;
selection means for selecting a frame image as a print candidate based on a voice break recognized by the recognition means; and
output control means for combining and outputting the frame image selected by the selection means and text information indicating the voice recognized by the recognition means.

2. The imaging apparatus according to claim 1, wherein the selection means selects the image substantially at the center of the target images within a period corresponding to the length of the phrase up to the voice break.

3. The imaging apparatus according to claim 1, wherein the selection means selects the frame image at the time of the first sound among the target images within a period corresponding to the length of the phrase up to the voice break.

4. The imaging apparatus according to claim 1, wherein the selection means selects the frame image a predetermined time after the beginning of the target images within a period corresponding to the length of the phrase up to the voice break.

5. The imaging apparatus according to claim 1, wherein the selection means selects the frame image with the smallest motion vector magnitude among the target images within a period corresponding to the length of the phrase up to the voice break.

6. The imaging apparatus according to claim 1, wherein the selection means includes means for acquiring shooting parameters of frame images, and selects the print-candidate frame image based on the shooting parameters.

7. An imaging apparatus capable of reproducing a moving image, comprising:
voice recognition means for recognizing input voice information;
recognition means for recognizing voice contained in the moving image;
comparison means for comparing the voice information recognized by the voice recognition means with the voice information recognized by the recognition means;
extraction means for extracting a frame image for which the comparison means determines a match; and
output control means for combining and outputting the frame image extracted by the extraction means and text information indicating the voice information.

8. The imaging apparatus according to claim 7, wherein the extraction means extracts a frame image satisfying a predetermined condition from among a plurality of frame images for which the comparison means determines a match.

9. The imaging apparatus according to claim 8, wherein the frame image satisfying the predetermined condition is the frame image substantially at the center of the target images within a period corresponding to the length of the phrase up to the voice break.

10. The imaging apparatus according to claim 8, wherein the frame image satisfying the predetermined condition is the frame image at the time of the first sound among the target images within a period corresponding to the length of the phrase up to the voice break.

11. The imaging apparatus according to claim 8, wherein the frame image satisfying the predetermined condition is the frame image a predetermined time after the beginning of the target images within a period corresponding to the length of the phrase up to the voice break.

12. The imaging apparatus according to claim 8, wherein the frame image satisfying the predetermined condition is the frame image with the smallest motion vector magnitude among the target images within a period corresponding to the length of the phrase up to the voice break.

13. The imaging apparatus according to claim 8, wherein the frame image satisfying the predetermined condition is determined based on shooting parameters of the frame image.

14. An imaging apparatus capable of reproducing a moving image, comprising:
recognition means for recognizing voice contained in the moving image;
selection means for selecting a frame image as a print candidate based on a voice break recognized by the recognition means;
image selection means for determining whether to divide text information indicating the voice recognized by the recognition means and, when dividing, selecting a frame image corresponding to the voice that corresponds to each divided text; and
output control means for, when the text information is not divided, combining and outputting the frame image selected by the selection means and the text information indicating the voice recognized by the recognition means, and, when the text information is divided, combining and outputting the frame image selected by the image selection means and the text information corresponding to the divided text.

15. The imaging apparatus according to claim 14, wherein the selection means selects the image substantially at the center of the target images within a period corresponding to the length of the phrase up to the voice break.

16. The imaging apparatus according to claim 14, wherein the selection means selects the frame image at the time of the first sound among the target images within a period corresponding to the length of the phrase up to the voice break.

17. The imaging apparatus according to claim 14, wherein the selection means selects the frame image a predetermined time after the beginning of the target images within a period corresponding to the length of the phrase up to the voice break.

18. The imaging apparatus according to claim 14, wherein the selection means selects the frame image with the smallest motion vector magnitude among the target images within a period corresponding to the length of the phrase up to the voice break.

19. The imaging apparatus according to claim 14, wherein the selection means includes means for acquiring shooting parameters of frame images, and selects the print-candidate frame image based on the shooting parameters.

20. The imaging apparatus according to claim 14, wherein the image selection means determines whether to divide the text information indicating the voice according to whether the number of characters in the phrase up to the voice break recognized by the recognition means exceeds a predetermined number of characters.

21. The imaging apparatus according to claim 14, wherein the image selection means determines whether to divide according to whether the characters of the phrase up to the voice break recognized by the recognition means fit within the combined image.

22. A method for controlling an imaging apparatus capable of reproducing a moving image, comprising:
a recognition step of recognizing voice contained in the moving image;
a selection step of selecting a frame image as a print candidate based on a voice break recognized in the recognition step; and
an output control step of combining and outputting the frame image selected in the selection step and text information indicating the voice recognized in the recognition step.

23. The method for controlling an imaging apparatus according to claim 22, wherein, in the selection step, the image substantially at the center of the target images within a period corresponding to the length of the phrase up to the voice break is selected.

24. The method for controlling an imaging apparatus according to claim 22, wherein, in the selection step, the frame image at the time of the first sound among the target images within a period corresponding to the length of the phrase up to the voice break is selected.

25. The method for controlling an imaging apparatus according to claim 22, wherein, in the selection step, the frame image a predetermined time after the beginning of the target images within a period corresponding to the length of the phrase up to the voice break is selected.

26. The method for controlling an imaging apparatus according to claim 22, wherein, in the selection step, the frame image with the smallest motion vector magnitude among the target images within a period corresponding to the length of the phrase up to the voice break is selected.

27. The method for controlling an imaging apparatus according to claim 22, wherein, in the selection step, shooting parameters of frame images are acquired and the print-candidate frame image is selected based on the shooting parameters.

28. A method for controlling an imaging apparatus capable of reproducing a moving image, comprising:
a voice recognition step of recognizing input voice information;
a recognition step of recognizing voice contained in the moving image;
a comparison step of comparing the voice information recognized in the voice recognition step with the voice information recognized in the recognition step;
an extraction step of extracting a frame image for which a match is determined in the comparison step; and
an output control step of combining and outputting the frame image extracted in the extraction step and text information indicating the voice information.

29. The method for controlling an imaging apparatus according to claim 28, wherein, in the extraction step, a frame image satisfying a predetermined condition is extracted from among a plurality of frame images for which a match is determined in the comparison step.

30. The method for controlling an imaging apparatus according to claim 29, wherein the frame image satisfying the predetermined condition is the frame image substantially at the center of the target images within a period corresponding to the length of the phrase up to the voice break.

31. The method for controlling an imaging apparatus according to claim 29, wherein the frame image satisfying the predetermined condition is the frame image at the time of the first sound among the target images within a period corresponding to the length of the phrase up to the voice break.

32. The method for controlling an imaging apparatus according to claim 29, wherein the frame image satisfying the predetermined condition is the frame image a predetermined time after the beginning of the target images within a period corresponding to the length of the phrase up to the voice break.

33. The method for controlling an imaging apparatus according to claim 29, wherein the frame image satisfying the predetermined condition is the frame image with the smallest motion vector magnitude among the target images within a period corresponding to the length of the phrase up to the voice break.

34. The method for controlling an imaging apparatus according to claim 29, wherein the frame image satisfying the predetermined condition is determined based on shooting parameters of the frame image.

35. A method for controlling an imaging apparatus capable of reproducing a moving image, comprising:
a recognition step of recognizing voice contained in the moving image;
a selection step of selecting a frame image as a print candidate based on a voice break recognized in the recognition step;
an image selection step of determining whether to divide text information indicating the voice recognized in the recognition step and, when dividing, selecting a frame image corresponding to the voice that corresponds to each divided text; and
an output control step of, when the text information is not divided, combining and outputting the frame image selected in the selection step and the text information indicating the voice recognized in the recognition step, and, when the text information is divided, combining and outputting the frame image selected in the image selection step and the text information corresponding to the divided text.

36. The method for controlling an imaging apparatus according to claim 35, wherein, in the selection step, the image substantially at the center of the target images within a period corresponding to the length of the phrase up to the voice break is selected.

37. The method for controlling an imaging apparatus according to claim 35, wherein, in the selection step, the frame image at the time of the first sound among the target images within a period corresponding to the length of the phrase up to the voice break is selected.

38. The method for controlling an imaging apparatus according to claim 35, wherein, in the selection step, the frame image a predetermined time after the beginning of the target images within a period corresponding to the length of the phrase up to the voice break is selected.

39. The method for controlling an imaging apparatus according to claim 35, wherein, in the selection step, the frame image with the smallest motion vector magnitude among the target images within a period corresponding to the length of the phrase up to the voice break is selected.

40. The method for controlling an imaging apparatus according to claim 35, wherein the selection step includes a step of acquiring shooting parameters of frame images, and the print-candidate frame image is selected based on the shooting parameters.

41. The method for controlling an imaging apparatus according to claim 35, wherein, in the image selection step, whether to divide the text information indicating the voice is determined according to whether the number of characters in the phrase up to the voice break recognized in the recognition step exceeds a predetermined number of characters.

42. The method for controlling an imaging apparatus according to claim 35, wherein, in the image selection step, whether to divide is determined according to whether the characters of the phrase up to the voice break recognized in the recognition step fit within the combined image.