JP2011030089A - Image processing apparatus and program

Info

Publication number: JP2011030089A
Application number: JP2009175585A
Authority: JP
Inventors: Takeshi Iwamoto; 健士岩本
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2009-07-28
Filing date: 2009-07-28
Publication date: 2011-02-10
Anticipated expiration: 2029-07-28
Also published as: JP5434337B2

Abstract

PROBLEM TO BE SOLVED: To provide an image processing apparatus and a program that reflect the intention of the user capturing the images and accurately determine the scene of the subject the user wants from among continuously shot images.

SOLUTION: The image processing apparatus includes: a program memory 22 that stores audio frequency band information indicating the audio frequency bands corresponding to each of a plurality of subject types; an imaging and recording system 11-13, 16, 17 that captures a plurality of image data items with audio data attached; an image processing unit 13 that recognizes, from the captured image data items, the type of subject appearing in each image; and a control unit 20, a main memory 21 and the program memory 22, which read from the program memory 22 the audio frequency band information corresponding to the subject type recognized by the image processing unit 13, set the audio frequency band thus read, and extract, from the captured image data items, the image data whose attached audio data reproduces sound in the set audio frequency band at or above a specified volume.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an image processing apparatus and a program that are particularly suitable for an imaging apparatus, such as a digital camera, that photographs subjects showing unpredictable movement.

Conventionally, a technique has been proposed that, while capturing images with audio data, judges a scene in which sound at or above a predetermined level is detected for a predetermined time or longer to be a highlight scene (see, for example, Patent Document 1).
According to the technique of that patent document, the cheers of spectators or the raised voice of an announcer in a sports video, for example, can be detected and the scene judged to be a highlight scene.

The above patent document also proposes a technique that judges a scene to be a highlight scene when a frequency similar to a predetermined audio frequency is detected in the captured images for a predetermined time or longer. With this technique, a singer's high or low notes in a music video, for example, can be detected and the scene judged to be a highlight scene.

Patent Document 1: JP 2009-027430 A

The technique described in the above patent document judges a highlight scene solely from how long the overall volume, or the volume in a frequency band similar to a preset audio frequency, stays at or above a predetermined level. It is therefore quite possible that sound emitted by a subject the photographing user has no interest in will also be judged to be a highlight scene and processed accordingly.

The present invention has been made in view of the above circumstances. Its object is to provide an image processing apparatus and a program that reflect the intention of the photographing user and can accurately determine the scene of the subject the user wants from among continuously shot images.

One aspect of the present invention comprises: first storage means storing audio frequency band information indicating the audio frequency bands respectively corresponding to a plurality of subject types; acquisition means for acquiring a plurality of image data items with audio data attached; recognition means for recognizing the type of the subject appearing in the image represented by each of the image data items acquired by the acquisition means; setting means for reading from the first storage means the audio frequency band information corresponding to the subject type recognized by the recognition means and setting the audio frequency band indicated by the information read; and extraction means for extracting, from the image data items acquired by the acquisition means, the image data whose attached audio data reproduces sound that belongs to the audio frequency band set by the setting means and is judged to be at or above a specified volume.

Another aspect of the present invention comprises: acquisition means for acquiring a plurality of image data items with audio data attached; detection means for detecting the facial expression of the subject appearing in the image represented by each of the image data items acquired by the acquisition means; first extraction means for extracting the image data in which the facial expression detected by the detection means matches a preset expression; determination means for determining whether the volume of the sound reproduced from the audio data attached to the image data extracted by the first extraction means is at or above a specified volume; and second extraction means for extracting, from the image data extracted by the first extraction means, the image data whose attached audio data is determined by the determination means to reproduce sound at or above the specified volume.

According to the present invention, the intention of the photographing user is reflected, and the scene of the subject the user wants can be accurately determined from among continuously shot images.

FIG. 1 is a block diagram showing the functional configuration of the electronic circuit of a digital camera according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing the processing performed when the best-pick shot ("ichioshi shot") function is executed in the first embodiment.
FIG. 3 is a flowchart showing the processing of the best-pick shot selection subroutine in the first embodiment.
FIG. 4 is a diagram illustrating part of a series of captured images in the first embodiment.
FIG. 5 is a diagram illustrating part of a series of captured images in the first embodiment.
FIG. 6 is a flowchart showing the processing performed when the best-pick shot function is executed in a second embodiment of the present invention.
FIG. 7 is a flowchart showing the processing of the best-pick shot selection subroutine in the second embodiment.
FIG. 8 is a diagram illustrating part of a series of captured images in the second embodiment.

(First embodiment)
A first embodiment, in which the present invention is applied to a digital camera, is described below with reference to the drawings.

FIG. 1 shows the circuit configuration of a digital camera 10 according to this embodiment. An optical lens unit 11 disposed on the front of the camera housing forms an optical image of the subject on the imaging surface of a CMOS image sensor 12, which is a solid-state imaging device.

In the monitor state, also called through-image display or live-view display, the image signal obtained by the CMOS image sensor 12 is sent to an image processing unit 13.

The image processing unit 13 digitizes the image signal by performing correlated double sampling, automatic gain control, and A/D conversion.

The image processing unit 13 then applies color processing, including pixel interpolation and gamma correction, to the digitized signal and temporarily stores the result in an image buffer 14 via a system bus SB. Hereinafter, the image signal held in the image buffer 14 is referred to as "image data".

To support the cyclic storage operation described later, the image buffer 14 has enough capacity to store image data for a fixed period; for example, at a frame rate of 30 frames/second for 10 seconds, a total of 300 frames.

The image data held in the image buffer 14 is read out to the image processing unit 13 via the system bus SB, sent to a display unit 15, again via the system bus SB, and displayed as a through image.
The image processing unit 13 also uses the face recognition technique described later to detect face areas in the image represented by the image data held in the image buffer 14.

Like the optical lens unit 11, a microphone 16 is disposed on the front of the camera housing. Its acoustic directivity roughly matches the shooting angle of view of the optical lens unit 11, so it picks up sound from the direction of the subject. The microphone 16 converts the input sound into an electrical signal and outputs it to an audio processing unit 17.

When recording, the audio processing unit 17 converts the audio signal input from the microphone 16 into digital data. Hereinafter, the digitized audio signal is referred to as "audio data". The audio processing unit 17 temporarily stores the digital audio data, via the system bus SB, in an audio buffer 18 configured as a ring buffer.

In addition, the audio processing unit 17 has an audio filter function covering a plurality of frequency bands. For example, it provides three audio filters: 80 Hz to 150 Hz for adult male voices, 200 Hz to 350 Hz for adult female voices, and 300 Hz to 800 Hz for children's voices. By applying these filters to the audio data held in the audio buffer 18, the audio processing unit 17 can extract the sound in the frequency band corresponding to an adult male, an adult female, or a child.
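As a rough illustration of the band filters described above, the following sketch measures how much of a sampled signal's energy falls within each of the three bands. It substitutes a naive discrete Fourier transform for a real filter bank, and the function names, the sample rate, and the 500 Hz test tone are assumptions made for the example, not part of the embodiment.

```python
import math

# Band edges in Hz, taken from the description above.
VOICE_BANDS = {
    "adult_male": (80.0, 150.0),
    "adult_female": (200.0, 350.0),
    "child": (300.0, 800.0),
}

def band_energy(samples, sample_rate, low_hz, high_hz):
    """Sum the squared DFT magnitudes of the bins lying in [low_hz, high_hz]."""
    n = len(samples)
    energy = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if low_hz <= freq <= high_hz:
            re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
            im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
            energy += re * re + im * im
    return energy

def energy_per_band(samples, sample_rate):
    """Return the energy found in each voice band, keyed by subject type."""
    return {
        name: band_energy(samples, sample_rate, lo, hi)
        for name, (lo, hi) in VOICE_BANDS.items()
    }

# Hypothetical input: a pure 500 Hz tone sampled at 4 kHz (0.1 s of audio).
rate = 4000
tone = [math.sin(2 * math.pi * 500 * i / rate) for i in range(400)]
energies = energy_per_band(tone, rate)
loudest = max(energies, key=energies.get)
```

Here `loudest` identifies the band holding the most energy. Note that the adult female and child bands overlap between 300 Hz and 350 Hz, so a sound in that range contributes to both.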

Furthermore, the audio processing unit 17 cuts out parts of the audio data held in the audio buffer 18 as needed, compresses them in a predetermined data file format, for example AAC (Moving Picture Experts Group-4 Advanced Audio Coding), to create an audio data file, and sends the file to the recording medium described later.

The audio processing unit 17 also includes a sound source circuit such as a PCM sound source. During playback it decompresses the audio data file sent to it, converts the data to an analog signal, and drives a speaker 19 on the rear of the housing of the digital camera 10 to output the sound.

A control unit 20 performs overall control of the above circuits. The control unit 20 consists of a CPU and is directly connected to a main memory 21 and a program memory 22. The main memory 21 consists of SDRAM (synchronous DRAM) and functions as a work memory. The program memory 22 consists of an electrically rewritable nonvolatile memory, such as flash memory, and permanently stores operating programs, including the control for the shooting mode described later, as well as data.
In addition, the program memory 22 stores in advance audio frequency band information for each of a plurality of subject types, information indicating the priority order of the subject types, and the like.

The control unit 20 reads the necessary programs and data from the program memory 22, temporarily loads them into the main memory 21 as appropriate, and controls the operation of the digital camera 10 as a whole.

The control unit 20 also executes various control operations in response to key operation signals input directly from a key input unit 23. Via the system bus SB, the control unit 20 is connected to the image processing unit 13, image buffer 14, display unit 15, audio processing unit 17, and audio buffer 18, and also to a lens drive unit 24, a flash drive unit 25, a CMOS drive unit 26, a compression/decompression processing unit 27, a memory card controller 28, and a USB interface (I/F) 29.

The key input unit 23 includes, for example, a power key, a shutter key, a zoom key, a shooting mode key, a playback mode key, a menu key, cursor ("↑", "→", "↓", "←") keys, a set key, and a playback key.

The lens drive unit 24 receives a control signal from the control unit 20 and controls the rotation of a lens stepping motor (M) 30, moving some of the lens groups that make up the optical lens unit 11, specifically the focus lens and the zoom lens, individually along the optical axis.

During still image shooting, the flash drive unit 25 receives a control signal from the control unit 20 and lights a flash unit 31, which consists of a plurality of white high-intensity LEDs, in synchronization with the shooting timing.

The CMOS drive unit 26 drives the scanning of the CMOS image sensor 12 according to the shooting conditions set at that time.

When an image is captured in response to a shutter key operation on the key input unit 23, the compression/decompression processing unit 27 compresses the image data held in the image buffer 14 in a predetermined data file format. For JPEG (Joint Photographic Experts Group), for example, it applies data compression such as DCT (discrete cosine transform) and Huffman coding to create an image data file with a greatly reduced data volume. The image data file is recorded on a memory card 32 via the system bus SB and the memory card controller 28.

In playback mode, the compression/decompression processing unit 27 receives, via the system bus SB, the image data read from the memory card 32 through the memory card controller 28, decompresses it by reversing the procedure used for recording to obtain image data of the original size, and stores that data in the image buffer 14 via the system bus SB. The display unit 15 then displays the image for playback from the image data held in the image buffer 14.
The memory card controller 28 is connected to the memory card 32 via a card connector 33. The memory card 32 is detachably mounted in the digital camera 10 and serves as its recording medium for image data and the like. It contains flash memory, a nonvolatile memory that can be electrically rewritten in units of blocks, and the drive circuit for that memory.
The USB interface 29 handles data transmission and reception when the digital camera 10 is connected to an external device, such as a personal computer, via a USB connector 34.

Next, the operation of the above embodiment is described.
The operations described below are executed when still images are captured with the "best-pick shot" ("ichioshi shot") function in shooting mode. The control unit 20 reads the operating program and data stored in the program memory 22, loads them into the main memory 21, and then executes them.

Here, the "best-pick shot" function is a function in which the digital camera 10 automatically selects and records the best scene from a series of continuously shot images.

The operating programs, fixed data, and so on stored in the program memory 22 include those stored when the digital camera 10 was shipped from the factory and also, for example, new operating programs and data downloaded from outside via the USB connector 34 and the USB interface 29 when the user connects the digital camera 10 to a personal computer to upgrade it.

The shutter key, which forms part of the key input unit 23, has a two-stage operating stroke. At the first stage of the stroke (hereinafter "half-press"), the camera enters the shooting preparation state and starts both detection of the subject's face area using the face recognition technique and past shooting.

Once the shutter key reaches the second stage of the stroke (hereinafter "full press"), main shooting by continuous shooting continues until the full press is released.

The detection of face areas in an image using the face recognition technique is briefly described here.
Various methods are known for detecting face areas in an image, and the digital camera 10 according to this embodiment can adopt any of them.
For example, a face area may be detected by extracting skin-color regions from the input image, as in the method described in JP 2000-105819 A, or by using the method described in JP 2006-211139 A or JP 2006-072770 A.

Typically, the image in a region of interest set within the image is compared with a predetermined template face image to determine the similarity between the two, and that similarity is used to detect whether the region of interest contains a face, in other words whether the region of interest is a face area.
The similarity is determined by extracting feature quantities that are effective for deciding whether something is a face. These feature quantities include horizontal edges, vertical edges, right-diagonal edges, left-diagonal edges, and so on.

The region of interest is shifted one pixel at a time horizontally or vertically. The image within the shifted region of interest is then compared with the template face image, the similarity between the two is determined again, and the same detection is performed.
In this way, the region of interest is updated while being shifted one pixel at a time, for example from the upper left of the image toward the lower right. The image is also reduced by a fixed ratio, and the same face area detection is performed on the reduced image.
Repeating this processing makes it possible to detect face areas of any size in the image.
In the digital camera 10 of this embodiment, the image processing unit 13 performs this face area detection on the image data held in the image buffer 14 under the control of the control unit 20.
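The scan described above, a window shifted one pixel at a time and repeated on successively reduced copies of the image, can be sketched as follows. This is a minimal illustration only: it scores similarity with a sum of absolute pixel differences instead of the edge features mentioned in the text, reduces the image by simply dropping every second pixel, and all function names and the toy data are assumptions.

```python
def similarity(window, template):
    """Negative sum of absolute differences; 0 means a perfect match."""
    return -sum(
        abs(w - t)
        for w_row, t_row in zip(window, template)
        for w, t in zip(w_row, t_row)
    )

def crop(image, top, left, height, width):
    """Cut the region of interest out of the image."""
    return [row[left:left + width] for row in image[top:top + height]]

def shrink(image):
    """Halve the image by keeping every second pixel (a crude reduction)."""
    return [row[::2] for row in image[::2]]

def detect(image, template, threshold):
    """Return (top, left, scale) for every window whose score reaches threshold."""
    t_h, t_w = len(template), len(template[0])
    hits = []
    scale = 1
    while len(image) >= t_h and len(image[0]) >= t_w:
        # Shift the region of interest one pixel at a time, upper left to lower right.
        for top in range(len(image) - t_h + 1):
            for left in range(len(image[0]) - t_w + 1):
                if similarity(crop(image, top, left, t_h, t_w), template) >= threshold:
                    hits.append((top * scale, left * scale, scale))
        # Repeat on a reduced image so that larger faces fit the fixed template.
        image = shrink(image)
        scale *= 2
    return hits

# Toy data: a 2x2 "face" pattern placed at row 2, column 3 of a 6x6 image.
template = [[9, 1], [1, 9]]
image = [[0] * 6 for _ in range(6)]
image[2][3], image[2][4] = 9, 1
image[3][3], image[3][4] = 1, 9
found = detect(image, template, threshold=0)
```

Each hit is reported in the coordinates of the original image together with the reduction factor at which it was found, which is what lets a fixed-size template locate faces of different sizes.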

FIG. 2 shows how image data is handled when the best-pick shot function is selected. This function shoots continuously, including images from before the shutter key operation, and automatically selects and records the still image with the highest evaluation.

At the start, the control unit 20 begins the monitoring operation, also called through-image display. The image data captured by the CMOS image sensor 12 is temporarily held in the image buffer 14 and displayed on the display unit 15, so that what the CMOS image sensor 12 is capturing appears on the display unit 15 in real time (step S101).

While maintaining this through-image display, the control unit 20 also determines whether the shutter key of the key input unit 23 has been half-pressed (step S102).
If the shutter key has not been half-pressed, the process returns to step S101 and waits for a half-press while continuing the through-image display.

When the shutter key is half-pressed, the control unit 20 detects this in step S102 and starts past shooting, in which still images with audio are acquired continuously and cyclically for a fixed period (step S103).

In past shooting, image data and the accompanying audio data are acquired continuously at a preset frame rate and duration, for example 30 frames/second for 2 seconds, a maximum of 60 frames in total. The image buffer 14 and the audio buffer 18 hold this data cyclically, so that at most the latest 2 seconds of images with audio are always retained.
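The cyclic retention described above behaves like a fixed-capacity ring buffer: once 60 frames are held, each new frame displaces the oldest. A minimal sketch, assuming the example figures from the text (30 frames/second, 2 seconds) and a hypothetical `PastBuffer` class that keeps each image together with its audio:

```python
from collections import deque

FRAME_RATE = 30     # frames per second, example value from the description
PAST_SECONDS = 2    # seconds of history retained during past shooting

class PastBuffer:
    """Keeps only the most recent frame_rate * seconds (image, audio) pairs."""

    def __init__(self, frame_rate=FRAME_RATE, seconds=PAST_SECONDS):
        # A deque with maxlen silently drops the oldest entry when full.
        self.frames = deque(maxlen=frame_rate * seconds)

    def push(self, image, audio):
        self.frames.append((image, audio))

    def clear(self):
        """Discard everything, as when the half-press is released."""
        self.frames.clear()

buf = PastBuffer()
for i in range(100):               # simulate 100 captured frames
    buf.push(f"img{i}", f"aud{i}")
```

After 100 pushes only frames 40 through 99 remain. The embodiment uses separate image and audio buffers; pairing them in one structure here is a simplification for the example.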

During past shooting, face area detection using the face recognition technique described above is performed on the image represented by the latest image data.
Depending on the settings made in advance, a frame marking the face area may be superimposed on the image shown on the display unit 15 to notify the user of the digital camera 10 of the detection result.

While past shooting with audio is in progress, the control unit 20 continuously checks whether the shutter key has been fully pressed (step S104) and whether the half-press has been released (step S105).

By repeating steps S103 to S105 in this way, the control unit 20 continues the cyclic recording of past shooting while waiting for the shutter key to be fully pressed or for the half-press to be released.

If the half-press is released during past shooting, the control unit 20 detects this in step S105, erases all the image data held in the image buffer 14 and all the audio data held in the audio buffer 18 by the past shooting so far (step S106), and returns to step S103.

If it determines in step S104 that the shutter key has been fully pressed, the control unit 20 starts acquiring still images with audio as main shooting, continuing on from the past shooting (step S107).

In main shooting, the control unit 20 continuously acquires image data and the accompanying audio data at a preset frame rate, for example 30 frames/second, and stores them in the image buffer 14 and the audio buffer 18.

During main shooting with audio, the control unit 20 determines whether either of two conditions holds: the full press of the shutter key has been released, or a fixed time, for example 8 seconds, has elapsed since the full press began (step S108).

If neither condition holds, the process returns to step S107 and continues main shooting with audio while waiting for the full press to be released or the fixed time to elapse.

When the full press of the shutter key is released or the fixed time elapses, the control unit 20 detects this in step S108 and stops main shooting at that point (step S109).
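Steps S101 to S109 amount to a small state machine driven by the shutter key. The sketch below is one possible reading of that flow, not the camera's actual firmware: the state names, the `step` function, and the way shutter input is encoded ('none', 'half', 'full') are assumptions, and a full press seen while monitoring is treated as passing through the half-press first.

```python
MONITOR, PAST, MAIN, STOPPED = "monitor", "past", "main", "stopped"
MAX_MAIN_SECONDS = 8   # example limit on main shooting from the description

def step(state, shutter, main_elapsed=0.0):
    """Advance one tick. shutter is 'none', 'half' or 'full'.

    Returns (new_state, clear_buffers).
    """
    if state == MONITOR:                      # S101 / S102
        if shutter in ("half", "full"):
            return PAST, False                # S103: start past shooting
        return MONITOR, False
    if state == PAST:
        if shutter == "full":                 # S104
            return MAIN, False                # S107: start main shooting
        if shutter == "none":                 # S105
            return PAST, True                 # S106: erase buffers, back to S103
        return PAST, False
    if state == MAIN:                         # S108
        if shutter != "full" or main_elapsed >= MAX_MAIN_SECONDS:
            return STOPPED, False             # S109: stop main shooting
        return MAIN, False
    return STOPPED, False

def run(events):
    """Feed (shutter, main_elapsed) pairs through the state machine."""
    state, trace = MONITOR, []
    for shutter, elapsed in events:
        state, cleared = step(state, shutter, elapsed)
        trace.append((state, cleared))
    return trace

trace = run([("none", 0), ("half", 0), ("none", 0),
             ("full", 0), ("full", 3), ("full", 8)])
```

The trace shows the half-press release clearing the buffers without leaving past shooting, as the text specifies for step S106, and main shooting ending once 8 seconds have elapsed.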

Next, based on the series of audio data held in the audio buffer 18 by the past shooting and main shooting up to that point, the control unit 20 recognizes the subject types and selects, from the series of image data held in the image buffer 14, the data of the best-pick shot image in which a subject of a high-priority type appears (step S110).

FIG. 3 shows the processing of the subroutine that selects the best-pick shot image taking the subject type into account.
At the start, the control unit 20 causes the image processing unit 13 to perform face recognition on the image data of the first frame of main shooting held in the image buffer 14, and to identify the number of subject faces appearing in the image represented by that data and the type of each face, specifically adult male, adult female, or child (step S301).

Techniques for identifying a person's gender and age group from a face area appearing in an image are known, including, for example, the technique described in JP H11-175724 A.

JP H11-175724 A, for example, provides: an attribute storage unit in which a large number of training face images are grouped by gender and age group and stored in association with each group; a subspace generation unit that expands the set of training face image vectors grouped by attribute into a subspace specific to each group and generates eigenvectors for each group; and a determination storage unit that stores the eigenvectors generated for each group as data for determining the group of an unknown input face image. The similarity between the unknown input face image and each group is determined by referring to the data in the determination storage unit, and the attributes of the input face image are identified from the determined group and the contents of the attribute storage unit.

Based on the result of step S301, the control unit 20 first determines whether the image processing unit 13 was able to detect at least one face area of a person by the face recognition technique (step S302).

If no face area could be detected, image data cannot be selected according to the subject type. In that case, the audio data held in the audio buffer 18 is not used; the image processing unit 13 is made to select, from the series of image data stored in the image buffer 14 alone, the image in which the subject is least blurred (step S303), and the subroutine of FIG. 3 ends.

For this selection, the image processing unit 13 calculates an evaluation value for each image, for example by extracting edge components from the entire image, and selects the image data with the highest evaluation value as the image with the least blur.
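
The blur-based fallback of step S303 can be sketched as follows. This is only an illustration under stated assumptions: the text says the evaluation value comes from extracting edge components but does not give a formula, so the sum of absolute neighbouring-pixel differences used here, and the names `edge_score` and `pick_least_blurred`, are hypothetical stand-ins. Images are modelled as small grayscale grids.

```python
def edge_score(img):
    """Evaluation value: sum of absolute differences between each pixel
    and its right and lower neighbours (an assumed edge measure)."""
    h, w = len(img), len(img[0])
    score = 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                score += abs(img[y][x + 1] - img[y][x])
            if y + 1 < h:
                score += abs(img[y + 1][x] - img[y][x])
    return score


def pick_least_blurred(frames):
    """Return the index of the frame with the highest evaluation value."""
    return max(range(len(frames)), key=lambda i: edge_score(frames[i]))


# Two toy frames: strong edges versus nearly flat (blurred) content.
sharp = [[0, 255, 0, 255], [255, 0, 255, 0]]
blurred = [[120, 130, 120, 130], [130, 120, 130, 120]]
best = pick_least_blurred([blurred, sharp])
```

A real implementation would more likely use a filter such as a Laplacian or Sobel operator; the principle of ranking frames by edge strength is the same.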

If the control unit 20 determines in step S302 that the image processing unit 13 detected at least one face, it next determines whether the detected faces belong to more than one type (step S304).

If only one type was detected, the audio frequency band for that single recognized type is read from the program memory 22 and set (step S305).

FIG. 4A illustrates the operation screen shown on the display unit 15 for setting whether sound from the subject is used as a trigger when the one-press shot function is selected. Of the three options for this setting,
"OFF" (sound is not used as a trigger)
"ON" (sound is used as a trigger)
"Only when a face is recognized (sound is used as a trigger)"
the screen shows the state in which "Only when a face is recognized (sound is used as a trigger)" is selected.

FIG. 4B illustrates part of the image displayed on the display unit 15 during actual shooting when the setting shown in FIG. 4A has been made. In FIG. 4B, the image processing unit 13 executes face recognition on the subject T1 in the image, and the detected face area FA is superimposed on the image, informing the user of the digital camera 10 that the face area of the subject T1 has been detected.

If the subject type is recognized in step S305 as, for example, "child", the control unit 20 reads the audio frequency band information from the program memory 22 and sets an audio frequency band of 300 Hz to 800 Hz.

If it is determined in step S304 that the detected faces belong to more than one type, the control unit 20 reads from the program memory 22 the audio frequency band information for the type with the highest priority among those detected, based on the priority order of the types stored in the program memory 22, and sets that band (step S306).

For example, as shown in FIG. 5A, suppose that two subjects T2 and T3 are detected in the first image of the main shooting and that face recognition classifies subject T2 as "adult male" and subject T3 as "adult female".

In this digital camera 10, information stored in advance in the program memory 22 sets the priority order of the types to
"child" > "adult female" > "adult male"
so that the type "child" ranks highest and the type "adult male" ranks lowest.

For the image data of FIG. 5A showing subjects T2 and T3, the control unit 20 selects the type "adult female" of subject T3 as having the higher priority. In accordance with that selection, in step S306 the control unit 20 reads the audio frequency band information for the subject type "adult female" from the program memory 22 and sets an audio frequency band of 200 Hz to 350 Hz.
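
The type selection and band lookup of steps S304 to S306 amount to a priority search followed by a table lookup. The following is a minimal sketch, assuming the stored information can be modelled as a list and a dictionary; the names `PRIORITY`, `BANDS_HZ`, `select_type`, and `band_for` are hypothetical. Only the two bands quoted in this section are listed, because no band for "adult male" is given here.

```python
# Priority order stated in the text, highest first.
PRIORITY = ["child", "adult_female", "adult_male"]

# Audio frequency bands in Hz quoted in this section.
# The band for "adult_male" is not stated here, so it is left out.
BANDS_HZ = {"child": (300, 800), "adult_female": (200, 350)}


def select_type(detected_types):
    """Return the detected type with the highest priority, or None."""
    present = set(detected_types)
    if not present:
        return None
    return min(present, key=PRIORITY.index)


def band_for(detected_types):
    """Return (selected type, band in Hz or None if no band is listed)."""
    chosen = select_type(detected_types)
    return chosen, BANDS_HZ.get(chosen)


# FIG. 5A case: T2 is an adult male, T3 is an adult female.
chosen, band = band_for(["adult_male", "adult_female"])
```

With a single detected type, the same lookup covers step S305, since the only type present is trivially the highest-priority one.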

After the audio frequency band has been set in step S305 or step S306, the audio data from the past shooting and the main shooting stored in the audio buffer 18 is filtered according to the set band, and the volume level of the sound reproduced from the filtered audio data is detected (step S307).

The audio data with the highest detected volume level is then identified, and the image data corresponding to that audio data is selected from the image buffer 14 (step S308).
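
The band filtering and loudest-clip selection described above can be illustrated as follows. This is a sketch under assumptions, not the filter the apparatus uses: the text does not specify the filter design, so a plain discrete Fourier transform restricted to the set band stands in for it, and the names `band_level`, `pick_loudest`, and `tone`, the 2000 Hz sample rate, and the test tones are all hypothetical.

```python
import cmath
import math


def band_level(samples, sample_rate, band_hz):
    """Level of the signal components whose frequency lies inside band_hz,
    computed from DFT bins (assumed stand-in for the audio filter)."""
    n = len(samples)
    lo, hi = band_hz
    energy = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if lo <= freq <= hi:
            coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                        for t, s in enumerate(samples))
            energy += abs(coeff) ** 2
    return math.sqrt(energy) / n


def pick_loudest(clips, sample_rate, band_hz):
    """Return (index of the clip with the highest in-band level, all levels)."""
    levels = [band_level(c, sample_rate, band_hz) for c in clips]
    return max(range(len(clips)), key=levels.__getitem__), levels


def tone(freq_hz, amplitude, sample_rate=2000, n=200):
    """Synthetic sine tone used only to exercise the sketch."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n)]


# Clip 0 is loud but outside the 200-350 Hz band; clips 1 and 2 are inside it.
clips = [tone(100, 1.0), tone(300, 0.5), tone(300, 0.2)]
index, levels = pick_loudest(clips, 2000, (200, 350))
```

The loud 100 Hz clip is ignored because it lies outside the set band, which is the effect the embodiment relies on to avoid being misled by sounds from other kinds of subject.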

For the series of image data whose first main-shooting frame is shown in FIG. 5A, suppose that the image data shown in FIG. 5B is selected as the image data for which the volume level of the woman's voice was highest.
With this selection of image data, the subroutine of FIG. 3 ends and processing returns to the main routine of FIG. 2.

In the main routine of FIG. 2, the selected image data is read from the image buffer 14 and compressed by the compression/decompression processing unit 27 in a predetermined data file format to create an image data file, which is then recorded on the memory card 32, the recording medium, via the memory card controller 28 (step S111).
This completes shooting with the one-press shot function based on audio data, performed mainly by the control unit 20, and processing returns to step S101 in preparation for the next shot.

According to the present embodiment, the intention of the user taking the picture is reflected, and the scene of the subject that the user wants can be accurately determined from changes in the volume level and recorded from among the continuously shot images.

In particular, because one of a plurality of subject types is selected and the volume level is detected within the audio frequency band corresponding to the selected type, the scene can be determined more accurately without being misled by sounds associated with subjects of types that were not selected.

In addition, because the subject type is selected automatically according to a preset priority order, the camera selects the subject type presumed to have the highest priority and then proceeds to select one item of image data from the series, without the user having to perform any complicated operation. This reduces the user's effort and makes shooting easy.

Although not described in the above embodiment, the user may instead select the subject type manually by operating the key input unit 23, with the control unit 20 accepting the selection and storing the selection information in the program memory 22.
In that case, the control unit 20 determines the priority of the subject types according to the user's selection information stored in the main memory 21, so that appropriate image data can be selected in a way that reliably reflects the user's shooting intention.

In the above embodiment, the audio data with the highest volume level among the audio data corresponding to the image data is identified, and the corresponding image data is selected. This simplifies the selection criterion for image data, reduces the load on the apparatus, and allows an image to be selected quickly.

Furthermore, although not shown in the above embodiment, the user may set an arbitrary volume level for the audio data corresponding to the images by operating the key input unit 23 before shooting. The control unit 20 accepts this operation and stores the set volume level in the program memory 22; during actual shooting, the control unit 20 records on the memory card 32 the image data whose corresponding audio reaches or exceeds the stored volume level.

Allowing the user to set the reference volume level in this way makes it possible to automatically record and keep more image data, rather than only one image from the series, according to the volume level the user has set, which is particularly useful when the memory card 32 has ample free space.
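
The difference between this variation and the maximum-only selection can be sketched in a few lines. The function name `frames_to_record` and the sample levels are hypothetical; the levels are assumed to have been measured already for the audio accompanying each frame.

```python
def frames_to_record(levels, threshold):
    """Return the indices of all frames whose audio level reaches the
    user-set threshold, instead of only the single loudest frame."""
    return [i for i, level in enumerate(levels) if level >= threshold]


# Hypothetical per-frame levels and a user-set reference level.
levels = [0.02, 0.31, 0.27, 0.05]
selected = frames_to_record(levels, threshold=0.25)
```

Lowering the threshold records more frames and raising it records fewer, so the setting trades memory card usage against the chance of keeping a wanted scene.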

In the above embodiment, adult male, adult female, and child are recognized as subject types by face recognition, and a type is then selected according to a preset priority order. If the user sets the priority order of these types in advance according to the shooting intention and stores it in the program memory 22, the user is spared the effort of selecting the main subject at shooting time, and the main subject can be determined smoothly and accurately when recording image data.

(Second Embodiment)
A second embodiment, in which the present invention is applied to a digital camera, is described below with reference to the drawings.

The circuit configuration of the digital camera 10′ according to this embodiment is basically the same as that shown in FIG. 1; the same reference numerals are used for the same parts, and their illustration and description are omitted.

Here, the "one-press shot" function automatically selects and records, on the digital camera 10′ side, the best scene from a series of continuously shot images.

The operation programs and other data stored in the program memory 22 include not only those stored at the time the digital camera 10′ was shipped from the factory, but also new operation programs and data downloaded from outside via the USB connector 34 and the USB interface 29 when the user connects the digital camera 10′ to a personal computer, for example to upgrade the camera.

The technique for detecting a face area in an image by face recognition is the same as in the first embodiment, so its description is omitted.

In the digital camera 10′ of this embodiment, the image processing unit 13 executes face area detection using the image data held in the image buffer 14 under the control of the control unit 20.

FIG. 6 shows how image data is handled when the one-press shot function is selected, in which continuous shooting includes images from before the shutter key operation and the still image with the highest evaluation is automatically selected and recorded.

At the start, the control unit 20 begins a monitoring operation, also called through-image display: image data captured by the CMOS image sensor 12 is temporarily held in the image buffer 14 and displayed on the display unit 15, so that what the CMOS image sensor 12 is capturing is shown on the display unit 15 in real time (step S501).

While continuing the through-image display, the control unit 20 also determines whether the shutter key of the key input unit 23 has been half-pressed (step S502).
If the shutter key has not been half-pressed, the control unit 20 returns to step S501 and waits for a half-press while continuing the through-image display.

When the shutter key is half-pressed, the control unit 20 detects this in step S502 and starts past shooting, in which still images with sound are acquired continuously and cyclically for a fixed length of time (step S503).

During past shooting, face area detection is executed on the image represented by the latest image data, using the face recognition technique described in detail in the first embodiment.
Depending on the settings made in advance, a frame indicating the face area may be superimposed on the image displayed on the display unit 15 to inform the user of the digital camera 10′ of the detection result.

While past shooting with sound is in progress, the control unit 20 repeatedly determines whether the shutter key has been fully pressed (step S504) and whether the half-press of the shutter key has been released (step S505).

By repeating steps S503 to S505 in this way, the control unit 20 continues the cyclic recording of past shooting while waiting for the shutter key to be fully pressed or for the half-press to be released.

If the half-press of the shutter key is released during past shooting, the control unit 20 detects this in step S505, erases all of the image data held in the image buffer 14 and the audio data held in the audio buffer 18 by the past shooting so far (step S506), and returns to step S503.

If it is determined in step S504 that the shutter key has been fully pressed, the control unit 20 starts acquiring still images with sound as the main shooting, continuing on from the past shooting (step S507).

During the main shooting with sound, the control unit 20 determines whether either the full press of the shutter key has been released or a fixed time, for example 8 seconds, has elapsed since the full press began (step S508).

If neither condition holds, processing returns to step S507, and the main shooting with sound continues while the control unit 20 waits for the full press to be released or for the fixed time to elapse.

When the full press of the shutter key is released or the fixed time elapses, the control unit 20 detects this in step S508 and stops the main shooting at that point (step S509).
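
The capture flow of steps S503 to S509 can be modelled as a small state machine around a circular buffer. The sketch below is an illustration under assumptions rather than the camera's firmware: one shutter state is supplied per captured frame, the fixed time limit (8 seconds in the text) is modelled as a hypothetical frame count `max_main_frames`, and the buffer size `past_capacity` and the function name `run_capture` are likewise invented for the example.

```python
from collections import deque


def run_capture(events, past_capacity=3, max_main_frames=5):
    """Replay shutter states ("half", "full", or anything else for released),
    one per captured frame. Return the frame numbers retained for selection,
    or None if main shooting never took place."""
    past = deque(maxlen=past_capacity)  # circular buffer for past shooting
    main = []                           # frames taken during main shooting
    frame_no = 0
    for state in events:
        frame_no += 1
        if main:
            # Main shooting in progress (S507-S508): stop on release
            # of the full press or when the frame limit is reached (S509).
            if state != "full" or len(main) >= max_main_frames:
                break
            main.append(frame_no)
        elif state == "half":
            past.append(frame_no)       # S503: oldest frame is overwritten
        elif state == "full":
            main.append(frame_no)       # S504 -> S507: main shooting starts
        else:
            past.clear()                # S505 -> S506: discard past frames
    return list(past) + main if main else None


retained = run_capture(["half"] * 5 + ["full"] * 2 + ["released"])
```

Here five half-press frames fill and overwrite the three-slot buffer, so only the last three past frames are retained together with the two main-shooting frames.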

Next, based on the series of audio data held in the audio buffer 18 by the past shooting and the main shooting, the control unit 20 recognizes the facial expression of the subject and selects the one-press shot image from the series of image data held in the image buffer 14 (step S510).

FIG. 7 shows the subroutine for selecting the one-press shot image with the subject's facial expression taken into account.
At the start, the control unit 20 causes the image processing unit 13 to execute face recognition on the image data of the first frame of the main shooting held in the image buffer 14, identifying the faces of the subjects appearing in the image (step S701).

Based on the result of step S701, it is first determined whether at least one face of a person was detected by the face recognition technique (step S702).
Because face recognition is a well-known technique, as explained in the first embodiment, a detailed description is omitted.

If no face could be detected, an image cannot be selected according to the subject's facial expression. In that case, the audio data held in the audio buffer 18 is not used; the image processing unit 13 is made to select, from the series of image data stored in the image buffer 14 alone, the image in which the subject is least blurred (step S703), and the subroutine of FIG. 7 ends.

If the control unit 20 determines in step S702 that the image processing unit 13 detected at least one face, it next determines whether more than one face was detected (step S704).

Only if more than one face was detected is the face with the largest face area selected from among them (step S705).

For the selected face area, the image processing unit 13 then calculates a smile level for the same face area in each image of the series, and searches for image data whose smile level is equal to or higher than a preset smile level (step S706).

Various methods of calculating the smile level are known, and the digital camera according to this embodiment may use any of them.
For example, the method described in Japanese Patent Laid-Open No. 2004-046591 can be used. With this method, the image processing unit 13, under the control of the control unit 20, refers to a smile evaluation criterion table stored in advance in the program memory 22 to calculate evaluation points for the shapes of the eyebrows, pupils, and lips that make up the face image, weights each evaluation point by a predetermined coefficient, and sums the results to obtain the smile level of the face in the image.
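
The weighted-sum step of this smile calculation can be written out as a short sketch. The coefficients and evaluation points below are hypothetical: the text only says that each element's evaluation point is weighted by a predetermined coefficient and the results are summed, and it does not say whether the sum is normalised. Dividing by the total weight is an added assumption that keeps the score on the same scale as the points.

```python
# Hypothetical weighting coefficients for each facial element.
WEIGHTS = {"eyebrows": 2, "pupils": 3, "lips": 5}


def smile_score(points, weights=WEIGHTS):
    """Weighted sum of per-element evaluation points, divided by the
    total weight (the normalisation is an assumption of this sketch)."""
    total = sum(weights[part] * points[part] for part in weights)
    return total / sum(weights.values())


# Hypothetical evaluation points looked up from a criterion table.
score = smile_score({"eyebrows": 50, "pupils": 60, "lips": 90})
```

How the evaluation points themselves are derived from the shapes of the eyebrows, pupils, and lips depends on the evaluation criterion table and is outside this sketch.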

Next, for each item of image data found by the search, that is, each item whose smile level is equal to or higher than the preset level, the volume level of the sound reproduced from the corresponding audio data held in the audio buffer 18 is detected (step S707).

The audio data with the highest detected volume level is then identified, and the image data corresponding to that audio data is selected from the image buffer 14 (step S708).

FIGS. 8A to 8D illustrate the result of the search in step S706, in which the image processing unit 13, under the control of the control unit 20, extracts from the series of image data held in the image buffer 14 the image data whose smile level is equal to or higher than the preset level.

Suppose that, among the image data found in this way, the image with the highest audio volume level is the one shown in FIG. 8C, so that the control unit 20 obtains it as the detection result, as shown in FIG. 8E.
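
Steps S705 to S708 can be summarised as a two-stage selection: filter the frames by smile level, then take the loudest of the survivors. The sketch below assumes the smile level and volume have already been computed per frame; the `Frame` record, the function names, and all numeric values are hypothetical, and returning `None` when no frame passes the threshold is an assumption, since the text does not describe that case.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    name: str
    smile: float   # smile level of the tracked face in this frame
    volume: float  # volume level of the accompanying audio


def largest_face(faces):
    """S705: index of the face rectangle (x, y, w, h) with the largest area."""
    return max(range(len(faces)), key=lambda i: faces[i][2] * faces[i][3])


def select_frame(frames, smile_threshold):
    """S706-S708: keep frames at or above the smile threshold, then return
    the one with the loudest audio (None if no frame qualifies)."""
    candidates = [f for f in frames if f.smile >= smile_threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f.volume)


# Hypothetical values arranged to mirror FIGS. 8A to 8E.
frames = [
    Frame("8A", 0.80, 0.2),
    Frame("8B", 0.90, 0.4),
    Frame("8C", 0.85, 0.7),
    Frame("8D", 0.95, 0.3),
    Frame("not smiling", 0.10, 0.9),
]
chosen = select_frame(frames, smile_threshold=0.75)
```

The loudest frame overall is rejected because its smile level is below the threshold, so the result is the frame labelled 8C rather than the one with the highest raw volume.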

With this selection of image data, the subroutine of FIG. 7 ends and processing returns to the main routine of FIG. 6.
In the main routine of FIG. 6, the control unit 20 reads the selected image data from the image buffer 14, and the compression/decompression processing unit 27 compresses it in a predetermined data file format to create an image data file, which is then recorded on the memory card 32, the recording medium, via the memory card controller 28 (step S511).
This completes shooting with the one-press shot function based on audio data, performed mainly by the control unit 20, and processing returns to step S501 in preparation for the next shot.

According to this embodiment, the intention of the user taking the picture is reflected, and a scene rich in changes in the facial expression of the subject that the user wants can be determined more accurately from the continuously shot images, in combination with detection of the volume level, and recorded.

Although the above embodiment was described as using face recognition to detect the subject's smile level as the preset facial expression, the present invention is not limited to this; for example, the degree of a crying face or an angry face may be quantified and set instead.

When the user selects in advance a facial expression that the subject may show and its degree, the control unit 20 accepts the selection and stores it in the program memory 22. During actual shooting, the control unit 20 reads the stored setting information, judges the user's shooting intention more accurately, and selects and records the most suitable image from the series of image data.

Both the first and second embodiments were described as applied to a digital camera that captures images, but the present invention is not limited to this. It is equally applicable to digital video cameras that can also capture still images, mobile phone terminals with a camera function, other imaging devices, and digital photo data storage devices that have no shooting function but record a series of image data with audio data.

The present invention is not limited to the embodiments described above and can be modified in various ways at the implementation stage without departing from its gist. The functions executed in the above embodiments may be combined as appropriate wherever possible. The embodiments include various stages, and various inventions can be extracted by appropriately combining the disclosed constituent elements. For example, even if some constituent elements are deleted from all of those shown in an embodiment, the configuration without those elements can be extracted as an invention as long as the effect is obtained.

Description of symbols: 10, 10′ … digital camera, 11 … optical lens unit, 12 … CMOS image sensor, 13 … image processing unit, 14 … image buffer, 15 … display unit, 16 … microphone, 17 … audio processing unit, 18 … audio buffer, 19 … speaker, 20 … control unit, 21 … main memory, 22 … program memory, 23 … key input unit, 24 … lens drive unit, 25 … flash drive unit, 26 … CMOS drive unit, 27 … compression/decompression processing unit, 28 … memory card controller, 29 … USB interface, 30 … lens stepping motor (M), 31 … flash unit, 32 … memory card, 33 … card connector, 34 … USB connector, SB … system bus.

Claims

An image processing apparatus comprising:
first storage means for storing audio frequency band information indicating audio frequency bands respectively corresponding to a plurality of subject types;
acquisition means for acquiring a plurality of image data accompanied by audio data;
recognition means for recognizing the type of subject appearing in the image represented by each of the plurality of image data acquired by the acquisition means;
setting means for reading, from the first storage means, the audio frequency band information corresponding to the subject type recognized by the recognition means, and setting the audio frequency band indicated by the read audio frequency band information; and
extraction means for extracting, from the plurality of image data acquired by the acquisition means, image data accompanied by audio data that reproduces sound in which the volume of the sound belonging to the audio frequency band set by the setting means is determined to be equal to or higher than a prescribed volume.

The image processing apparatus according to claim 1, further comprising selection means for selecting one of the recognized subject types when the recognition means recognizes a plurality of subject types,
wherein the setting means reads, from the first storage means, the audio frequency band information corresponding to the one type selected by the selection means from among the plurality of subject types recognized by the recognition means, and sets the audio frequency band indicated by the read audio frequency band information.

The image processing apparatus according to claim 2, further comprising second storage means for storing priorities respectively corresponding to the plurality of subject types,
wherein, when the recognition means recognizes a plurality of different subject types, the selection means selects, from among the recognized subject types, the one type with the highest priority in the priority order stored in the second storage means.

The image processing apparatus according to claim 2, further comprising instruction means for designating any one of a plurality of different subject types,
wherein, when the recognition means recognizes a plurality of different subject types, the selection means selects, from among the recognized subject types, the one type designated by the instruction means.

The image processing apparatus according to claim 1, further comprising determination means for arbitrarily determining the prescribed volume that serves as the determination criterion of the extraction means.

The image processing apparatus according to claim 1, wherein the extraction means extracts the image data accompanied by the audio data in which the volume of the sound belonging to the audio frequency band set by the setting means is at its maximum.

The image processing apparatus according to claim 1, wherein the plurality of types of subjects include at least one of a male, a female, an adult, and a child.

An image processing apparatus comprising:
acquisition means for acquiring a plurality of image data accompanied by audio data;
detection means for detecting the facial expression of the subject in each of the images represented by the plurality of image data acquired by the acquisition means;
first extraction means for extracting image data in which the facial expression of the subject detected by the detection means is a preset facial expression;
determination means for determining whether or not the volume of the sound reproduced from the audio data attached to the image data extracted by the first extraction means is equal to or higher than a prescribed volume; and
second extraction means for extracting, from the image data extracted by the first extraction means, image data accompanied by audio data that the determination means has determined to reproduce sound at a volume equal to or higher than the prescribed volume.
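
The two-stage extraction in the claim above (facial expression first, then volume) can be sketched as follows. This is a minimal illustration under stated assumptions: each frame is a hypothetical dictionary with an already-detected `expression` label and raw `audio` samples, and volume is measured as a plain RMS value, which the claim does not prescribe.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude of a list of audio samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def extract_by_expression_and_volume(frames, preset_expression, prescribed_volume):
    """Filter by expression first, then by audio volume."""
    # First extraction: keep frames showing the preset facial expression.
    first_pass = [f for f in frames if f["expression"] == preset_expression]
    # Determination and second extraction: keep frames that are loud enough.
    return [f for f in first_pass if rms_volume(f["audio"]) >= prescribed_volume]
```

Running the expression filter first means the volume test is only evaluated on frames that already match the preset expression, which follows the order of the claimed means.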

The image processing apparatus according to claim 8, further comprising selection means for arbitrarily selecting, from a plurality of candidates, the preset facial expression used as the extraction criterion by the first extraction means.

A program executed by a computer incorporated in an image processing apparatus, the program causing the computer to function as:
first storage means for storing audio frequency band information indicating audio frequency bands respectively corresponding to a plurality of types of subjects;
acquisition means for acquiring a plurality of image data accompanied by audio data;
recognition means for recognizing the type of subject in each of the images represented by the plurality of image data acquired by the acquisition means;
setting means for reading, from the content stored in the first storage means, the audio frequency band information corresponding to the type of subject recognized by the recognition means, and setting the audio frequency band indicated by the read audio frequency band information; and
extraction means for extracting, from the plurality of image data acquired by the acquisition means, image data accompanied by audio data that reproduces sound in which the volume of the sound belonging to the audio frequency band set by the setting means is determined to be equal to or higher than a prescribed volume.

A program executed by a computer incorporated in an image processing apparatus, the program causing the computer to function as:
acquisition means for acquiring a plurality of image data accompanied by audio data;
recognition means for recognizing the facial expression of the subject in each of the images represented by the plurality of image data acquired by the acquisition means;
first extraction means for extracting image data in which the facial expression of the subject recognized by the recognition means is a preset facial expression;
determination means for determining whether or not the volume of the sound reproduced from the audio data attached to the image data extracted by the first extraction means is equal to or higher than a prescribed volume; and
second extraction means for extracting, from the image data extracted by the first extraction means, image data accompanied by audio data that the determination means has determined to reproduce sound at a volume equal to or higher than the prescribed volume.