JP2012151544A

JP2012151544A - Imaging apparatus and program

Info

Publication number: JP2012151544A
Application number: JP2011006759A
Authority: JP
Inventors: Yoshiki Ishige; 善樹石毛
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2011-01-17
Filing date: 2011-01-17
Publication date: 2012-08-09

Abstract

PROBLEM TO BE SOLVED: To achieve a rich acoustic effect with a heightened sense of presence even when only a monaural microphone can be mounted. SOLUTION: A face detector 7 detects the face of a subject in an image captured by an imaging unit 6. A controller 1 identifies the position of the subject's face in the captured image as a sound source position, and then converts the monaural sound collected by a sound collector 9 through a monaural microphone 8 into pseudo-stereo according to the identified sound source position. In the pseudo-stereo conversion, the volume difference between the left and right channels is adjusted to match the ratio of the distances of the sound source position from the left and right ends of the screen.

Description

The present invention relates to an imaging apparatus provided with imaging means for capturing moving images, and to a program.

Conventionally, digital cameras (compact cameras) that offer simple video and audio recording often use a monaural microphone for reasons such as cost, but recordings made with a monaural microphone tend to lack a sense of presence.
To provide a richer acoustic effect, a technique (camera, playback device, and playback method) has been disclosed in which a face detection unit analyzes the captured image, and the sound collection range is narrowed when a subject (person) is determined to be present in the captured image for a predetermined time and widened when no person is determined to be present for a predetermined time (see Patent Document 1). Another disclosed technique (videophone) extracts the contour of a person's face by analyzing the captured image and increases the audio amplification factor as the face becomes smaller (see Patent Document 2).

Patent Document 1: JP 2010-93603 A
Patent Document 2: Japanese Patent Laid-Open No. 05-64181

However, the technique of Patent Document 1 adjusts the sound collection range by controlling the directivity of the microphone according to whether a person is present, and adjusting the sound collection range alone cannot heighten the sense of presence of the sound. Similarly, the technique of Patent Document 2 adjusts the audio amplification factor according to the distance to the person, and adjusting the amplification factor alone cannot heighten the sense of presence of the sound.

An object of the present invention is to heighten the sense of presence of sound and achieve a rich acoustic effect even when only a monaural microphone can be mounted.

To solve the above problem, one aspect of the present invention is
an imaging apparatus provided with imaging means for capturing moving images, comprising:
sound collecting means for collecting monaural sound;
detecting means for detecting the face of a subject in an image captured by the imaging means;
position specifying means for specifying, while monaural sound is being collected by the sound collecting means, the position in the captured image of the subject's face detected by the detecting means as a sound source position; and
audio control means for converting the monaural sound collected by the sound collecting means into pseudo-stereo according to the sound source position in the captured image specified by the position specifying means.

To solve the above problem, another aspect of the present invention is
a program for causing a computer to realize:
a function of capturing moving images;
a function of collecting monaural sound;
a function of detecting the face of a subject in a captured image;
a function of specifying, while the monaural sound is being collected, the position in the captured image of the detected face of the subject as a sound source position; and
a function of converting the collected monaural sound into pseudo-stereo according to the specified sound source position in the captured image.

According to the present invention, even when only a monaural microphone can be mounted, the sense of presence of the sound can be heightened to achieve a rich acoustic effect, which is also advantageous in terms of cost and power consumption.

FIG. 1 is a block diagram showing the basic components of a digital camera applied as an imaging apparatus.
FIG. 2 is a flowchart outlining the operation of the characteristic part of this embodiment within the overall operation of the camera.
FIG. 3 is a flowchart showing the operation continuing from FIG. 2.
FIG. 4 is a flowchart detailing the pseudo-stereo processing (step A21 in FIG. 3).
FIG. 5 (1) to (5) are diagrams concretely showing the features of this embodiment, illustrating how the sound source position is specified according to the number of faces of subjects (for example, persons) and the movement of their mouths.
FIG. 6 (1) to (4) are diagrams concretely showing the features of this embodiment, illustrating how the sound source position is specified according to the number of faces of subjects (for example, persons) and the movement of their mouths.

An embodiment of the present invention is described below with reference to FIGS. 1 to 6.
FIG. 1 is a block diagram showing the basic components of a digital camera applied as an imaging apparatus.
This digital camera (imaging apparatus) is, for example, a compact camera capable of capturing moving images as well as still images. It has a sound collection function that collects the surrounding sound in monaural during video capture and a recording function that converts the collected monaural sound into pseudo-stereo before recording it, and it is built around a control unit 1. In this embodiment, "pseudo-stereo conversion", described in detail later, means converting monaural sound into stereo in a pseudo manner by adjusting the volume difference between the left and right channels to match the ratio of the distances of the sound source position from the left and right ends of the screen, although it is of course not limited to this. The control unit 1 operates on power supplied from a power supply unit (secondary battery) 2 and controls the overall operation of the digital camera (hereinafter simply "camera") according to various programs in a storage unit 3. The control unit 1 includes a CPU (central processing unit), memory, and the like (not shown).

The storage unit 3 includes, for example, ROM and flash memory. It has a program storage unit M1 that stores the programs for realizing this embodiment according to the operation procedures shown in FIGS. 2 to 4 (described later) along with various applications; an image storage unit M2 that stores captured images (still images and moving images) classified by folder or the like; and an audio storage unit M3 that stores, in association with a captured moving image, the ambient sound of the camera (such as sound from the subject) collected during that video capture. It also has a work area that temporarily stores various information needed for the camera to operate (for example, the sound-source-present flag and timer described later). The storage unit 3 may include a removable portable memory (recording medium) such as an SD card or IC card and, although not shown, may include a storage area on a predetermined external server when the camera is connected to a network via a communication function.

The display unit 4 uses, for example, a high-definition liquid crystal display, an organic EL (electroluminescence) display, or an electrophoretic display (electronic paper) to display images in high definition. It functions as a finder screen (monitor screen) that displays the captured image (through image: live view image) and as a playback screen that displays saved images. A transparent touch panel that detects finger contact may be laminated on the surface of the display unit 4 to form, for example, a capacitive touch screen. The operation unit 5, although not shown in detail, has various push-button keys for shutter operation, video capture operation, setting of shooting conditions such as exposure and shutter speed, and playback operation instructing playback of captured images. In response to input operation signals from the operation unit 5, the control unit 1 performs, for example, shooting processing, setting of shooting conditions, playback processing, and data input processing.

The imaging unit 6, although not shown in detail, constitutes a camera section that can capture a subject in high definition by forming the subject image from an optical lens on an image sensor (such as a CCD or CMOS sensor), and can capture moving images as well as still images. The imaging unit 6 and the control unit 1 together execute autofocus processing (AF processing), zoom processing, and exposure adjustment processing (AE processing); apply image processing such as white balance adjustment, color interpolation, and aberration correction to the image data from the imaging unit 6; compress images into an image format such as JPEG; and decompress and restore the compressed data.

The face detection unit 7 provides a subject recognition function that tracks and recognizes a subject in the captured image. It takes in captured images from the imaging unit 6 at the maximum frame rate (for example, 300 fps) and, analyzing the captured image frame by frame, detects the subject's face by comprehensively judging the contour of the face of the subject (for example, a person or pet) and the shapes and positional relationships of the parts that form the face (eyes, mouth, nose, forehead, and so on). When the captured image contains multiple subjects, a face is detected for each subject. Face detection is a technique commonly used in cameras, and this embodiment uses that well-known technique, so a specific description is omitted.

The face detection unit 7 also detects whether the mouth of a face is moving while tracking the face of the same subject as it appears continuously across multiple frames. That is, mouth movement is detected in order to identify the face of a subject whose mouth is moving as the sound source. If the captured image contains multiple subjects, the face of each subject is detected and mouth movement is checked for each face. If only one subject has a moving mouth, the control unit 1 specifies the position of that subject's face in the captured image as the sound source position. In this case, the center position (X coordinate value) of the subject's mouth in the left-right direction of the captured image (for example, the X-axis direction of a plane coordinate system) is specified as the sound source position. If multiple subjects have moving mouths, the intermediate position (X coordinate value) between the mouths of those faces is specified as the sound source position.
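The rule for choosing the sound source position can be sketched as follows. This is only an illustrative sketch: the `Face` type, its fields, and the function name are hypothetical and do not come from the patent, and reading "intermediate position" as the midpoint between the leftmost and rightmost moving mouths is an assumption for the case of three or more speakers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Face:
    """Hypothetical result of face detection for one subject."""
    mouth_left: float    # X coordinate of the left edge of the mouth region
    mouth_right: float   # X coordinate of the right edge of the mouth region
    mouth_moving: bool   # True if the mouth was judged to move across frames

    @property
    def mouth_center_x(self) -> float:
        # Center of the mouth in the left-right (X) direction.
        return (self.mouth_left + self.mouth_right) / 2.0


def sound_source_x(faces: Sequence[Face]) -> Optional[float]:
    """Return the X coordinate of the sound source, or None if no mouth moves."""
    centers = [f.mouth_center_x for f in faces if f.mouth_moving]
    if not centers:
        return None
    if len(centers) == 1:
        # One speaking subject: use the center of that subject's mouth.
        return centers[0]
    # Several speaking subjects: use the intermediate position.
    # Assumption: midpoint between the leftmost and rightmost moving mouths.
    return (min(centers) + max(centers)) / 2.0
```

Faces whose mouths are not moving are ignored, so a silent bystander in the frame does not shift the result.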

The monaural microphone 8 is, for example, a non-directional (omnidirectional) monaural microphone for recording the sound around the camera (sound from the subject and its surroundings) during video capture. The audio signal collected evenly with uniform sensitivity by the monaural microphone 8 is supplied to the sound collection unit 9. The sound collection unit 9 processes the sound collected by the monaural microphone 8 and, although not shown, includes an A/D (analog/digital) converter, an audio buffer, an audio signal processing circuit, and the like. During video capture, the control unit 1 performs audio control that converts the sound (monaural sound) from the sound collection unit 9 into pseudo-stereo according to the sound source position (the position of the face of the subject whose mouth is moving), and records the resulting stereo sound in association with the captured video. In this embodiment, the ratio of the distances of the sound source position from the left and right ends of the screen is calculated in the left-right direction of the captured image, and the collected monaural sound is converted into pseudo-stereo by adjusting the volume difference between the left and right channels to match the calculated ratio. The left and right ends are the positions of both ends of the screen (shooting range) in the left-right direction.

Next, the operating concept of the camera in this embodiment is described with reference to the flowcharts shown in FIGS. 2 to 4. Each function described in these flowcharts is stored in the form of readable program code, and operations according to this program code are executed sequentially. Operations according to such program code transmitted via a transmission medium such as a network can also be executed sequentially. That is, the operations specific to this embodiment can also be executed using programs and data supplied externally via a transmission medium rather than from a recording medium.

FIGS. 2 and 3 are flowcharts outlining the operation of the characteristic part of this embodiment within the overall operation of the camera; execution starts when the power is turned on. The characteristic part of this embodiment is described concretely below with reference to FIGS. 5 and 6, which illustrate how the sound source position is specified according to the number of faces of subjects (persons) in the captured image and the movement of their mouths.
First, the control unit 1 performs initialization processing such as clearing the contents of a predetermined memory and clearing the sound source position (not shown) in the work area of the storage unit 3 (step A1 in FIG. 2). The sound source position is the one stored by the pseudo-stereo processing described later (step B6 in FIG. 4) during the previous video capture, and all stored sound source positions are cleared by the initialization at power-on. The camera then waits for an operation while checking whether a video capture operation has been performed (step A2) and whether any other operation has been performed (step A3).

When an operation other than the video capture operation is performed (YES in step A3), for example a still image capture operation (shutter operation) or a playback operation instructing playback of captured images, the corresponding processing, such as still image capture processing or image playback processing, is performed (step A4), and the flow returns to step A2. When a power-off operation is performed as one of the other operations, power-off processing is performed (not shown) and this flow ends.

When a video capture operation is performed (YES in step A2), the sound-source-present flag described later is turned off by setting its value to "0" (step A5). The imaging unit 6 is then activated to start video capture, and video capture processing is performed in which moving images are acquired from the imaging unit 6 and stored in the image storage unit M2 while being compressed by a predetermined compression process (step A6). Once video capture has started, the control unit 1 checks whether sound at or above a predetermined sound pressure (sound at or above the minimum recordable level) has been collected from the monaural microphone 8 via the sound collection unit 9 (step A7), whether a video capture end operation has been performed (step A11), and whether any other shooting operation has been performed (step A13).
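The sound pressure check of step A7 might be approximated as a level test on a buffer of samples. The patent does not say how the level is measured, so the RMS measure, the dBFS scale, the function name, and the default threshold below are all assumptions made for illustration.

```python
import math
from typing import Sequence


def exceeds_sound_threshold(samples: Sequence[float],
                            threshold_dbfs: float = -50.0) -> bool:
    """Return True if the buffer's RMS level reaches the threshold.

    samples        -- mono samples normalized to the range [-1.0, 1.0]
    threshold_dbfs -- hypothetical "minimum recordable level" in dB full scale
    """
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        # Digital silence has no finite dB level; treat it as below threshold.
        return False
    level_dbfs = 20.0 * math.log10(rms)
    return level_dbfs >= threshold_dbfs
```

A buffer that fails this test corresponds to the NO branch of step A7, where capture simply continues without any voice or face analysis.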

If no sound at or above the predetermined sound pressure is collected during video capture (NO in step A7) and no operation is performed (NO in step A11 and NO in step A13), the flow returns to step A6 and video capture continues. When another operation such as a zoom or exposure adjustment is performed (YES in step A13), the corresponding processing, such as zoom or exposure adjustment, is performed (step A14), and the flow returns to step A6 to continue video capture. When a video capture end operation is performed (YES in step A11), video capture end processing is executed (step A12), and the flow returns to step A2 to wait for an operation.

When sound at or above the predetermined sound pressure is collected during video capture (YES in step A7), the voice of the subject (person, pet, or the like) is extracted from the collected sound (step A8), and it is checked whether the subject's voice is included (step A9). Here a comprehensive judgment is made in consideration of, for example, the frequency characteristics of the voices of persons and pets in order to extract only the subject's voice; however, the subject's voice need not be isolated perfectly as long as it can be extracted predominantly. This extraction technique is well known, so a specific description is omitted. If the collected sound does not include the subject's voice (NO in step A9), recording processing is performed in which the collected sound is recorded as-is in the audio storage unit M3 as monaural sound (step A10). The monaural sound is recorded in association with the video recorded in the image storage unit M2. The flow then moves to the processing corresponding to operations (steps A11 to A14).

If the collected sound includes the subject's voice (YES in step A9), the flow moves to FIG. 3, and the face detection unit 7 is operated to perform face detection processing (step A15). The face detection unit 7 detects the faces of subjects in the captured image by analyzing the captured image acquired from the imaging unit 6. It is then checked whether the face detection unit 7 has detected a subject's face (step A16). If no face is detected (NO in step A16), the flow moves to step A24 described later. If a face is detected (YES in step A16), it is checked whether multiple faces have been detected, that is, whether the captured image contains multiple subjects whose faces were detected (step A17).

If a single face rather than multiple faces has been detected (NO in step A17), it is checked whether exactly one face has a moving mouth, that is, whether the mouth of the detected face is moving (step A22). If the mouth of the detected face is not moving (NO in step A22), that is, if a voice is being produced during shooting but analysis of the captured image finds no mouth movement, for example because the movement was too small, the flow moves to step A24 described later. If the mouth of the detected face is moving (YES in step A22), a subject acting as the sound source is taken to be present in the captured image, and the position of its mouth is specified as the sound source position (step A23). The center position of the mouth in the left-right direction is used as the sound source position.

How the sound source position is specified is now described more concretely with reference to FIG. 5 (1) to (4). FIG. 5 (1) illustrates a case where a single face has been detected, its mouth is moving, and the ratio of the mouth position from the left and right ends of the screen is "1:0.33". Because only one subject has a moving mouth, the position of that mouth is specified as the sound source position. In the figure, taking the distance from the right end of the screen (shooting range) to the sound source position as "1", the distance from the left end of the screen to the sound source position is "0.33". FIG. 5 (2) illustrates a case where a single face has been detected, its mouth is moving, and the ratio of the mouth position from the left and right ends of the screen is "1:1". The sound source position (mouth) is specified at the center of the screen, as indicated by the broken (vertical) line in the figure.

FIG. 5 (3) illustrates a case where a single face has been detected, its mouth is moving, and the ratio of the mouth position from the left and right ends of the screen is "0.14:1". The sound source position (mouth) is specified at a position toward the right side of the screen, as indicated by the broken (vertical) line in the figure; taking the distance from the left end of the screen to the sound source position as "1", the distance from the right end of the screen to the sound source position is "0.14". FIG. 5 (4) illustrates a case where a single face has been detected, its mouth is moving, and the ratio of the mouth position from the left and right ends of the screen is "0:1". The sound source position (mouth) is specified at the right end of the screen, as indicated by the broken (vertical) line in the figure; taking the distance from the left end of the screen to the sound source position as "1", the distance from the right end to the sound source position is "0".
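The ratios quoted for FIG. 5 (1) to (4) can be reproduced with a small calculation. In the sketch below, the first term is the distance to the right end and the second is the distance to the left end, each scaled so that the larger is 1, which is how the figure explanations read. The function name and the choice of units for `x` and `width` are hypothetical.

```python
from typing import Tuple


def edge_ratio(x: float, width: float) -> Tuple[float, float]:
    """Return (right_term, left_term) for a sound source at X position x.

    x     -- sound source position measured from the left end of the screen
    width -- width of the screen (shooting range), in the same units as x
    """
    if width <= 0:
        raise ValueError("width must be positive")
    x = min(max(x, 0.0), width)      # keep the position inside the screen
    d_left = x                       # distance from the left end
    d_right = width - x              # distance from the right end
    longest = max(d_left, d_right)   # at least width / 2, so never zero
    return (d_right / longest, d_left / longest)
```

With a hypothetical 100-unit-wide frame, `edge_ratio(50, 100)` gives `(1.0, 1.0)`, the "1:1" case of FIG. 5 (2), and `edge_ratio(100, 100)` gives `(0.0, 1.0)`, the "0:1" case of FIG. 5 (4).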

Next, the sound-source-present flag is turned on by setting its value to "1" (step A20 in FIG. 3), and pseudo-stereo processing is performed (step A21). The sound-source-present flag indicates that a subject acting as a sound source is present in the captured image; its value becomes "1" when a face with a moving mouth is detected in the captured image. The pseudo-stereo processing, detailed later, stores the specified sound source position and converts the collected monaural sound into pseudo-stereo for storage. After the pseudo-stereo processing is executed, the flow moves to step A11 in FIG. 2.

FIG. 4 is a flowchart detailing the pseudo-stereo processing (step A21 in FIG. 3).
First, to convert the subject's voice (for example, a human voice) collected as described above into pseudo-stereo, the control unit 1 splits the subject's voice equally into left and right channels to form a stereo signal (step B1). It then calculates the ratio of the distances of the specified sound source position from the left and right ends of the screen in the left-right direction of the captured image (step B2) and adjusts the volume difference between the left and right channels to match the calculated ratio (step B3). For example, the ratio calculated is "1:0.33" in the case of FIG. 5 (1) and "1:1" in the case of FIG. 5 (2), and the volume difference between the left and right channels is adjusted to match that ratio. Similarly, "0.14:1" is calculated in the case of FIG. 5 (3) and "0:1" in the case of FIG. 5 (4), and the volume difference between the left and right channels is adjusted to match that ratio.
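Steps B1 to B3 might look like the following in code. The patent states only that the volume difference is adjusted to match the ratio; using the two ratio terms directly as linear gains (left gain from the right-end distance, right gain from the left-end distance) is an assumption of this sketch, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def pseudo_stereo(voice: Sequence[float], x: float,
                  width: float) -> Tuple[List[float], List[float]]:
    """Convert a mono voice signal into (left, right) channel sample lists.

    voice -- mono samples of the subject's voice
    x     -- sound source position measured from the left end of the screen
    width -- width of the screen (shooting range), same units as x
    """
    if width <= 0:
        raise ValueError("width must be positive")
    x = min(max(x, 0.0), width)

    # Step B1: split the mono voice equally into left and right channels.
    left = list(voice)
    right = list(voice)

    # Step B2: ratio of the distances from the right and left ends,
    # scaled so that the larger term is 1.
    d_left, d_right = x, width - x
    longest = max(d_left, d_right)
    gain_left = d_right / longest    # source near the left end: left is louder
    gain_right = d_left / longest    # source near the right end: right is louder

    # Step B3: apply the ratio as a level difference (assumed linear gains).
    left = [s * gain_left for s in left]
    right = [s * gain_right for s in right]
    return left, right
```

For a source at the center of the screen both gains are 1, so the output is the same signal on both channels; for a source at the right end the left channel is silent, matching the "0:1" case.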

The pseudo-stereo voice of the subject (the left- and right-channel audio) is then combined, channel by channel, with the audio other than the subject's voice (the monaural ambient sound) (step B4). The resulting combined audio for the left and right channels is recorded in the audio storage unit M3 in association with the moving image in the image storage unit M2 (step B5), and the specified sound source position is stored in the work area of the storage unit 3 in association with this combined audio (step B6). The process then exits the flow of FIG. 4 and returns to step A11 in FIG. 2.
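The combining in step B4 can be sketched as follows. This is only an illustration under stated assumptions: audio is represented as plain lists of float samples, the voice and the ambient sound have already been separated, the same ambient samples are added unchanged to both channels, and the per-channel gains come from some earlier panning step. The function name `pseudo_stereo_mix` is hypothetical.

```python
def pseudo_stereo_mix(voice, ambient, gain_left, gain_right):
    """Pan the monaural voice with per-channel gains, then add the
    monaural ambient sound equally to both channels.

    Assumption: voice and ambient are equal-length lists of float samples.
    """
    if len(voice) != len(ambient):
        raise ValueError("voice and ambient must have the same length")
    left = [gain_left * v + a for v, a in zip(voice, ambient)]
    right = [gain_right * v + a for v, a in zip(voice, ambient)]
    return left, right
```

Because only the voice is scaled, the ambient sound stays centered while the voice appears shifted toward the side of the screen where the subject is.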

On the other hand, if no subject face is detected, that is, if no subject acting as a sound source is present (NO in step A16 in FIG. 3), the process proceeds to step A24, where it is checked whether the value of the sound source presence flag is "1", that is, whether the face of a subject acting as a sound source has already been detected in the captured image. If the flag is "1" (YES in step A24), that is, if the subject's face had been detected during moving-image capture but the subject left the screen (capture range) partway through, the sound source position stored the previous time by the pseudo-stereo processing (step A21) is used unchanged as the current sound source position (step A25). FIG. 5(5) illustrates a case in which a person's face has moved off the right edge of the screen (capture range). As the broken (vertical) line in the figure shows, the sound source position in this case is held at the position immediately before the face left the right edge of the screen (the previous sound source position).

If no subject face is detected (NO in step A16) and the value of the sound source presence flag is "0" (NO in step A24), that is, if no subject face has been detected during moving-image capture, for example when a landscape or building is being captured and a voice is being collected from off screen, the center position of the screen of the display unit 4 (the center of the capture range) is specified as the sound source position (step A26). Likewise, if a subject's face is detected during the current moving-image capture (YES in step A16) but no face with a moving mouth has been present since the start of capture (NO in step A22), the sound source presence flag remains "0" (NO in step A24), so the process proceeds to step A26 and the center position of the screen is specified as the sound source position. FIG. 6(1) illustrates a case in which a single face is detected and its mouth has not moved since the start of capture; as the broken (vertical) line in the figure shows, the sound source position is set to the center of the screen. In this case the ratio of the sound source position from the left and right edges of the screen is "1:1", and because the volume difference between the left and right channels is adjusted to match this ratio, the left and right channels are given equal volume.

If a subject's face has been detected during moving-image capture (YES in step A16) but it is detected partway through capture that no face with a moving mouth remains (NO in step A22), that is, if the subject is still speaking but analysis of the captured image finds the mouth not to be moving, for example because its movement is small, the sound source presence flag remains "1" (YES in step A24). The process therefore proceeds to step A25, and the sound source position stored the previous time by the pseudo-stereo processing (step A21) is used unchanged as the current sound source position. FIG. 6(2) illustrates a case in which a single face is detected and its mouth stops moving partway through capture; the sound source position is set to the previous sound source position, where the mouth was moving. The illustrated example is one in which the subject stops moving the mouth without changing position.

On the other hand, if a plurality of subject faces are detected in the captured image (YES in step A17), it is checked whether a plurality of those faces have moving mouths (step A18). If not (NO in step A18), the process proceeds to step A22 described above to check whether a single face has a moving mouth. If none of the detected faces has a moving mouth (NO in step A22), the process proceeds to step A24 described above, and the same processing as described above is performed (steps A24 to A26).

If the mouth of any one of the subjects' faces is moving (YES in step A22), the center position of that mouth is specified as the sound source position in the same manner as described above (step A23), and the sound source presence flag is then turned on, setting its value to "1" (step A20). FIG. 6(3) illustrates a case in which a plurality of faces are detected but only one has a moving mouth; the position of the moving mouth is specified as the sound source position.

If there are a plurality of subject faces (YES in step A17) and a plurality of them have moving mouths (YES in step A18), the intermediate position of the plurality of mouths is specified as the sound source position (step A19). That is, when a plurality of subjects with moving mouths are present in the captured image, the position of the mouth is obtained for each subject, and the center of those mouth positions is specified as the sound source position. Next, the sound source presence flag is turned on, setting its value to "1" (step A20). Then, after the pseudo-stereo processing has been executed (step A21), the process proceeds to step A11 in FIG. 2. FIG. 6(4) illustrates a case in which a plurality of faces are detected and a plurality of them have moving mouths; as the broken (vertical) line in the figure shows, the intermediate position of the plurality of mouths is specified as the sound source position.
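The branching of steps A16 to A26 can be summarized in a short sketch. It is a simplification under stated assumptions: face and mouth detection are assumed to have already produced a list of horizontal mouth-center positions for the faces whose mouths are moving (an empty list covers both "no face" and "no moving mouth"), and the intermediate position of several mouths is taken to be the arithmetic mean, which the description does not specify. The function name and parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def locate_sound_source(
    moving_mouth_xs: List[float],
    source_flag: bool,
    previous_x: Optional[float],
    screen_width: float,
) -> Tuple[float, bool]:
    """Return (sound source position, updated sound source presence flag)."""
    if moving_mouth_xs:
        # One moving mouth: its own position (step A23).
        # Several moving mouths: their intermediate position (step A19),
        # assumed here to be the arithmetic mean.
        x = sum(moving_mouth_xs) / len(moving_mouth_xs)
        return x, True  # flag turned on (step A20)
    if source_flag and previous_x is not None:
        # A source was found earlier: hold the previous position (step A25).
        return previous_x, True
    # No source found so far: use the screen center (step A26).
    return screen_width / 2, False
```

A caller would keep the returned position and flag between frames and pass them back in as `previous_x` and `source_flag` on the next call.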

As described above, in the present embodiment the face of a subject is detected in the image captured by the imaging unit 6, the position of that face in the captured image is specified as the sound source position, and the monaural sound collected by the sound collection unit 9 through the microphone 8 is then converted into pseudo-stereo according to the specified sound source position. Therefore, even when only a monaural microphone can be mounted, the sense of presence of the audio can be increased and a sound effect rich in atmosphere can be realized, which is also advantageous in terms of cost and power consumption.

Because the pseudo-stereo conversion is performed by adjusting the volume difference between the left and right channels to match the ratio of the sound source position from the left and right edges of the screen, appropriate stereo conversion based on a volume difference is possible with simple processing.

When a subject's face is detected, its mouth is also detected, and if the mouth is moving, the position of that subject is specified as the sound source position. This avoids mismatches such as a case in which a subject's face is in the picture but the subject actually producing the sound is off screen, so the sound source can be specified appropriately.

When the faces of a plurality of subjects are detected, the position of the subject whose mouth is moving is specified as the sound source position. Therefore, even when a plurality of faces are detected, the direction of the captured subject and the direction from which the sound is heard can be made to substantially coincide.

When the faces of a plurality of subjects are detected and a plurality of those subjects have moving mouths, the intermediate position of the subjects whose mouths are moving is specified as the sound source position. Therefore, even when a plurality of captured subjects are producing sound at the same time, an appropriate position can be specified as the sound source.

When a subject's face can no longer be detected, the position immediately before detection was lost is specified as the sound source position. Therefore, even when a subject producing sound temporarily leaves the capture screen, an appropriate position can be specified as the sound source.

When a subject's face is detected and it is detected partway through capture that the mouth has stopped moving, the position immediately before, at which the mouth was moving, is specified as the sound source position. Therefore, even if the subject is still speaking but analysis of the captured image finds the mouth not to be moving, for example because its movement is small, an appropriate position can be specified as the sound source.

Because only the subject's voice is extracted from the collected sound and converted into pseudo-stereo, the subject's voice can be made stereo while avoiding ambient noise as far as possible.

In the embodiment described above, the pseudo-stereo conversion is performed by adjusting the volume difference between the left and right channels to match the ratio of the sound source position from the left and right edges of the screen. Alternatively, the pseudo-stereo conversion may be performed by adjusting the phase difference between the left and right channels to match that ratio, or by adjusting both the volume difference and the phase difference between the left and right channels. This enables more authentic stereo conversion and a greater sense of presence.
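The phase-difference variant is not detailed in the description. One possible reading, shown here purely as an assumption, is to delay the channel on the side farther from the source by a few samples, with the delay growing as the source moves away from the screen center. The function name `delay_pan` and the parameter `max_delay` (the delay, in samples, applied when the source is at a screen edge) are hypothetical.

```python
def delay_pan(mono, x, width, max_delay=8):
    """Create left/right channels that differ only by an inter-channel delay.

    Assumption: the channel farther from the source lags by up to
    max_delay samples; a centered source gives identical channels.
    """
    if width <= 0 or not 0 <= x <= width:
        raise ValueError("x must lie within [0, width] and width must be positive")
    offset = (x - width / 2) / (width / 2)  # -1 at left edge, +1 at right edge
    n = min(round(abs(offset) * max_delay), len(mono))
    # Shift the signal right by n samples, padding with silence at the start.
    delayed = [0.0] * n + list(mono[: len(mono) - n])
    direct = list(mono)
    if offset < 0:
        return direct, delayed  # source on the left: right channel lags
    return delayed, direct      # source on the right (or centered, n == 0)
```

A volume difference and a delay of this kind could also be combined, corresponding to the variant that adjusts both the volume difference and the phase difference.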

Although the embodiment described above is applied to a compact camera, it may also be applied to a mobile phone, PDA, music player, or similar device with a camera function.

In addition, the "apparatus" and "units" shown in the embodiment described above are not limited to a single housing and may be separated by function into a plurality of housings. Furthermore, the steps described in the flowcharts above are not limited to time-series processing; a plurality of steps may be processed in parallel or processed separately and independently.

Embodiments of the present invention have been described above; however, the invention is not limited to them and includes the invention described in the claims and its equivalents.
The inventions described in the claims of the present application are appended below.

(Appendix)
The invention according to claim 1 is
an imaging apparatus provided with imaging means for capturing a moving image, comprising:
sound collecting means for collecting monaural sound;
detection means for detecting the face of a subject in an image captured by the imaging means;
position specifying means for specifying, as a sound source position, the position in the captured image of the subject's face detected by the detection means while monaural sound is being collected by the sound collecting means; and
audio control means for converting the monaural sound collected by the sound collecting means into pseudo-stereo according to the sound source position in the captured image specified by the position specifying means.

The invention according to claim 2 is the imaging apparatus according to claim 1,
further comprising calculation means for calculating the ratio, from the left and right edges of the captured image, of the sound source position specified by the position specifying means,
wherein the audio control means converts the sound collected by the sound collecting means into pseudo-stereo by adjusting the volume difference between the left and right channels to match the ratio calculated by the calculation means.

The invention according to claim 3 is the imaging apparatus according to claim 1,
further comprising calculation means for calculating the ratio, from the left and right edges of the captured image, of the sound source position specified by the position specifying means,
wherein the audio control means converts the sound collected by the sound collecting means into pseudo-stereo by adjusting the phase difference between the left and right channels to match the ratio calculated by the calculation means.

The invention according to claim 4 is the imaging apparatus according to any one of claims 1 to 3,
wherein the detection means, when it detects the face of a subject, further detects the mouth of that face, and
the position specifying means specifies the position of the subject as the sound source position when the mouth detected by the detection means is moving.

The invention according to claim 5 is the imaging apparatus according to any one of claims 1 to 3,
wherein the detection means, when it detects the face of a subject, further detects the mouth of that face, and
the position specifying means, when the faces of a plurality of subjects are detected by the detection means, specifies the position of the subject whose mouth is moving as the sound source position.

The invention according to claim 6 is the imaging apparatus according to any one of claims 1 to 3,
wherein the detection means, when it detects the face of a subject, further detects the mouth of that face, and
the position specifying means, when the faces of a plurality of subjects are detected by the detection means and a plurality of those subjects have moving mouths, specifies the intermediate position of the plurality of subjects whose mouths are moving as the sound source position.

The invention according to claim 7 is the imaging apparatus according to any one of claims 1 to 3,
wherein the position specifying means, when the detection means can no longer detect the face of the subject, specifies the position immediately before detection of that face was lost as the sound source position.

The invention according to claim 8 is the imaging apparatus according to any one of claims 1 to 3,
wherein the detection means, when it detects the face of a subject, further detects the mouth of that face, and
the position specifying means, when the face of the subject is detected by the detection means and it is detected partway through capture that the mouth has stopped moving, specifies the position immediately before, at which the mouth was moving, as the sound source position.

The invention according to claim 9 is the imaging apparatus according to any one of claims 1 to 8,
further comprising voice extraction means for extracting only the subject's voice from the sound collected by the sound collecting means,
wherein the audio control means converts only the subject's voice extracted by the voice extraction means into pseudo-stereo.

The invention according to claim 10 is
a program for causing a computer to realize:
a function of capturing a moving image;
a function of collecting monaural sound;
a function of detecting the face of a subject in a captured image;
a function of specifying, as a sound source position, the position in the captured image of the detected face of the subject while the monaural sound is being collected; and
a function of converting the collected monaural sound into pseudo-stereo according to the specified sound source position in the captured image.

DESCRIPTION OF SYMBOLS
1 Control unit
3 Storage unit
4 Display unit
5 Operation unit
6 Imaging unit
7 Face detection unit
8 Monaural microphone
9 Sound collection unit
M1 Program storage unit
M2 Image storage unit
M3 Audio storage unit

Claims

An imaging apparatus provided with imaging means for capturing a moving image, comprising:
sound collecting means for collecting monaural sound;
detection means for detecting the face of a subject in an image captured by the imaging means;
position specifying means for specifying, as a sound source position, the position in the captured image of the subject's face detected by the detection means while monaural sound is being collected by the sound collecting means; and
audio control means for converting the monaural sound collected by the sound collecting means into pseudo-stereo according to the sound source position in the captured image specified by the position specifying means.

The imaging apparatus according to claim 1,
further comprising calculation means for calculating the ratio, from the left and right edges of the captured image, of the sound source position specified by the position specifying means,
wherein the audio control means converts the sound collected by the sound collecting means into pseudo-stereo by adjusting the volume difference between the left and right channels to match the ratio calculated by the calculation means.

The imaging apparatus according to claim 1,
further comprising calculation means for calculating the ratio, from the left and right edges of the captured image, of the sound source position specified by the position specifying means,
wherein the audio control means converts the sound collected by the sound collecting means into pseudo-stereo by adjusting the phase difference between the left and right channels to match the ratio calculated by the calculation means.

The imaging apparatus according to any one of claims 1 to 3,
wherein the detection means, when it detects the face of a subject, further detects the mouth of that face, and
the position specifying means specifies the position of the subject as the sound source position when the mouth detected by the detection means is moving.

The imaging apparatus according to any one of claims 1 to 3,
wherein the detection means, when it detects the face of a subject, further detects the mouth of that face, and
the position specifying means, when the faces of a plurality of subjects are detected by the detection means, specifies the position of the subject whose mouth is moving as the sound source position.

The imaging apparatus according to any one of claims 1 to 3,
wherein the detection means, when it detects the face of a subject, further detects the mouth of that face, and
the position specifying means, when the faces of a plurality of subjects are detected by the detection means and a plurality of those subjects have moving mouths, specifies the intermediate position of the plurality of subjects whose mouths are moving as the sound source position.

The imaging apparatus according to any one of claims 1 to 3,
wherein the position specifying means, when the detection means can no longer detect the face of the subject, specifies the position immediately before detection of that face was lost as the sound source position.

The imaging apparatus according to any one of claims 1 to 3,
wherein the detection means, when it detects the face of a subject, further detects the mouth of that face, and
the position specifying means, when the face of the subject is detected by the detection means and it is detected partway through capture that the mouth has stopped moving, specifies the position immediately before, at which the mouth was moving, as the sound source position.

The imaging apparatus according to any one of claims 1 to 8,
further comprising voice extraction means for extracting only the subject's voice from the sound collected by the sound collecting means,
wherein the audio control means converts only the subject's voice extracted by the voice extraction means into pseudo-stereo.

A program for causing a computer to realize:
a function of capturing a moving image;
a function of collecting monaural sound;
a function of detecting the face of a subject in a captured image;
a function of specifying, as a sound source position, the position in the captured image of the detected face of the subject while the monaural sound is being collected; and
a function of converting the collected monaural sound into pseudo-stereo according to the specified sound source position in the captured image.