JP2022135674A

JP2022135674A - Electronic apparatus, information processing apparatus, control method, learning method, and program

Info

Publication number: JP2022135674A
Application number: JP2021035631A
Authority: JP
Inventors: 拓人鈴木; Takuto Suzuki; 優大糸井; Yudai Itoi
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2022-09-15

Abstract

To appropriately discriminate a main object person and a person having a conversation with the main object person from a photographed image. SOLUTION: An electronic apparatus has: person detection means that detects persons from a picked-up image; main object selection means that selects a person to be a main object from the persons detected by the person detection means; person motion detection means that detects motions of the persons detected by the person detection means; person motion counting means that counts the number of times of the motions detected by the person motion detection means; and grouping means that, when the number of times of motions of a predetermined person detected within a predetermined first time from when the utterance of the main object person is detected reaches a predetermined number of times within the predetermined first time, groups the predetermined person as a person having a conversation with the main object. SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for recording the conversational voices of people during shooting.

When the conversational voices of people are recorded, for example during movie shooting, the voices may be recorded at a volume that is difficult to hear, depending on the shooting environment. For this reason, techniques are known that identify the persons who are conversing, separate their conversational voices, and emphasize those voices in the recording. As a technique for identifying a conversing person, Patent Document 1 proposes a method of recognizing the speaker from the characteristics of lip movements and of the voice. Patent Document 2 proposes a method of grouping the persons in an image based on the orientations and positions of two mutually adjacent faces.

JP 2010-192956 A
JP 2015-056701 A

However, with Patent Document 1, when conversational voices are recorded during shooting, the speech of conversing persons in whom the user has no interest is also recorded. The speech that interests the user is the conversational voice of the speaking person (the main target) and of the person who is the main target's conversation partner, yet the speech of persons other than these is recorded as well. With Patent Document 2, the conversing persons cannot be identified properly if the orientation of a face cannot be determined.

The present invention has been made in view of the above problems, and its object is to realize a technique capable of appropriately identifying, from a captured image, a main target person and a person conversing with the main target person.

In order to solve the above problems and achieve the object, an electronic device according to the present invention includes: person detection means for detecting persons from a captured image; main target selection means for selecting a person to be a main target from among the persons detected by the person detection means; person motion detection means for detecting motions of the persons detected by the person detection means; person motion counting means for counting the number of motions detected by the person motion detection means; and grouping means for grouping a predetermined person as a person conversing with the main target when the number of motions of the predetermined person, detected within a predetermined first time after an utterance of the main target person is detected, reaches a predetermined number within the predetermined first time.
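The grouping condition described above can be illustrated with a minimal Python sketch. It is not the implementation of this publication: the `MotionEvent` type, the function name, and the default values for the first time (3.0 seconds) and the predetermined count (2) are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MotionEvent:
    person_id: int   # hypothetical identifier assigned by person detection
    time_s: float    # time of the detected motion, in seconds


def group_conversation_partners(utterance_time_s, motion_events, main_id,
                                first_time_s=3.0, required_count=2):
    """Return the ids of persons grouped as conversing with the main target.

    A person is grouped when the number of their motions detected within
    first_time_s after the main target's utterance reaches required_count.
    Both thresholds are illustrative assumptions.
    """
    window_end = utterance_time_s + first_time_s
    counts = {}
    for ev in motion_events:
        if ev.person_id == main_id:
            continue  # only persons other than the main target are candidates
        if utterance_time_s <= ev.time_s <= window_end:
            counts[ev.person_id] = counts.get(ev.person_id, 0) + 1
    return {pid for pid, n in counts.items() if n >= required_count}
```

For example, if the main target speaks at 0.5 s and another person nods twice before 3.5 s, that person is grouped; a person who moves only once inside the window is not.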

According to the present invention, a main target person and a person conversing with the main target person can be appropriately identified from a captured image.

FIG. 1 is a block diagram showing the device configuration of the embodiment.
FIG. 2 is a block diagram showing the configurations of the imaging processing unit and the audio processing unit of the embodiment.
FIG. 3 is an explanatory diagram of the main target selection method of the embodiment.
FIG. 4 is an explanatory diagram of assumed scene 1 of Embodiment 1.
FIG. 5 is a flowchart showing the processing of assumed scene 1 and assumed scene 3 of Embodiment 1.
FIG. 6 is an explanatory diagram of assumed scene 2 of Embodiment 1.
FIG. 7 is a flowchart showing the processing of assumed scene 2 of Embodiment 1.
FIG. 8 is an explanatory diagram of assumed scene 3 of Embodiment 1.
FIG. 9 is an explanatory diagram of assumed scene 4 of Embodiment 1.
FIG. 10 is a flowchart showing the processing of assumed scene 4 of Embodiment 1.
FIG. 11 is a hardware configuration diagram of a server that executes machine learning in Embodiment 2.
FIG. 12 is a software configuration diagram for executing the estimation processing of the imaging device and the learning processing of the server in Embodiment 2.
FIG. 13 is a conceptual diagram of the trained model of Embodiment 2.
FIG. 14 is a flowchart showing the learning processing of Embodiment 2.
FIG. 15 is a flowchart showing the estimation processing of Embodiment 2.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. The following embodiments do not limit the invention according to the claims. Although multiple features are described in the embodiments, not all of these features are essential to the invention, and the features may be combined arbitrarily. Furthermore, in the accompanying drawings, the same or similar configurations are denoted by the same reference numerals, and redundant description is omitted.

[Embodiment 1]
An embodiment in which the electronic device of the present invention is applied to an imaging device such as a digital camera will be described below.

First, with reference to FIGS. 1 to 3, the configuration and functions of the audio processing device included in the imaging device of this embodiment will be described.

FIG. 1 is a block diagram showing the configuration of an imaging device 100 according to Embodiment 1.

The imaging unit 101 includes a lens unit that forms an imaging optical system, an image sensor that converts an optical image of a subject formed by the lens unit into an electrical signal, an A/D conversion unit, and the like. The imaging unit 101 converts an analog image signal generated by capturing an optical image of a subject into a digital signal, and outputs the digital signal to the image processing unit 102. The image processing unit 102 performs various kinds of image processing on the digital signal generated by the imaging unit 101 to generate image data. Note that the lens unit may be a lens built into the imaging device 100 or a detachable lens.

The audio input unit 103 collects sounds around the imaging device 100 using one or more microphones built into the imaging device 100 or connected via an audio terminal, and converts the sounds into electrical signals. The audio input unit 103 converts the analog audio signals generated by collecting the sounds around the imaging device 100 into digital signals and outputs the digital signals to the audio processing unit 104. The audio processing unit 104 performs various kinds of audio processing on the digital signals generated by the audio input unit 103 to generate audio data. The microphones may be directional or non-directional.

The memory 105 temporarily stores the image data generated by the image processing unit 102 and the audio data generated by the audio processing unit 104. The memory 105 is also used as a work memory that stores programs executed by the control unit 111, parameters, and the like.

The display control unit 106 displays the video of the image data generated by the image processing unit 102, the operation screen and menu screen of the imaging device 100, and the like on the display unit 107 or on an external display connected via a video terminal (not shown). The display unit 107 has a touch panel function; by detecting a touch operation by the user, it allows setting operations of the imaging device 100, menu screen operations, subject selection operations, and the like.

The encoding processing unit 108 reads the image data and audio data temporarily stored in the memory 105 and performs predetermined encoding to generate compressed image data, compressed audio data, and the like. Note that the audio data may be left uncompressed. The compressed image data may be compressed by any compression method, such as MPEG2 or H.264/MPEG4-AVC. Likewise, the compressed audio data may be compressed by a method such as AC3, AAC, ATRAC, or ADPCM.

The recording/reproducing unit 109 records various data, such as the compressed image data and the compressed audio data or audio data generated by the encoding processing unit 108, on the recording medium 110, and reads out various data recorded on the recording medium 110. The recording medium 110 may be any recording medium capable of recording image data, audio data, and the like, such as a magnetic disk, an optical disk, or a semiconductor memory.

The control unit 111 includes an arithmetic processing unit such as a CPU that executes various kinds of control of the imaging device 100 by controlling each of its components, a storage device such as a ROM that stores programs for executing the various kinds of control, and an external storage device connected to the imaging device 100.

The operation unit 112 consists of operation members, such as buttons and dials, that accept user operations, and outputs operation signals corresponding to the user operations to the control unit 111. In this embodiment, the operation unit 112 includes a power button for instructing power on/off, a shooting button for instructing the start/end of moving image recording, and a zoom lever for instructing an optical or electronic zoom operation on the image. The operation unit 112 also includes a mode changeover switch for switching operation modes, and a cross key and an enter key for making various settings.

The audio output unit 113 outputs the audio data or compressed audio data reproduced by the recording/reproducing unit 109, or the audio data output by the control unit 111, to the speaker 114, an audio terminal, or the like. The external output unit 115 outputs the compressed video data, compressed audio data, audio data, and the like reproduced by the recording/reproducing unit 109 to an external device. The data bus 116 transmits and receives various data, such as audio data, image data, and control signals, between the components of the imaging device 100.

Next, the functions of the imaging device 100 of this embodiment will be described.

In the imaging device 100 of this embodiment, when the user operates the power button included in the operation unit 112 to give an instruction to turn on the power, power is supplied from a power supply unit (not shown) to each component of the imaging device 100.

When power is supplied, the control unit 111 determines the operation mode from the operation signal of the mode changeover switch included in the operation unit 112. The operation modes include, for example, a still image shooting mode, a moving image recording mode, and a playback mode. In the moving image recording mode, the image data captured by the imaging unit 101 and generated by the image processing unit 102 and the audio data collected by the audio input unit 103 and generated by the audio processing unit 104 are recorded as one file. In the playback mode, the compressed image data recorded on the recording medium 110 is reproduced by the recording/reproducing unit 109 and displayed on the display unit 107.

In the moving image recording mode, the control unit 111 transmits a control signal to each component of the imaging device 100 to shift to the shooting standby state, and each component operates as follows. The image processing unit 102 performs various kinds of image processing on the digital image signal generated by the imaging unit 101 to generate image data, and the display control unit 106 displays the video of the image data (live view) on the display unit 107. The user can make shooting preparations, such as automatic focus adjustment (AF) processing and automatic exposure (AE) processing, while viewing the live view video. In addition, the audio processing unit 104 performs various kinds of audio processing on the digital audio signal generated by the audio input unit 103 to generate audio data, and the audio output unit 113 outputs the audio data from the speaker 114 or an earphone (not shown). The user can make recording preparations, such as volume adjustment, while listening to the audio.

Next, when the user operates the shooting button included in the operation unit 112 to give an instruction to start shooting, the control unit 111 transmits a control signal instructing the start of shooting to each component of the imaging device 100, and each component operates as follows. The image data generated by the imaging unit 101 and the image processing unit 102 is temporarily stored in the memory 105. The audio data (when there is one microphone) or multi-channel audio data (when there are multiple microphones) generated by the audio input unit 103 and the audio processing unit 104 is temporarily stored in the memory 105.

The encoding processing unit 108 reads the image data and audio data temporarily stored in the memory 105 and performs predetermined encoding processing to generate compressed image data, compressed audio data, and the like. The control unit 111 combines the compressed image data and compressed audio data generated by the encoding processing unit 108, forms a data stream, and outputs the data stream to the recording/reproducing unit 109. When the audio data is not compressed, the control unit 111 combines the audio data stored in the memory 105 with the compressed image data, forms a data stream, and outputs the data stream to the recording/reproducing unit 109.

The recording/reproducing unit 109 writes the data stream to the recording medium 110 as one moving image file based on a predetermined file system such as UDF or FAT. The control unit 111 continues these operations during shooting.

Next, when the user operates the shooting button included in the operation unit 112 to give an instruction to end shooting, the control unit 111 transmits a control signal instructing the end of shooting to each component of the imaging device 100, and each component operates as follows.

The imaging unit 101, the image processing unit 102, the audio input unit 103, and the audio processing unit 104 stop generating image data and audio data. The encoding processing unit 108 reads the remaining image data and audio data stored in the memory 105, performs predetermined encoding processing, and stops processing after generating the compressed image data, compressed audio data, and the like. When the audio data is not compressed, the encoding processing unit 108 stops processing after the generation of the compressed image data is completed. The control unit 111 combines the last generated compressed image data with the compressed audio data or audio data, forms a data stream, and outputs the data stream to the recording/reproducing unit 109.

The recording/reproducing unit 109 writes the data stream to the recording medium 110 as one moving image file based on a predetermined file system such as UDF or FAT. When the supply of the data stream stops, it completes the moving image file and stops processing.

When the recording/reproducing unit 109 stops processing, the control unit 111 transmits a control signal to each component of the imaging device 100 to shift to the shooting standby state, and the imaging device 100 returns to the shooting standby state.

Next, when the user operates the mode changeover switch included in the operation unit 112 to give an instruction to switch to the playback mode, the control unit 111 transmits a control signal instructing a transition to the playback state to each component of the imaging device 100, and each component operates as follows. The recording/reproducing unit 109 reads a moving image file composed of compressed image data and compressed audio data recorded on the recording medium 110 and transmits it to the encoding processing unit 108. The encoding processing unit 108 decodes the compressed image data and compressed audio data received from the recording/reproducing unit 109 and transmits the decoded data to the display control unit 106 and the audio output unit 113. The display control unit 106 displays the decoded image data on the display unit 107. The audio output unit 113 outputs the decoded audio data from a built-in speaker or from an external speaker connected to the audio terminal.

In this embodiment, when the audio input unit 103 or the audio processing unit 104 generates an audio signal or audio data, level adjustment processing is performed on the audio signal collected by the microphone. The level adjustment processing may be performed at all times after the imaging device 100 starts up, after a shooting mode is selected, or after a mode related to audio recording is selected. Alternatively, in a mode related to audio recording, it may be performed in response to the start of audio recording. In this embodiment, the processing is performed at the timing when moving image shooting starts.

As described above, the imaging device 100 of this embodiment can record and reproduce images and audio.

Next, the detailed configurations and functions of the imaging unit 101, the image processing unit 102, the audio input unit 103, and the audio processing unit 104 will be described with reference to FIG. 2.

The imaging unit 101 includes a lens unit 201 that forms an imaging optical system, and an image sensor 202 consisting of a photoelectric conversion element, such as a CCD or CMOS sensor, that converts the optical image of a subject captured by the lens unit 201 into an electrical signal. The imaging unit 101 also includes a lens control unit 203 having a position sensor and a driving mechanism, such as a motor, for driving the lens unit 201. In this embodiment, the imaging unit 101 incorporates the lens unit 201 and the lens control unit 203, but the lens may instead be detachable from the imaging device 100. For example, when the user operates the operation unit 112 to input an instruction such as a zoom operation or focus adjustment, the control unit 111 transmits a control signal (drive signal) for moving the lens to the lens control unit 203. In response to the control signal, the lens control unit 203 checks the position of the lens unit 201 with the position sensor and moves the lens unit 201 with the motor or the like.

The image processing unit 102 includes an image adjustment unit 221 that performs various kinds of image quality adjustment processing on the image signal generated by the image sensor 202. The image data that has undergone the image adjustment processing is transmitted to the memory 105 via the data bus 116. The control unit 111 performs AF processing, AE processing, and the like using the image data that has undergone image quality adjustment processing by the image adjustment unit 221. The image processing unit 102 also includes the various functional units described below. The person detection unit 222 extracts feature points of a person's face, such as the eyes, nose, and mouth, from the image data that has undergone image quality adjustment processing by the image adjustment unit 221, and detects the position of the person, the size of the face, and the like in the image data. The person detection unit 222 also detects utterances; movements of the body during an utterance, such as of the lips, head, and hands; movements of the line of sight and the face; affirmative motions, in which the face is shaken vertically or nods; and negative motions, in which the face is twisted or shaken sideways. The person motion detection unit 223 detects changes in facial expression, such as expressions of joy, anger, sorrow, and pleasure. The person motion counting unit 224 counts the number of motions detected by the person motion detection unit 223. The main target selection unit 225 selects the person to be the main target from among the persons detected by the person detection unit 222. The selection by the main target selection unit 225 is performed based on conditions determined by the control unit 111. The method by which the main target selection unit 225 selects the main target will be described later.

Furthermore, the image processing unit 102 has a conversation grouping unit 226. From among the persons detected by the person detection unit 222, the conversation grouping unit 226 determines the persons who are conversing with the main target person selected by the main target selection unit 225, and groups them as the conversation group of the main target. The details of the processing by the conversation grouping unit 226 will be described later.

Next, the audio input unit 103 and the audio processing unit 104 will be described.

The audio input unit 103 includes a microphone 211 that converts sound vibrations into electrical signals and outputs them as audio signals. In this embodiment, the microphone 211 is a stereo type configured with two channels, left and right (Lch/Rch), but it may be a single-channel monaural type or a configuration having multiple microphones with two or more channels. The A/D conversion unit 212 converts the analog audio signal generated by the microphone 211 into a digital audio signal.

The audio processing unit 104 is a circuit that performs various kinds of audio processing on the audio signal converted by the audio input unit 103. In this embodiment, the audio processing unit 104 has an audio extraction unit 213, an audio adjustment unit 215, and an audio synthesis unit 217. The audio extraction unit 213 can separate and extract the voices of persons and other sounds (non-person sounds). Furthermore, the person voice extraction unit 214 can separate the voices of multiple persons into per-person voices based on information from the person detection unit 222. The audio adjustment unit 215 can individually apply audio processing, such as level adjustment and equalization, to the sounds separated and extracted by the audio extraction unit 213. Based on the determination result of the conversation grouping unit 226, the conversation voice adjustment unit 216 performs adjustment processing that emphasizes a separated and extracted voice to make it easier to hear, or attenuates it to make it less prominent. The details of the adjustment processing by the conversation voice adjustment unit 216 will be described later. The audio synthesis unit 217 combines the individual separated and extracted sounds adjusted by the audio adjustment unit 215 back into a single audio signal. The ALC (auto level controller) 219 adjusts the amplitude of the audio signal combined by the audio synthesis unit 217 to a predetermined level.
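The flow through the audio processing unit 104 (per-person adjustment, synthesis, then level control) can be sketched as follows. This is only an illustration under stated assumptions: source separation is taken as already done, tracks are plain lists of equal length, the gain values are hypothetical, and the ALC 219 is stood in for by simple peak normalization, which the publication does not specify.

```python
def adjust_and_mix(person_tracks, non_person_track, group_ids,
                   emphasize_gain=2.0, suppress_gain=0.5, target_peak=1.0):
    """Mix separated tracks into one signal.

    person_tracks: {person_id: [samples]} already separated per person.
    non_person_track: [samples] of everything that is not a person's voice.
    group_ids: ids in the main target's conversation group (emphasized);
               all other persons are attenuated.
    All gains and the peak-normalization step are illustrative assumptions.
    """
    mixed = list(non_person_track)
    for pid, samples in person_tracks.items():
        gain = emphasize_gain if pid in group_ids else suppress_gain
        for i, s in enumerate(samples):
            mixed[i] += gain * s  # adjust each voice, then sum (synthesis)
    # Stand-in for the ALC: scale the mix so its peak equals target_peak.
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > 0.0:
        scale = target_peak / peak
        mixed = [s * scale for s in mixed]
    return mixed
```

Keeping the per-person gain separate from the final level control mirrors the order in the text: the conversation group decides relative loudness, and the last stage only sets the overall amplitude.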

With the above configuration, the audio processing unit 104 performs predetermined processing on the audio signal and transmits the generated audio data to the memory 105.

In addition, as described below, the main target selection unit 225 can use the image data and audio data temporarily stored in the memory 105, which makes it possible to select the target person for audio processing from a series of recorded image information.

Next, the method by which the main target selection unit 225 selects the main target will be described with reference to FIG. 3.

Since the main target in this embodiment is the person on whom the user's attention is focused, various selection conditions are conceivable. For example, in FIG. 3(a), the conversing person 301 corresponds to the subject on which the imaging device 100 has placed the AF frame 302. Because the AF frame 302 coincides with the conversing person 301, the imaging device 100 determines the subject on which the AF frame 302 is placed to be the main subject. The main subject can be selected in this way.
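One way to picture the AF-frame-based selection of FIG. 3(a) is as a rectangle overlap test. The sketch below is an assumption, not the method of the publication: it represents the AF frame and each detected face as hypothetical `(x, y, width, height)` boxes and picks the person whose face box overlaps the AF frame the most.

```python
def overlap_area(a, b):
    """Area of intersection of two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return w * h if w > 0 and h > 0 else 0


def select_main_by_af_frame(face_boxes, af_frame):
    """Return the id of the person whose face best matches the AF frame.

    face_boxes: {person_id: (x, y, w, h)} from person detection (assumed format).
    af_frame: (x, y, w, h) of the current AF frame (assumed format).
    Returns None when no face overlaps the AF frame.
    """
    best_id, best_area = None, 0
    for pid, box in face_boxes.items():
        area = overlap_area(box, af_frame)
        if area > best_area:
            best_id, best_area = pid, area
    return best_id
```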

また、図３（ｂ）のように事前に登録された顔画像を用いる方法もある。顔を縦に振る又は頷く肯定動作や顔を横に捻る又は振る否定動作が顔画像３０３として事前にメモリ１０５に登録されており、その顔画像３０３と一致する人物を主対象と選定することもできる。 There is also a method that uses a pre-registered face image, as shown in FIG. 3(b). Affirmative motions, such as shaking the head vertically or nodding, and negative motions, such as twisting or shaking the head sideways, are registered in advance in the memory 105 as a face image 303, and a person who matches the face image 303 can be selected as the main target.

また、図３（ｃ）のようにユーザの意思により主対象を決定する方法もある。表示部１０７がタッチパネル機能を有する場合、表示されている画像中の所望の人物にユーザがタッチ操作３０４を行うことで主対象を選択することもできる。 There is also a method of determining the main object according to the user's intention, as shown in FIG. 3(c). When the display unit 107 has a touch panel function, the user can also select a main target by performing a touch operation 304 on a desired person in the displayed image.

また、図３（ｄ）のように記録済みの動画データから主対象を選定する方法もある。例えば、記録済みの動画データ３０６の中で登場する頻度やフォーカスの対象となった頻度が最も高い人物３０５を主対象として選定することができる。 There is also a method of selecting a main target from recorded moving image data as shown in FIG. 3(d). For example, the person 305 who appears most frequently in the recorded moving image data 306 and has the highest frequency of focus can be selected as the main subject.
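
The frequency-based selection of FIG. 3(d) can be illustrated with a minimal Python sketch. The input format (one list of person IDs per recorded frame) and the function name `pick_main_target` are assumptions made for illustration; the description does not specify how appearances or focus events are stored.

```python
from collections import Counter


def pick_main_target(frames):
    """Return the person ID that appears in the most frames, or None.

    `frames` is a hypothetical representation of recorded video data:
    one iterable of person IDs per frame (or per focus event).
    """
    # Count each person at most once per frame.
    counts = Counter(pid for frame in frames for pid in set(frame))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


recorded = [["p1", "p2"], ["p1"], ["p3", "p1"], ["p2"]]
main_target = pick_main_target(recorded)
```

The same counting could be applied to focus events instead of appearances, since the description allows either criterion.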

＜グルーピング処理＞次に、図４から図１０を参照して、本実施形態の画像処理部１０２によるグルーピング処理について説明する。 <Grouping Processing> Next, the grouping processing by the image processing unit 102 of this embodiment will be described with reference to FIGS. 4 to 10.

なお、図５、図７および図１０の処理は、制御部１１１のＣＰＵなどがＲＯＭなどの記憶デバイスに格納されているプログラムを実行し、図２に示した画像処理部１０２の各構成要素を制御することにより実現される。 Note that the processing of FIGS. 5, 7, and 10 is realized by the CPU or the like of the control unit 111 executing a program stored in a storage device such as a ROM and controlling each component of the image processing unit 102 shown in FIG. 2.

＜想定シーン１＞まず、図４および図５を参照して、本実施形態の想定シーン１における処理について説明する。 <Assumed Scene 1> First, the processing in assumed scene 1 of this embodiment will be described with reference to FIGS. 4 and 5.

図４は、本実施形態の想定シーン１の説明図である。図５は、想定シーン１の処理を示すフローチャートである。以下では、図４に例示する想定シーン１に沿って図５の処理の流れを説明する。 FIG. 4 is an explanatory diagram of assumed scene 1 of this embodiment. FIG. 5 is a flowchart showing the processing of assumed scene 1. The flow of processing in FIG. 5 is described below along assumed scene 1 illustrated in FIG. 4.

図４（ａ）は、人物Ａ４０１、人物Ｂ４０２、人物Ｃ４０３が第１のグループとして会話しており、人物Ｄ４０４、人物Ｅ４０５が第２のグループとして会話している状態を例示している。図４（ａ）では、ユーザが主対象としたい人物が人物Ａ４０１であるとする。 FIG. 4A illustrates a state in which person A401, person B402, and person C403 are having a conversation as a first group, and person D404 and person E405 are having a conversation as a second group. In FIG. 4A, it is assumed that the person whom the user wants to be the main target is person A401.

Ｓ５０１：主対象の選定
主対象選定部２２５は、上述した主対象選定方法により人物Ａ４０１を主対象として選定する。ユーザの意図としては、主に、フォーカスを合わせたい人物を主対象として選択する。 S501: Selection of Main Subject The main subject selection unit 225 selects the person A401 as the main subject by the main subject selection method described above. The user's intention is to mainly select a person to be focused on as the main object.

Ｓ５０２：主対象の動作検出
主対象として人物Ａ４０１が選択されている状態で、図４（ｂ）のように主対象である人物Ａ４０１が発話すると、人物動作検出部２２３は、主対象である人物Ａ４０１が発話したことを検出する。 S502: Detection of motion of the main target. With person A401 selected as the main target, when person A401 speaks as shown in FIG. 4(b), the human motion detection unit 223 detects that the main target, person A401, has spoken.

Ｓ５０３：主対象以外の人物の動作検出
人物動作検出部２２３は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、画像中の主対象以外の人物の動作が検出されたか否かを判定する。例えば、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、図４（ｃ）のように人物Ｂ４０２が主対象である人物Ａ４０１に視線や顔を向けると、人物動作検出部２２３は、人物Ｂ４０２が主対象である人物Ａ４０１に視線や顔を向けたことを検出する。 S503: Detection of motion of a person other than the main target. The human motion detection unit 223 determines whether a motion of a person other than the main target in the image has been detected within a predetermined first time after the main target, person A401, spoke. For example, if person B402 turns their gaze or face toward person A401 within the predetermined first time after person A401 spoke, as shown in FIG. 4(c), the human motion detection unit 223 detects that person B402 has turned their gaze or face toward the main target, person A401.

人物動作検出部２２３は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、画像中の人物が動作を行ったと判定した場合は、処理はＳ５０４に進む。また、人物動作検出部２２３は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、画像中の人物が動作を行っていないと判定した場合は、処理はＳ５０２に戻る。 If the human motion detection unit 223 determines that the person in the image has performed the motion within the predetermined first time period after the person A401, who is the main target, has spoken, the process proceeds to S504. If the human motion detection unit 223 determines that the person in the image does not move within the predetermined first period of time after the main target person A401 speaks, the process returns to S502. .

Ｓ５０４、Ｓ５０５：人物の特定、動作回数のカウント
人物動作検出部２２３は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に主対象である人物Ａ４０１に視線や顔を向けた人物として人物Ｂ４０２を特定する。そして、人物動作カウント部２２４は、人物Ｂ４０２の動作回数のカウントを１にして所定の第１の時間カウントを開始する。 S504, S505: Identification of the person and counting of motions. The human motion detection unit 223 identifies person B402 as the person who turned their gaze or face toward the main target, person A401, within the predetermined first time after person A401 spoke. The human motion counting unit 224 then sets the motion count of person B402 to 1 and starts timing the predetermined first time.

Ｓ５０６～Ｓ５０８：主対象以外の人物の動作回数の判定、グルーピング
その後、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、図４（ｄ）のように人物Ｂ４０２が、例えば肯定動作をした場合、人物動作検出部２２３は、人物Ｂ４０２が肯定動作をしたと判定する。すると、人物検出部２２２は肯定動作をしたのが、Ｓ５０３で検出された人物と同一人物である人物Ｂ４０２であることを検出し、人物動作カウント部２２４は、人物Ｂ４０２の動作回数のカウント数を２にアップする。 S506 to S508: Determination of the motion count of a person other than the main target, and grouping. Thereafter, if person B402 makes, for example, an affirmative motion within the predetermined first time after person A401 spoke, as shown in FIG. 4(d), the human motion detection unit 223 determines that person B402 has made an affirmative motion. The person detection unit 222 then detects that the person who made the affirmative motion is person B402, the same person detected in S503, and the human motion counting unit 224 increments the motion count of person B402 to 2.

人物動作カウント部２２４は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に動作をした人物Ｂ４０２の動作回数のカウント数が所定回数（例えば、３回）に達したか否かを判定する。人物動作カウント部２２４は、人物Ｂ４０２の動作回数のカウント数が所定回数に達したと判定した場合（Ｓ５０６でＹＥＳ）、処理はＳ５０７に進む。また、人物動作カウント部２２４は、人物Ｂ４０２の動作回数のカウント数が所定回数に達していないと判定した場合（Ｓ５０６でＮＯ）、処理はＳ５０２に戻り、人物Ｂ４０２の動作検出を継続する。 The human motion counting unit 224 determines whether the motion count of person B402, who moved within the predetermined first time after the main target, person A401, spoke, has reached a predetermined number (for example, three). If the human motion counting unit 224 determines that the motion count of person B402 has reached the predetermined number (YES in S506), the process proceeds to S507. If it determines that the motion count of person B402 has not reached the predetermined number (NO in S506), the process returns to S502 and motion detection for person B402 continues.

Ｓ５０２に戻った場合、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、図４（ｅ）のように人物Ｂ４０２が発話すると、人物動作検出部２２３は、人物Ｂ４０２が発話したことを検出し（Ｓ５０３でＹＥＳ）、Ｓ５０４で人物Ｂ４０２を特定し、Ｓ５０５で、人物動作カウント部２２４は人物Ｂ４０２の動作回数のカウント数を３にアップする。 When the process returns to S502, if person B402 speaks within the predetermined first time after person A401 spoke, as shown in FIG. 4(e), the human motion detection unit 223 detects that person B402 has spoken (YES in S503) and identifies person B402 in S504, and in S505 the human motion counting unit 224 increments the motion count of person B402 to 3.

その後、Ｓ５０６で、人物動作カウント部２２４は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に動作をした人物Ｂ４０２の動作回数のカウント数が所定回数に達したか否かを判定する。人物動作カウント部２２４は、人物Ｂ４０２の動作回数のカウント数が所定回数に達したと判定した場合（Ｓ５０６でＹＥＳ）、処理はＳ５０７に進む。 Then, in S506, the human motion counting unit 224 determines whether the motion count of person B402, who moved within the predetermined first time after the main target, person A401, spoke, has reached the predetermined number. If it determines that the motion count of person B402 has reached the predetermined number (YES in S506), the process proceeds to S507.

Ｓ５０７では、会話グルーピング部２２６は、人物Ｂ４０２を、主対象である人物Ａ４０１と会話をしている人物と判定し、Ｓ５０８では、人物Ｂ４０２を、主対象である人物Ａ４０１の第１のグループにグルーピングを行う。 In S507, the conversation grouping unit 226 determines that person B402 is a person conversing with the main target, person A401, and in S508 it groups person B402 into the first group of the main target, person A401.

以上のようにして、主対象である人物Ａ４０１と会話をしている人物に対してグルーピングを行う。 As described above, the grouping is performed for the person who is having a conversation with the person A401 who is the main target.
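
The flow of S501 to S508 can be summarized in a minimal Python sketch. It assumes that detected motions are available as `(time_in_seconds, person, kind)` tuples, that every detected motion of another person counts toward the threshold, and that counting restarts for each utterance of the main target; the function name, the event format, and the 5-second window are hypothetical and are not specified in the description.

```python
from collections import Counter


def group_with_main(events, main, first_time=5.0, required=3):
    """Return the set of persons grouped with `main`.

    For each utterance of the main target (S502), count the motions of
    every other person detected within `first_time` seconds (S503-S505)
    and group anyone whose count reaches `required` (S506-S508).
    """
    group = {main}
    for t0, speaker, kind in events:
        if speaker != main or kind != "speech":
            continue
        counts = Counter(
            person
            for t, person, _ in events
            if person != main and t0 < t <= t0 + first_time
        )
        group.update(p for p, n in counts.items() if n >= required)
    return group


events = [
    (0.0, "A", "speech"),  # main target speaks (FIG. 4(b))
    (1.0, "B", "gaze"),    # B turns toward A (FIG. 4(c)), count 1
    (2.0, "B", "nod"),     # B nods (FIG. 4(d)), count 2
    (2.5, "D", "speech"),  # D speaks once, count 1 only
    (3.0, "B", "speech"),  # B speaks (FIG. 4(e)), count 3
]
group = group_with_main(events, main="A")
```

In this sketch person D is not grouped because only one motion falls inside the window, which mirrors the requirement that the count reach the predetermined number.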

＜想定シーン２＞次に、図６および図７を参照して、本実施形態の想定シーン２における処理について説明する。 <Assumed Scene 2> Next, the processing in assumed scene 2 of this embodiment will be described with reference to FIGS. 6 and 7.

図６は、本実施形態の想定シーン２の説明図である。図７は、想定シーン２の処理を示すフローチャートである。以下では、図６に例示する想定シーン２に沿って図７の処理の流れを説明する。 FIG. 6 is an explanatory diagram of an assumed scene 2 of this embodiment. FIG. 7 is a flow chart showing the processing of the assumed scene 2. As shown in FIG. Below, the flow of processing in FIG. 7 will be described along with assumed scene 2 illustrated in FIG.

図６（ａ）は、図４（ａ）と同様に、人物Ａ４０１、人物Ｂ４０２、人物Ｃ４０３が第１のグループとして会話しており、人物Ｄ４０４、人物Ｅ４０５が第２のグループとして会話している状態を例示している。そして、主対象である人物Ａ４０１と人物Ｂ４０２は既にグルーピングされているとする。 As in FIG. 4(a), FIG. 6(a) illustrates a state in which person A401, person B402, and person C403 are conversing as a first group, and person D404 and person E405 are conversing as a second group. It is assumed that the main target, person A401, and person B402 have already been grouped.

Ｓ７０１：主対象とグルーピング済みの人物を検出
会話グルーピング部２２６は、図６（ａ）のように主対象である人物Ａ４０１とグルーピング済みの人物Ｂ４０２を検出する。 S701: Detect main target and grouped person The conversation grouping unit 226 detects the main target, person A401, and grouped person B402, as shown in FIG. 6(a).

Ｓ７０２：主対象のグループ内の主対象以外の所定の人物の動作検出
図６（ｂ）のように、主対象である人物Ａ４０１とグルーピング済の、主対象ではない人物Ｂ４０２が発話すると、人物動作検出部２２３は、主対象である人物Ａ４０１とグルーピング済の、主対象ではない人物Ｂ４０２が発話したことを検出する。 S702: Detection of motion of a predetermined person, other than the main target, in the main target's group. As shown in FIG. 6(b), when person B402, who is not the main target but has been grouped with the main target, person A401, speaks, the human motion detection unit 223 detects that person B402 has spoken.

Ｓ７０３：主対象である人物以外および主対象のグループ内の所定の人物以外の人物の動作検出、動作回数のカウント
図６（ｃ）のように、主対象である人物Ａ４０１とグルーピング済の、主対象ではない人物Ｂ４０２が発話してから所定の第１の時間以内に、主対象である人物Ａ４０１のグループではない人物Ｃ４０３が、人物Ｂ４０２に視線や顔を向けると、人物動作検出部２２３は、人物Ｃ４０３が主対象である人物Ａ４０１と、主対象ではない人物Ｂ４０２に視線や顔を向けたことを検出する。 S703: Detection and counting of motions of a person who is neither the main target nor the predetermined person in the main target's group. As shown in FIG. 6(c), if person C403, who is not in the group of the main target, person A401, turns their gaze or face toward person B402 within the predetermined first time after person B402 (grouped with person A401 but not the main target) spoke, the human motion detection unit 223 detects that person C403 has turned their gaze or face toward the main target, person A401, and toward person B402, who is not the main target.

Ｓ７０４、Ｓ７０５：人物の特定、動作回数のカウント
Ｓ７０４で、人物動作検出部２２３は、主対象である人物Ａ４０１と、主対象ではない人物Ｂ４０２に視線や顔を向けた人物として人物Ｃ４０３を特定する。Ｓ７０５で、人物動作カウント部２２４は、人物Ｃ４０３の動作回数のカウントを１にして所定の第１の時間カウントを開始する。 S704, S705: Identifying Persons, Counting the Number of Actions In S704, the human movement detection unit 223 identifies the person C403 as the person who directed his gaze and face to the person A401 who is the main target and the person B402 who is not the main target. . In S705, the human motion counting unit 224 counts the number of motions of the human C403 to 1 and starts counting a predetermined first time.

Ｓ７０６～Ｓ７０８：同一人物の動作回数の判定、グルーピング
図６（ｄ）のように、主対象である人物Ａ４０１とグルーピング済の、主対象ではない人物Ｂ４０２が発話してから所定の第１の時間以内に、人物Ｃ４０３が、例えば肯定動作を行った場合、人物動作検出部２２３は、人物Ｃ４０３が肯定動作をしたと判定する。人物検出部２２２は、肯定動作をしたのが、Ｓ７０３で検出された人物と同一人物である人物Ｃ４０３であることを検出し、人物動作カウント部２２４は人物Ｃ４０３の動作回数のカウント数を２にアップする。 S706 to S708: Determination of the motion count of the same person, and grouping. As shown in FIG. 6(d), if person C403 makes, for example, an affirmative motion within the predetermined first time after person B402 (grouped with the main target, person A401, but not the main target) spoke, the human motion detection unit 223 determines that person C403 has made an affirmative motion. The person detection unit 222 detects that the person who made the affirmative motion is person C403, the same person detected in S703, and the human motion counting unit 224 increments the motion count of person C403 to 2.

人物動作カウント部２２４は、主対象である人物Ａ４０１と、主対象ではない人物Ｂ４０２が発話してから所定の第１の時間以内に動作をした人物Ｃ４０３の動作回数のカウント数が所定回数（例えば、３回）に達したか否かを判定する。人物動作カウント部２２４は、人物Ｃ４０３の動作回数のカウント数が所定回数に達していないと判定した場合（Ｓ７０６でＮＯ）、処理はＳ７０２に戻り、人物Ｃ４０３の動作検出を継続する。 The human motion counting unit 224 determines whether the motion count of person C403, who moved within the predetermined first time after the main target, person A401, or person B402, who is not the main target, spoke, has reached a predetermined number (for example, three). If it determines that the motion count of person C403 has not reached the predetermined number (NO in S706), the process returns to S702 and motion detection for person C403 continues.

Ｓ７０２に戻った場合、主対象である人物Ａ４０１とグルーピング済の、主対象ではない人物Ｂ４０２が発話してから所定の第１の時間以内に、図６（ｅ）のように主対象のグループではない人物Ｃ４０３が、例えば発話した場合、Ｓ７０４で、人物動作検出部２２３は、人物Ｃ４０３が発話したことを検出し、Ｓ７０５で、人物動作カウント部２２４は、人物Ｃ４０３の動作回数のカウント数を３にアップする。 When the process returns to S702, if person C403, who is not in the main target's group, speaks, for example, within the predetermined first time after person B402 (grouped with the main target, person A401, but not the main target) spoke, as shown in FIG. 6(e), the human motion detection unit 223 detects in S704 that person C403 has spoken, and in S705 the human motion counting unit 224 increments the motion count of person C403 to 3.

人物動作カウント部２２４は、主対象である人物Ａ４０１とグルーピング済の、主対象ではない人物Ｂ４０２が発話してから所定の第１の時間以内に動作をした人物Ｃ４０３の動作回数のカウント数が所定回数に達したか否かを判定する。人物動作カウント部２２４は、人物Ｃ４０３の動作回数のカウント数が所定回数に達したと判定した場合（Ｓ７０６でＹＥＳ）、処理はＳ７０７に進む。Ｓ７０７では、会話グルーピング部２２６は、人物Ｃ４０３を、主対象である人物Ａ４０１とグルーピング済の、主対象ではない人物Ｂ４０２と会話をしている人物と判定する。Ｓ７０８で、会話グルーピング部２２６は、人物Ｃ４０３を、主対象である人物Ａ４０１の第１のグループにグルーピングを行う。 The human motion counting unit 224 determines whether the motion count of person C403, who moved within the predetermined first time after person B402 (grouped with the main target, person A401, but not the main target) spoke, has reached the predetermined number. If it determines that the motion count of person C403 has reached the predetermined number (YES in S706), the process proceeds to S707. In S707, the conversation grouping unit 226 determines that person C403 is a person conversing with person B402, who is not the main target but has been grouped with the main target, person A401. In S708, the conversation grouping unit 226 groups person C403 into the first group of the main target, person A401.

以上のようにして、主対象である人物Ａ４０１とグルーピング済の、主対象ではない人物Ｂ４０２と会話をしている人物に対してグルーピングを行う。 In this way, a person conversing with person B402, who is not the main target but has been grouped with the main target, person A401, is also grouped.
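
Assumed scene 2 extends the trigger from the main target's utterance to an utterance by any person already in the group. The following Python sketch illustrates that extension under stated assumptions: events are hypothetical `(time_in_seconds, person, kind)` tuples, the window length and threshold are placeholder values, and the repeat-until-stable loop is one possible way to let newly grouped persons act as speakers in turn, not something the description prescribes.

```python
from collections import Counter


def expand_group(events, group, first_time=5.0, required=3):
    """Grow `group` with persons who react to any group member's speech.

    An utterance by a grouped person (S702) opens a window of
    `first_time` seconds; a non-member whose motion count in that
    window reaches `required` (S703-S706) joins the group (S707, S708).
    """
    group = set(group)
    changed = True
    while changed:  # repeat so that new members can trigger windows too
        changed = False
        for t0, speaker, kind in events:
            if kind != "speech" or speaker not in group:
                continue
            counts = Counter(
                person
                for t, person, _ in events
                if person not in group and t0 < t <= t0 + first_time
            )
            joined = {p for p, n in counts.items() if n >= required}
            if joined:
                group |= joined
                changed = True
    return group


events = [
    (10.0, "B", "speech"),  # grouped non-main person speaks (FIG. 6(b))
    (11.0, "C", "gaze"),    # FIG. 6(c), count 1
    (12.0, "C", "nod"),     # FIG. 6(d), count 2
    (12.5, "D", "speech"),  # unrelated speech in the second group
    (13.0, "C", "speech"),  # FIG. 6(e), count 3
]
result = expand_group(events, group={"A", "B"})
```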

＜想定シーン３＞次に、図５および図８を参照して、本実施形態の想定シーン３における処理について説明する。 <Assumed Scene 3> Next, the processing in assumed scene 3 of this embodiment will be described with reference to FIGS. 5 and 8.

図８は、本実施形態の想定シーン３の説明図である。本実施形態の想定シーン３の処理フローは図５と同様である。以下では、図８に例示する想定シーン３に沿って図５の処理の流れを説明する。 FIG. 8 is an explanatory diagram of assumed scene 3 of this embodiment. The processing flow of assumed scene 3 of this embodiment is the same as that of FIG. 5. The flow of processing in FIG. 5 is described below along assumed scene 3 illustrated in FIG. 8.

図８（ａ）は、人物Ａ４０１、人物Ｂ４０２、人物Ｃ４０３が第１のグループとして会話しており、人物Ｄ４０４、人物Ｅ４０５が第２のグループとして会話している。ユーザが主対象としたい人物は、人物Ａ４０１とする。 In FIG. 8A, person A401, person B402, and person C403 are having a conversation as a first group, and person D404 and person E405 are having a conversation as a second group. The person whom the user wants to be the main target is person A401.

Ｓ５０１：主対象の選定
主対象選定部２２５は、人物Ａ４０１を主対象として選定する。ユーザの意図としては、主に、フォーカスを合わせたい人物を主対象として選択する。 S501: Selection of Main Subject The main subject selection unit 225 selects the person A401 as the main subject. The user's intention is to mainly select a person to be focused on as a main target.

Ｓ５０２：主対象の動作検出
主対象として人物Ａ４０１が選定されている状態で、図８（ｂ）のように主対象である人物Ａ４０１が発話すると、人物動作検出部２２３は、主対象である人物Ａ４０１が発話したことを検出する。 S502: Detection of Action of Main Subject When the person A401, who is the main subject, speaks in a state in which the person A401 is selected as the main subject, as shown in FIG. Detects that A401 has spoken.

Ｓ５０３：主対象以外の人物の動作検出
人物動作検出部２２３は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、画像中の主対象以外の人物が動作を行ったか否かを判定する。例えば、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、図８（ｃ）のように、主対象である人物Ａ４０１と会話をしていない人物Ｄ４０４が発話した場合、人物動作検出部２２３は、人物Ｄ４０４が発話したことを検出する。 S503: Detection of motion of a person other than the main target. The human motion detection unit 223 determines whether a person other than the main target in the image has moved within the predetermined first time after the main target, person A401, spoke. For example, if person D404, who is not conversing with person A401, speaks within the predetermined first time after person A401 spoke, as shown in FIG. 8(c), the human motion detection unit 223 detects that person D404 has spoken.

人物動作検出部２２３は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、主対象である人物Ａ４０１と会話をしていない人物Ｄ４０４が発話をしたと判定した場合は、処理はＳ５０４に進む。また、人物動作検出部２２３は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、主対象である人物Ａ４０１と会話をしていない人物Ｄ４０４が発話をしていないと判定した場合は、処理はＳ５０２に戻る。 If the human motion detection unit 223 determines that the person D404 who is not speaking with the main target person A401 speaks within a predetermined first time after the main target person A401 speaks, , the process proceeds to S504. In addition, the human motion detection unit 223 determines that the person D404, who is not conversing with the main target person A401, has not spoken within a predetermined first time period since the main target person A401 has spoken. If so, the process returns to S502.

Ｓ５０４～Ｓ５０５：人物の特定、動作回数のカウント
Ｓ５０４で、人物検出部２２２は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に発話した人物として人物Ｄ４０４を特定する。Ｓ５０５で、人物動作カウント部２２４は、人物Ｄ４０４の動作回数のカウントを１にして所定の第１の時間カウントを開始する。 S504-S505: Identification of Person, Counting of Number of Actions In S504, the person detection unit 222 identifies person D404 as a person who has spoken within a predetermined first period of time after person A401, who is the main target, has spoken. In S505, the human motion counting unit 224 counts the number of motions of the human D404 to 1 and starts counting a predetermined first time.

Ｓ５０６～Ｓ５０８：主対象以外の人物の動作回数の判定、グルーピング
主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、図８（ｄ）のように人物Ｄ４０４が、主対象である人物Ａ４０１に視線や顔を向けると、人物動作検出部２２３は、人物Ｄ４０４が主対象である人物Ａ４０１に視線や顔を向けたと判定する。すると、人物検出部２２２は、主対象である人物Ａ４０１に視線や顔を向けたのが、Ｓ５０３で検出された人物と同一人物である人物Ｄ４０４であることを検出し、人物動作カウント部２２４は人物Ｄ４０４の動作回数のカウント数を２にアップする。 S506-S508: Determining the Number of Actions of Persons Other than the Main Subject, Grouping Within a predetermined first time after the person A401, who is the main subject, has spoken, the person D404 is the main subject as shown in FIG. 8(d). When the person D404 directs his or her line of sight or face to the person A401 who is the main target, the human motion detection unit 223 determines that the person D404 turns his or her line of sight or face to the person A401, which is the main target. Then, the person detection unit 222 detects that it is the person D404, who is the same person as the person detected in S503, who turned his gaze and face toward the person A401, who is the main object, and the person action counting unit 224 The count number of the number of actions of the person D404 is increased to two.

人物動作カウント部２２４は、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に動作をした人物Ｄ４０４の動作回数のカウント数が所定回数（例えば、３回）に達したか否かを判定する。人物動作カウント部２２４は、人物Ｄ４０４の動作回数のカウント数が所定回数に達したと判定した場合（Ｓ５０６でＹＥＳ）、処理はＳ５０７に進む。また、人物動作カウント部２２４は、人物Ｄ４０４の動作回数のカウント数が所定回数に達していないと判定した場合（Ｓ５０６でＮＯ）、処理はＳ５０２に戻り、人物Ｄ４０４の動作検出を継続する。 The human motion counting unit 224 determines whether the counted number of motions of the person D404, who has made a motion within a predetermined first period of time after the person A401, who is the main target, has reached a predetermined number of times (for example, three times). determine whether or not When the human motion counting unit 224 determines that the counted number of motions of the person D404 has reached the predetermined number (YES in S506), the process proceeds to S507. If the human motion counting unit 224 determines that the number of motions of the person D404 has not reached the predetermined number (NO in S506), the process returns to S502 to continue detecting the motion of the person D404.

人物動作カウント部２２４は、人物Ｄ４０４の動作回数のカウント数が所定回数に達したと判定した場合（Ｓ５０６でＹＥＳ）、処理はＳ５０７に進む。 When the human motion counting unit 224 determines that the counted number of motions of the person D404 has reached the predetermined number (YES in S506), the process proceeds to S507.

Ｓ５０７では、会話グルーピング部２２６は、人物Ｄ４０４を、主対象である人物Ａ４０１と会話をしている人物と判定し、Ｓ５０８では、人物Ｄ４０４を、主対象である人物Ａ４０１の第１のグループにグルーピングを行う。 In S507, the conversation grouping unit 226 determines that the person D404 is having a conversation with the main target person A401, and in S508 groups the person D404 into the first group of the main target person A401. I do.

また、人物動作カウント部２２４は、人物Ｄ４０４の動作回数のカウント数が所定回数に達していないと判定した場合（Ｓ５０６でＮＯ）、例えば、図８（ｅ）のように、主対象である人物Ａ４０１が発話してから所定の第１の時間以内に、人物Ｄ４０４の動作が検出されなかった場合、処理はＳ５０２に戻る。 If the human motion counting unit 224 determines that the motion count of person D404 has not reached the predetermined number (NO in S506), for example because no further motion of person D404 was detected within the predetermined first time after the main target, person A401, spoke, as shown in FIG. 8(e), the process returns to S502.

＜想定シーン４＞次に、図９および図１０を参照して、本実施形態の想定シーン４における処理について説明する。 <Assumed Scene 4> Next, the processing in assumed scene 4 of this embodiment will be described with reference to FIGS. 9 and 10.

図９は、本実施形態の想定シーン４の説明図である。図１０は、本実施形態の想定シーン４の処理を示すフローチャートである。以下では、図９に例示する想定シーン４に沿って図１０の処理の流れを説明する。 FIG. 9 is an explanatory diagram of assumed scene 4 of this embodiment. FIG. 10 is a flowchart showing the processing of assumed scene 4 of this embodiment. The flow of processing in FIG. 10 is described below along assumed scene 4 illustrated in FIG. 9.

図９は、人物Ａ４０１、人物Ｂ４０２、人物Ｃ４０３が第１のグループとして会話しており、人物Ｄ４０４、人物Ｅ４０５が第２のグループとして会話している。 In FIG. 9, person A401, person B402, and person C403 are having a conversation as a first group, and person D404 and person E405 are having a conversation as a second group.

Ｓ１００１：主対象とグルーピング済みの人物を検出
図９（ａ）に示すように、主対象である人物Ａ４０１と人物Ｂ４０２と人物Ｃ４０３は既にグルーピングされている状態である。この状態で、図９（ｂ）に示すように、既にグルーピングされている人物Ｃ４０３は会話をしている第１のグループから外れ、人物Ｄ４０４と人物Ｅ４０５が会話をしている第２のグループに入ったとする。 S1001: Detection of the main target and grouped persons. As shown in FIG. 9(a), the main target, person A401, person B402, and person C403 have already been grouped. Suppose that, in this state, the already grouped person C403 leaves the conversing first group and joins the second group, in which person D404 and person E405 are conversing, as shown in FIG. 9(b).

Ｓ１００２：主対象とグルーピング済みではない人物を検出
主対象である人物Ａ４０１または、主対象である人物Ａ４０１とグルーピング済の主対象ではない人物Ｂ４０２が発話すると、人物動作検出部２２３は、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２が発話したことを検出する。 S1002: Detection of a person not grouped with the main target. When the main target, person A401, or person B402, who is not the main target but has been grouped with person A401, speaks, the human motion detection unit 223 detects that person A401 or person B402 has spoken.

Ｓ１００３～Ｓ１００６：人物の動作検出、グルーピングの維持または解除
Ｓ１００３では、人物動作検出部２２３は、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２が発話してから所定の第１の時間以内に、画像中の主対象である人物Ａ４０１以外および主対象ではない人物Ｂ４０２以外の人物が動作を行ったか否かを判定する。 S1003 to S1006: Detection of a person's motion, and maintenance or release of grouping. In S1003, the human motion detection unit 223 determines whether a person in the image other than the main target, person A401, and other than person B402, who is not the main target, has moved within the predetermined first time after person A401 or person B402 spoke.

ここで、図９（ｂ）のように人物Ｃ４０３が、主対象である人物Ａ４０１と主対象ではない人物Ｂ４０２が会話している第１のグループから抜けた場合、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２が発話しても、所定の第１の時間以内にそれに対する人物Ｃ４０３の動作は所定回数行われない。したがって、会話グルーピング部２２６は、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２が発話してから所定の第１の時間以内に、画像中の主対象である人物Ａ４０１以外および主対象ではない人物Ｂ４０２以外の人物Ｃ４０３の動作が検出されないと判定し（Ｓ１００３でＮＯ）、処理はＳ１００５に進む。 Here, when person C403 has left the first group in which the main target, person A401, and person B402, who is not the main target, are conversing, as shown in FIG. 9(b), person C403 does not respond with the predetermined number of motions within the predetermined first time even if person A401 or person B402 speaks. The conversation grouping unit 226 therefore determines that no motion of person C403, who is neither person A401 nor person B402, has been detected within the predetermined first time after person A401 or person B402 spoke (NO in S1003), and the process proceeds to S1005.

Ｓ１００５では、会話グルーピング部２２６は、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２が発話してから所定の第１の時間よりも長い所定の第２の時間以内に、画像中の主対象である人物Ａ４０１以外および主対象ではない人物Ｂ４０２以外の人物Ｃ４０３が動作を行ったか否かを判定する。 In S1005, the conversation grouping unit 226 selects the main subject in the image within a predetermined second time longer than the predetermined first time after the person A401 who is the main subject or the person B402 who is not the main subject has spoken. It is determined whether or not a person C403 other than the person A401 who is the main target and the person C403 other than the person B402 who is not the main object has performed an action.

この場合、図９（ｃ）のように人物Ｃ４０３は主対象である人物Ａ４０１と主対象ではない人物Ｂ４０２が会話している第１のグループから抜けているので、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２の発話に対して、所定の第１の時間よりも長い所定の第２の時間以内であっても人物Ｃ４０３の動作は所定回数行われない。よって、会話グルーピング部２２６は、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２が発話してから所定の第１の時間よりも長い所定の第２の時間以内に、画像中の主対象である人物Ａ４０１以外および主対象ではない人物Ｂ４０２以外の人物Ｃ４０３の動作が検出されないと判定し（Ｓ１００５でＮＯ）、Ｓ１００６で、人物Ｃ４０３のグルーピングを解除する。 In this case, as shown in FIG. 9C, the person C403 is excluded from the first group in which the main object person A401 and the non-main object person B402 are having a conversation. The action of the person C403 is not performed a predetermined number of times even within a predetermined second time period longer than the first predetermined time period with respect to the utterance of the person B402 who is not the object. Therefore, the conversation grouping unit 226 determines whether the person A 401, who is the main subject, or the person B 402, who is not the main subject, has spoken within a predetermined second time period longer than the first predetermined time period. It is determined that the action of a person C403 other than a person A401 and person B402 who is not the main target is not detected (NO in S1005), and the grouping of the person C403 is released in S1006.

また、会話グルーピング部２２６は、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２が発話してから所定の第１の時間よりも長い所定の第２の時間以内に、画像中の主対象である人物Ａ４０１以外および主対象ではない人物Ｂ４０２以外の人物Ｃ４０３の動作が検出されたと判定した場合（Ｓ１００５でＹＥＳ）、処理はＳ１００３に戻り、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２が発話してから所定の第１の時間以内に、画像中の主対象である人物Ａ４０１以外および主対象ではない人物Ｂ４０２以外の人物Ｃ４０３が動作を行ったか否かを再度判定する。 In addition, the conversation grouping unit 226 selects the main target person A401 or the non-main target person B402 as the main target in the image within a predetermined second time longer than the predetermined first time after speaking. If it is determined that the action of a person C403 other than a certain person A401 and person B402 who is not the main subject has been detected (YES in S1005), the process returns to S1003, and the person A401 who is the main subject or the person B402 who is not the main subject is detected. It is determined again whether or not the person C403 other than the person A401 who is the main object in the image and the person B402 who is not the main object has performed an action within a predetermined first time after speaking.

そして、会話グルーピング部２２６は、主対象である人物Ａ４０１または主対象ではない人物Ｂ４０２が発話してから所定の第１の時間以内に、画像中の主対象である人物Ａ４０１以外および主対象ではない人物Ｂ４０２以外の人物Ｃ４０３が動作を行ったと判定した場合（Ｓ１００３でＹＥＳ）、Ｓ１００４で、人物Ｃ４０３のグルーピングを維持する（Ｓ１００４）。 If the conversation grouping unit 226 determines that person C403, who is neither the main target, person A401, nor person B402, has moved within the predetermined first time after person A401 or person B402 spoke (YES in S1003), it maintains the grouping of person C403 in S1004.

以上のようにして、グルーピング済みの人物Ｃ４０３に対してグルーピングを維持するか解除するかを決定し再度グルーピングを行う。 As described above, it is determined whether grouping is to be maintained or canceled for the person C403 who has already been grouped, and grouping is performed again.
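
The maintain-or-release decision of S1003 to S1006 can be sketched as a small Python function. This is an illustration under assumptions: the function name `review_member`, the return labels, and the 5-second and 15-second windows are hypothetical, and the sketch treats a single detected motion as sufficient, whereas the description also refers to a predetermined number of motions.

```python
def review_member(utterance_t, member_motion_times,
                  first_time=5.0, second_time=15.0):
    """Decide what to do with an already grouped person.

    Returns "keep" if the person moved within `first_time` seconds of
    the utterance (YES in S1003, S1004), "recheck" if they moved only
    within the longer `second_time` (YES in S1005, back to S1003), and
    "release" otherwise (NO in S1005, S1006).
    """
    if second_time <= first_time:
        raise ValueError("second_time must be longer than first_time")
    delays = [t - utterance_t for t in member_motion_times if t > utterance_t]
    if any(d <= first_time for d in delays):
        return "keep"
    if any(d <= second_time for d in delays):
        return "recheck"
    return "release"


# Person C reacts 3 s after the utterance, 10 s after it, or not at all.
decisions = [
    review_member(0.0, [3.0]),
    review_member(0.0, [10.0]),
    review_member(0.0, []),
]
```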

以上のようにしてグルーピングされた人物が会話している音声に対して音声処理部１０４により音声処理を行う。 The voice processing unit 104 performs voice processing on the voices of the persons grouped as described above, in which they are conversing.

音声処理の例としては、既存の音声分離技術を用いて、個々の人物の音声分離を行い、グルーピングされた人物のみの音声を強調したり、あるいは、グルーピングされていない人物の音声を低減させたりする。強調処理は、グルーピングが行われた人物の音声に対して、ゲインを増幅させる処理、グルーピングが行われた人物以外の人物の音声に対してゲインを減衰させる処理、グルーピングが行われた人物の音声帯域のゲインを増幅させる処理の少なくともいずれかを含む。これらの処理は、グルーピングされた人物の音声が聴き取りやすくなる処理であれば、処理方法はこれに限られない。 As an example of speech processing, existing speech separation technology is used to separate the speech of each individual person, to emphasize the speech of only grouped persons, or to reduce the speech of ungrouped persons. do. Enhancement processing includes processing to amplify the gain of the voice of the grouped person, processing to attenuate the gain of the voice of the person other than the grouped person, and processing of the voice of the grouped person. At least one of processing for amplifying the gain of the band is included. The processing method is not limited to this as long as the processing makes it easier to hear the voices of the grouped persons.
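
As a rough illustration of the gain-based emphasis described above, the following Python sketch boosts the separated tracks of grouped persons, attenuates the others, and mixes the result into one signal. It assumes the per-person tracks have already been separated by some existing technique; the function name, the track format (lists of float samples), and the decibel values are hypothetical.

```python
def mix_with_emphasis(tracks, grouped, boost_db=6.0, cut_db=-12.0):
    """Mix per-person tracks, emphasizing the grouped persons.

    `tracks` maps a person ID to a list of float samples. Grouped
    persons get `boost_db` of gain, everyone else gets `cut_db`.
    """
    def to_linear(db):
        return 10.0 ** (db / 20.0)  # decibels to amplitude ratio

    length = max((len(s) for s in tracks.values()), default=0)
    mixed = [0.0] * length
    for person, samples in tracks.items():
        gain = to_linear(boost_db if person in grouped else cut_db)
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed


tracks = {"A": [1.0, 0.0], "D": [0.0, 1.0]}
mixed = mix_with_emphasis(tracks, grouped={"A"}, boost_db=20.0, cut_db=-20.0)
```

A real pipeline would typically follow this with level control such as the ALC 219 so that the boosted mix does not clip; that step is omitted here.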

なお、本実施形態では、実際に会話しているグループ数、及び、グループ内の人物が何人であってもよい。 Note that in this embodiment, the number of groups actually conversing and the number of persons in each group may be any number.

また、本実施形態では、会話をしているグループとして判定するための、動作回数のカウント数を３回としたが、この限りではない。 In addition, in the present embodiment, the number of times of operation is counted as 3 to determine that the group is having a conversation, but this is not the only option.

また、本実施形態では、発話や頷き動作、発話者に視線や顔を向ける動作を示したが、これらの例に限らず、会話をしている際に人間が行う仕種（しぐさ）を検出できればよい。例えば、発話や発話における唇や頭や手などの身体の動き、視線や顔の動き、更に、顔を縦に振る又は頷く肯定動作、または、顔を横に捻る又は振る否定動作であってもよい。更には、喜怒哀楽の表情の変化などでもよい。 In this embodiment, speaking, nodding, and turning the gaze or face toward the speaker have been described, but the motions are not limited to these examples; it suffices to detect gestures that people make during conversation. Examples include speech and the accompanying body movements of the lips, head, or hands; movements of the gaze or face; affirmative motions such as shaking the head vertically or nodding; and negative motions such as twisting or shaking the head sideways. Changes in facial expression that reflect emotions may also be used.

また、撮像装置１００が画像に含まれる特定の被写体を主被写体として追尾する機能を有する場合、撮像装置１００がライブビューを表示する際に、画像中の主対象の人物（顔）や主対象と同じグループにグルーピングされた人物（顔）を主被写体として追尾枠を表示し、ＡＦ処理やＡＥ処理やズーム処理の対象としてもよい。 Further, when the imaging apparatus 100 has a function of tracking a specific subject in the image as a main subject, then when displaying a live view the imaging apparatus 100 may treat the main target person (face) in the image, and persons (faces) grouped into the same group as the main target, as main subjects, display tracking frames for them, and make them targets of AF processing, AE processing, and zoom processing.

以上のように、本実施形態によれば、撮影画像から主対象の人物と主対象の人物と会話している人物を適切に判別することができるので、主対象と会話している人物を同一のグループとして適切にグルーピングを行うことが可能となる。そして、同じグループにグルーピングされた人物の会話音声を分離し強調して録音することにより、ユーザの興味のない人物の発話音声が録音されてしまったり、会話している人物の音声が聞き取り難くなったりする状況を改善することができる。 As described above, according to this embodiment, the main target person and the persons conversing with the main target person can be appropriately discriminated from the captured image, so persons conversing with the main target can be appropriately grouped into the same group. By separating, emphasizing, and recording the conversational voices of the persons in the same group, it is possible to improve situations in which the speech of persons the user is not interested in is recorded or the voices of the conversing persons are hard to hear.

［実施形態２］次に、実施形態２について説明する。 [Embodiment 2] Next, Embodiment 2 will be described.

In Embodiment 1, a person's motions must be detected a predetermined number of times or more. Consequently, when the persons conversing with the main target are determined in real time during shooting, it takes time from the start of the conversation until a determination result is obtained. Embodiment 2 therefore describes a method that uses machine learning to estimate, before a conversation begins, whether the main target and another person will start a conversation.
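The latency mentioned above follows from the counting rule of Embodiment 1: a person is grouped only after enough motions have been observed within the first time following an utterance. The sketch below illustrates that rule only; the function name, the 5-second window, and the count of 3 are hypothetical values chosen for illustration, not figures from the specification.

```python
def is_conversation_partner(utterance_time, motion_times,
                            first_time=5.0, required_count=3):
    """Return True when enough motions fall inside the window after the utterance.

    first_time and required_count are illustrative assumptions.
    """
    window_end = utterance_time + first_time
    count = sum(1 for t in motion_times if utterance_time <= t <= window_end)
    return count >= required_count


# Three motions within 5 s of an utterance at t = 10 s: grouped with the main target.
grouped = is_conversation_partner(10.0, [10.5, 12.0, 14.9, 16.0])
# Only one motion inside the window: not grouped yet.
not_grouped = is_conversation_partner(10.0, [10.5, 16.0])
```

Because the decision needs several observations, it cannot be made at the moment the conversation starts, which is the motivation for the predictive approach of Embodiment 2.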

以下では、本発明の電子機器をデジタルカメラなどの撮像装置、情報処理装置をサーバに適用した実施形態について説明する。 Embodiments in which the electronic device of the present invention is applied to an imaging device such as a digital camera, and an information processing device is applied to a server will be described below.

図１１は、実施形態２の機械学習を実行するサーバ１１０１のハードウェア構成を例示している。 FIG. 11 illustrates the hardware configuration of a server 1101 that executes machine learning according to the second embodiment.

ＣＰＵ１１０２は、記憶デバイス１１０４に格納されているプログラムを実行しサーバ１１０１の各構成要素を制御する演算処理装置である。また、ＣＰＵ１１０２は、機械学習を行うための各種演算を実行する。 The CPU 1102 is an arithmetic processing unit that executes programs stored in the storage device 1104 and controls each component of the server 1101 . The CPU 1102 also executes various calculations for machine learning.

ＲＡＭ１１０３は、記憶デバイス１１０４に格納されているプログラムデータ、動画データ、音声データを一時的に記憶するメモリである。 A RAM 1103 is a memory that temporarily stores program data, video data, and audio data stored in the storage device 1104 .

The storage device 1104 is composed of a magnetic disk, an optical disk, a semiconductor memory, or the like, and stores program data for performing machine learning, video data and audio data for generating learning input data, and the trained model obtained by machine learning.

ＧＰＵ１１０５は、記憶デバイス１１０４に格納されているプログラムを実行し、ＣＰＵ１１０２と協働して機械学習を行うための各種演算を実行する画像描画用の演算処理装置である。 The GPU 1105 is an arithmetic processing unit for image rendering that executes programs stored in the storage device 1104 and executes various operations for machine learning in cooperation with the CPU 1102 .

The bus 1106 is a data transmission path that allows data to be exchanged among the components of the server 1101.

図１２は、実施形態２の撮像装置１００の推定処理およびサーバ１１０１の学習処理を実行するためのソフトウェア構成を例示している。 FIG. 12 illustrates a software configuration for executing the estimation processing of the imaging device 100 and the learning processing of the server 1101 of the second embodiment.

The software 1211 that executes the estimation processing of the imaging device 100 is realized when an arithmetic processing unit, such as the CPU of the control unit 111 in FIG. 1, executes a program stored in a storage device such as a ROM. The software 1201 that executes the learning processing of the server 1101 is realized when the CPU 1102 or the GPU 1105 in FIG. 11 executes a program stored in the storage device 1104.

Because the GPU 1105 can compute efficiently by processing larger amounts of data in parallel, it is effective to perform the computation on the GPU 1105 when training is repeated many times with a learning model, as in machine learning. In Embodiment 2, the processing of the learning unit 1206 therefore uses the GPU 1105 in addition to the CPU 1102. Specifically, when a learning program that includes a learning model is executed, the CPU 1102 and the GPU 1105 perform the computation cooperatively to carry out the learning. The processing of the learning unit 1206 may instead be computed by the CPU 1102 alone or by the GPU 1105 alone. The functions of the software 1201 and 1211 in FIG. 12 are described later with reference to FIGS. 14 and 15.

図１３は、実施形態２の学習済モデルの概念図である。図１３の学習済モデルは、図１２のサーバの学習部１２０６で行われる機械学習により生成され、撮像装置１００の推定部１２１８が学習済モデルを利用した推定処理を行う。 FIG. 13 is a conceptual diagram of a trained model according to the second embodiment. The learned model in FIG. 13 is generated by machine learning performed by the learning unit 1206 of the server in FIG. 12, and the estimating unit 1218 of the imaging device 100 performs estimation processing using the learned model.

学習済モデル１３０３の入力データとして、人物を含む画像データおよび、判定対象となる人物を指定する人物指定情報が入力される。人物を含む画像データは静止画であってもよいし、複数の静止画からなる動画であってもよい。 As input data for the trained model 1303, image data including a person and person designation information for designating a person to be determined are input. Image data including a person may be a still image, or may be a moving image including a plurality of still images.

The person designation information is, for example, coordinate information indicating where in the input image the faces of at least two persons, for whom it is to be determined whether they will start a conversation, are located. The person designation information is not limited to coordinate information; feature information of a face image or the like may be used instead. The number of persons designated as determination targets may also be other than two.

学習済モデル１３０３の出力データとして、入力データで指定した判定対象の２名の人物が会話を開始するか否かの推定結果が出力される。実施形態２では、判定対象の２名の人物が直接会話を開始しない場合でも、２名の人物が共通の第３者と会話を開始する場合には、会話を開始すると判定してよい。また、入力データが既に会話を開始した後の画像である場合でも、会話を開始すると判定してよい。 As output data of the trained model 1303, an estimation result as to whether or not the two persons to be determined specified by the input data will start a conversation is output. In the second embodiment, even if the two persons to be determined do not directly start a conversation, it may be determined to start a conversation if the two persons start a conversation with a common third person. Also, even if the input data is an image after the conversation has already started, it may be determined that the conversation will start.

２名の人物が会話を開始する場合は、会話を開始する直前または直後に、相手に顔を向ける、相手に近寄る、相手の体に触れる、表情が変化するなどの視覚的特徴があるので、入力データである“人物を含む画像データ”と、出力データである“会話を開始するか否かの判定結果”には相関関係がある。 When two people start a conversation, there are visual characteristics such as facing the other person, approaching the other person, touching the other person's body, and changing facial expressions immediately before or after starting the conversation. There is a correlation between "image data including a person", which is input data, and "determination result as to whether or not to start a conversation", which is output data.

FIG. 14 is a flowchart showing the learning processing performed by the software 1201 in FIG. 12. The processing of FIG. 14 is realized when the CPU 1102 and the GPU 1105 of the server 1101 shown in FIG. 11 cooperate to execute the server software 1201 shown in FIG. 12.

図１４（ａ）は正解シーン（会話を開始すると予測されるシーン）の学習処理を示すフローチャートである。 FIG. 14(a) is a flow chart showing a learning process for a correct scene (a scene predicted to start a conversation).

In S1401, the learning data generation unit 1203 reads video data showing persons from the data storage unit 1202. The data storage unit 1202 stores video data that was captured by another device and copied in advance. Alternatively, the data storage unit 1202 may store video data downloaded in advance over a network from another device.

In S1402, the main target selection unit 1204 selects the person to be the main target from the video data read in S1401. Embodiment 2 places no restriction on the selection criterion; for example, the person whose face is positioned closest to the center of the image is selected. Alternatively, the person whose face is positioned nearest the front of the image is selected.
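The two example criteria in S1402 can be sketched as follows. This is a minimal illustration, not the specification's implementation: the `Face` record, its fields, and the use of face-box size as a proxy for "nearest the front" are all assumptions introduced here.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Face:
    person_id: int
    x: float     # face-box centre, in pixels
    y: float
    size: float  # face-box height in pixels (assumed proxy for nearness)


def select_main_by_center(faces, width, height):
    """Pick the face whose centre is closest to the image centre."""
    cx, cy = width / 2, height / 2
    return min(faces, key=lambda f: hypot(f.x - cx, f.y - cy))


def select_main_by_nearness(faces):
    """Pick the largest face, assuming larger means nearer to the camera."""
    return max(faces, key=lambda f: f.size)


faces = [Face(0, 100, 120, 40), Face(1, 330, 250, 60), Face(2, 600, 300, 90)]
main_center = select_main_by_center(faces, width=640, height=480)
main_near = select_main_by_nearness(faces)
```

The two criteria can disagree, as they do for this sample data, so a real system would have to fix one of them or combine them.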

In S1403, the conversation scene determination unit 1205 identifies, in the video data read in S1401, the scenes in which the main target selected in S1402 is conversing with another person, and also identifies the persons conversing with the main target. Embodiment 2 places no restriction on how conversation scenes or conversing persons are identified; for example, the persons conversing with the main target are grouped by the method described in Embodiment 1, which identifies both the conversation scene and the conversing persons.

Ｓ１４０４では、学習用データ生成部１２０３は、Ｓ１４０３で判別された会話シーンの開始直前の画像データを学習用入力データとして抽出する。ここで抽出する画像データは静止画であってもよいし、複数の静止画からなる動画であってもよい。また、会話直前のデータに限らず、会話開始後の画像が含まれてよい。 In S1404, the learning data generation unit 1203 extracts the image data immediately before the start of the conversation scene determined in S1403 as learning input data. The image data extracted here may be a still image, or may be a moving image consisting of a plurality of still images. In addition, the data is not limited to the data immediately before the conversation, and may include an image after the conversation starts.

In S1405, the learning data generation unit 1203 generates, as learning input data, person designation information that designates the main target selected in S1402 and the person identified in S1403 as conversing with the main target. The format of the person designation information is not restricted; in Embodiment 2 it is coordinate information indicating the face positions, in the image extracted in S1404, of the main target and of one person conversing with the main target.
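Steps S1404 and S1405 together produce one positive training example: frames taken just before the conversation starts, plus the face coordinates of the two designated persons. The sketch below shows one possible shape for such an example; the dictionary layout, the one-second window, and the function names are assumptions made for illustration.

```python
def frames_before(start_frame, fps, seconds=1.0):
    """Indices of the frames in the window that ends just before start_frame."""
    n = int(round(fps * seconds))
    first = max(0, start_frame - n)
    return list(range(first, start_frame))


def make_positive_sample(start_frame, fps, main_face_xy, partner_face_xy):
    """Bundle the pre-conversation frames with the person designation info."""
    return {
        "frames": frames_before(start_frame, fps),   # S1404: image data
        "persons": [main_face_xy, partner_face_xy],  # S1405: face coordinates
        "label": 1,                                  # conversation starts
    }


sample = make_positive_sample(
    start_frame=90, fps=30,
    main_face_xy=(320, 240), partner_face_xy=(500, 230),
)
```

Since the text allows images after the start of the conversation to be included as well, the window could be extended past `start_frame` without changing the structure.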

In S1406, the learning unit 1206 performs learning using, as input data, the image extracted in S1404 and the coordinate information generated in S1405 for the main target and one person conversing with the main target, and using, as teacher data, information indicating that the scene is one in which a conversation starts. The trained model generated as the learning result is stored in the learning model storage unit 1207.

Specific machine learning algorithms include the nearest neighbor method, the naive Bayes method, decision trees, and support vector machines. Another example is deep learning, which uses a neural network to generate by itself the feature quantities and connection weighting coefficients used for learning. Any of these algorithms that is usable may be applied to this embodiment as appropriate.

The learning unit 1206 may include an error detection unit and an update unit. The error detection unit obtains the error between the teacher data and the output data that the output layer of the neural network produces in response to the input data supplied to the input layer. The error detection unit may use a loss function to calculate this error. Based on the error obtained by the error detection unit, the update unit updates the connection weighting coefficients and the like between the nodes of the neural network so that the error decreases. The update unit performs this update using, for example, error backpropagation, a technique that adjusts the connection weighting coefficients and the like between the nodes of each neural network so that the error decreases.
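The division of labor between the error detection unit (compute a loss against the teacher data) and the update unit (adjust the weights so the loss decreases) can be illustrated with a deliberately small model. The sketch below trains a single-layer logistic classifier by gradient descent, which is the one-layer special case of backpropagation; it is not the network of the specification. The two input features (normalized face distance and a facing score) and all numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Model output: estimated probability that a conversation starts."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, data):
    """Error detection: mean cross-entropy between outputs and teacher labels."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(data, lr=0.5, epochs=500):
    """Update: move the weights against the gradient of the loss."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in data:
            err = predict(w, b, x) - y  # gradient of the loss w.r.t. the pre-activation
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical features: (normalized face distance, mutual facing score).
data = [((0.2, 0.9), 1), ((0.3, 0.8), 1), ((0.8, 0.1), 0), ((0.9, 0.2), 0)]
initial_loss = loss([0.0, 0.0], 0.0, data)
w, b = train(data)
final_loss = loss(w, b, data)
```

A deep network would replace `predict` with several layers and propagate `err` backward through them, but the loop of "measure error, then update weights" stays the same.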

By repeatedly executing the processing of FIG. 14(a) on a large number of video data items, correct scenes (scenes in which a conversation is predicted to start) are learned.

図１４（ｂ）は不正解シーン（会話を開始しないと予測されるシーン）の学習処理を示すフローチャートである。 FIG. 14(b) is a flow chart showing learning processing of an incorrect scene (a scene predicted not to start a conversation).

図１４（ｂ）におけるＳ１４１１、Ｓ１４１２、Ｓ１４１３、Ｓ１４１４の処理は、図１４（ａ）のＳ１４０１、Ｓ１４０２、Ｓ１４０３、Ｓ１４０４と同じである。 The processing of S1411, S1412, S1413, and S1414 in FIG. 14B is the same as S1401, S1402, S1403, and S1404 in FIG. 14A.

In S1415, the learning data generation unit 1203 generates, as learning input data, person designation information that designates the main target selected in S1412 and one person other than the persons identified in S1413 as conversing with the main target. The format of the person designation information is not restricted; in Embodiment 2 it is coordinate information indicating the face positions, in the image extracted in S1414, of the main target and of one person who is not conversing with the main target.

In S1416, the learning unit 1206 performs machine learning using, as input data, the image extracted in S1414 and the coordinate information generated in S1415 for the main target and one person who is not conversing with the main target, and using, as teacher data, information indicating that the scene is one in which no conversation starts. The trained model generated as the learning result is stored in the learning model storage unit 1207. The specific machine learning algorithm is the same as in S1406.

Ｓ１４１７では、学習用データ生成部１２０３は、Ｓ１４１３で判別された会話シーンと所定時間離れたシーンの画像を学習用入力データとして抽出する。ここで抽出する画像データは静止画であってもよいし、複数の静止画からなる動画であってもよい。 In S1417, the learning data generation unit 1203 extracts, as learning input data, an image of a scene separated by a predetermined time from the conversation scene determined in S1413. The image data extracted here may be a still image, or may be a moving image consisting of a plurality of still images.

Ｓ１４１８では、学習用データ生成部１２０３は、Ｓ１４１２で選定された主対象と、主対象以外の人物１名を指定する人物指定情報を学習用入力データとして生成する。人物指定情報の形式に限定はないが、実施形態２では、Ｓ１４１７で抽出された画像において、主対象と、主対象以外の人物１名の顔の位置を示す座標情報を学習用入力データとする。 In S1418, the learning data generation unit 1203 generates the main target selected in S1412 and person designation information designating one person other than the main target as learning input data. Although the format of the person designation information is not limited, in the second embodiment, coordinate information indicating the face positions of the main target and one person other than the main target in the image extracted in S1417 is used as input data for learning. .

In S1419, the learning unit 1206 performs learning using, as input data, the image extracted in S1417 and the coordinate information generated in S1418 for the main target and one person other than the main target, and using, as teacher data, information indicating that the scene is one in which no conversation starts. The trained model generated as the learning result is stored in the learning model storage unit 1207. The specific machine learning algorithm is the same as in S1406.

図１４（ｂ）の処理を多数の動画データを用いて繰り返し実行することにより、不正解シーン（会話を開始しないと予測されるシーン）の学習を行う。 By repeatedly executing the processing of FIG. 14B using a large number of moving image data, the learning of incorrect scenes (scenes predicted not to start conversation) is performed.
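FIG. 14(b) yields two kinds of negative examples: the pre-conversation image paired with a bystander (S1414 to S1416), and an image far from any conversation paired with any other person (S1417 to S1419). A rough sketch of that sampling is given below. The function name, the dictionary layout, and the 10-second gap are assumptions; the specification designates one person per sample, whereas this sketch simply enumerates every eligible person.

```python
def make_negative_samples(people, main_id, partner_ids,
                          conv_start, conv_end, fps, gap_seconds=10.0):
    """Build label-0 samples from one annotated conversation scene.

    people: ids of persons visible in the clip (assumed constant over time).
    conv_start, conv_end: frame indices of the conversation; conv_start > 0.
    """
    samples = []

    # S1414-S1415: frame just before the conversation, main target + bystander.
    for pid in people:
        if pid != main_id and pid not in partner_ids:
            samples.append({"frame": conv_start - 1,
                            "pair": (main_id, pid), "label": 0})

    # S1417-S1418: frame a fixed time away from the conversation,
    # main target + any other person (including the later partner).
    far_frame = conv_end + int(fps * gap_seconds)
    for pid in people:
        if pid != main_id:
            samples.append({"frame": far_frame,
                            "pair": (main_id, pid), "label": 0})
    return samples


negatives = make_negative_samples(
    people=[0, 1, 2], main_id=0, partner_ids={1},
    conv_start=90, conv_end=300, fps=30,
)
```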

図１５は、図１２のソフトウェア１２１１による推定処理を示すフローチャートである。図１５の処理は、図１に示す撮像装置１００の制御部１１１が図１２に示す撮像装置のソフトウェア１２１１を実行することにより実現される。 FIG. 15 is a flow chart showing estimation processing by the software 1211 in FIG. The processing in FIG. 15 is implemented by the control unit 111 of the imaging device 100 shown in FIG. 1 executing the software 1211 of the imaging device shown in FIG.

Ｓ１５０１では、撮像データ取得部１２１３は、図１の撮像部１０１により生成される画像データを取得する。 In S1501, the imaging data acquisition unit 1213 acquires image data generated by the imaging unit 101 in FIG.

Ｓ１５０２では、主対象選定部１２１２は、図２の主対象選定部２２５の機能を用いて、Ｓ１５０１で取得した画像データにおける主対象を選定する。 In S1502, the main subject selection unit 1212 selects a main subject in the image data acquired in S1501 using the function of the main subject selection unit 225 in FIG.

Ｓ１５０３では、入力データ生成部１２１６は、Ｓ１５０２で選定された主対象が人物であるか否かを判定し、人物であると判定した場合は、処理はＳ１５０４に進み、人物でないと判定した場合は、処理はＳ１５０１に戻る。 In S1503, the input data generation unit 1216 determines whether or not the main object selected in S1502 is a person. , the process returns to S1501.

In S1504, the input data generation unit 1216 generates, as input data for the estimation performed in S1505, information designating the main target selected in S1502 and a person in the image. The format of the information designating the persons is not restricted; in Embodiment 2 it is coordinate information indicating the face positions, in the image acquired in S1501, of the main target and of one person other than the main target.

In S1505, the estimation unit 1218 estimates whether the main target and the designated person will start a conversation, using as input data the image acquired in S1501 and the coordinate information generated in S1504 that indicates the face positions of the main target and of one person other than the main target. The estimation uses the trained model stored in the learning model storage unit 1217. The learning model storage unit 1217 holds a copy, made in advance, of the trained model stored in the learning model storage unit 1207, in a storage device such as a ROM included in the control unit 111 in FIG. 1.

In S1506, the input data generation unit 1216 determines whether the estimation processing of S1505 has been performed for every person in the image acquired in S1501. If it determines that the estimation processing has been performed for every person, the process proceeds to S1507; if any person remains unprocessed, the process returns to S1504.

Ｓ１５０７では、推定部１２１８は、Ｓ１５０５の推定結果を参照し、主被写体が、Ｓ１５０１で取得した画像中の人物のうち１名以上と会話を開始するか否かを判定する。推定部１２１８は、会話を開始すると判定した場合は、処理はＳ１５０８に進み、会話を開始しないと判定した場合は、処理はＳ１５０１に戻る。 In S1507, the estimation unit 1218 refers to the estimation result in S1505 and determines whether or not the main subject will start a conversation with one or more persons in the image acquired in S1501. If the estimating unit 1218 determines to start the conversation, the process advances to S1508, and if it determines not to start the conversation, the process returns to S1501.

Ｓ１５０８では、録音設定制御部１２１４は、Ｓ１５０５で会話を開始すると判定された人物の声が聴き取りやすくなるように、図１の音声入力部１０３の特性を変更する。例えば、会話を開始すると判定された人物の位置に応じてマイクの感度や周波数特性、指向性を変更したり、人物の声が強調されるようにゲインや帯域を変更する。または、音声処理部１０４を用いて、人物の音声帯域を強調したりするなど、人物の声が聴き取りやすくなるような音声処理を行ってもよい。 In S1508, the recording setting control unit 1214 changes the characteristics of the voice input unit 103 in FIG. 1 so that the voice of the person determined to start a conversation in S1505 becomes easier to hear. For example, the sensitivity, frequency characteristics, and directivity of the microphone are changed according to the position of the person determined to start talking, or the gain and band are changed so as to emphasize the voice of the person. Alternatively, the audio processing unit 104 may be used to perform audio processing that makes it easier to hear the person's voice, such as emphasizing the person's voice band.

Ｓ１５０９では、撮影設定制御部１２１５は、Ｓ１５０５で会話を開始すると判定された人物が視認しやすくなるように図１の撮像部１０１の撮影設定を変更する。例えば、Ｓ１５０５で会話を開始すると判定された人物に焦点が合うように図２のレンズ部２０１を制御して合焦位置を調整する。または、Ｓ１５０５で会話を開始すると判定された人物が画角内に適切に配置されるように、図２のレンズ部２０１の焦点距離を制御して画角を調整する。 In S1509, the imaging setting control unit 1215 changes the imaging settings of the imaging unit 101 in FIG. 1 so that the person determined to start a conversation in S1505 can be easily visually recognized. For example, the focus position is adjusted by controlling the lens unit 201 in FIG. 2 so that the person determined to start talking in S1505 is in focus. Alternatively, the angle of view is adjusted by controlling the focal length of the lens unit 201 in FIG. 2 so that the person determined to start a conversation in S1505 is appropriately arranged within the angle of view.
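The control flow of S1503 to S1509 can be summarized as a per-frame loop over every person other than the main target. The sketch below follows that flow under several assumptions: the model is passed in as a callable returning a score, the 0.5 threshold is arbitrary, and `set_audio` and `set_camera` are hypothetical callbacks standing in for the recording setting control unit 1214 and the shooting setting control unit 1215.

```python
def estimate_partners(image, main_face, other_faces, model, threshold=0.5):
    """S1504 to S1506: score every (main target, other person) pair."""
    partners = []
    for face in other_faces:
        score = model(image, main_face, face)  # S1505: one estimate per pair
        if score >= threshold:
            partners.append(face)
    return partners


def process_frame(image, main_face, other_faces, model, set_audio, set_camera):
    """One pass of the loop; returns the persons predicted to converse."""
    if main_face is None:          # S1503: the main target is not a person
        return []
    partners = estimate_partners(image, main_face, other_faces, model)
    if partners:                   # S1507: at least one conversation predicted
        set_audio(partners)        # S1508: adjust recording settings
        set_camera(partners)       # S1509: adjust shooting settings
    return partners


def toy_model(image, a, b):
    """Stand-in for the trained model: horizontally close faces score high."""
    return 1.0 if abs(a[0] - b[0]) < 200 else 0.0


calls = []
result = process_frame(
    image=None,
    main_face=(320, 240),
    other_faces=[(400, 250), (50, 100)],
    model=toy_model,
    set_audio=lambda p: calls.append(("audio", list(p))),
    set_camera=lambda p: calls.append(("camera", list(p))),
)
```

Calling `process_frame` once per captured frame corresponds to the return to S1501 in the flowchart.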

なお、図１５の各処理ステップのうち、Ｓ１５０５については、機械学習された学習済みモデルを用いて処理を実行したが、ルックアップテーブル（ＬＵＴ）等のルールベースの処理を行ってもよい。その場合には、例えば、入力データと出力データとの関係をあらかじめＬＵＴとして作成する。そして、この作成したＬＵＴを図１の制御部１１１に含まれるＲＯＭ等の記憶装置に格納しておくとよい。Ｓ１５０５の処理を行う場合には、この格納されたＬＵＴを参照して、出力データを取得することができる。つまりＬＵＴは、上記処理ステップと同等の処理を行うためのプログラムとして、ＣＰＵあるいはＧＰＵなどと協働で動作することにより処理を行う。 Of the processing steps in FIG. 15, S1505 is executed using a machine-learned trained model, but rule-based processing such as a lookup table (LUT) may also be executed. In that case, for example, the relationship between the input data and the output data is created in advance as an LUT. Then, it is preferable to store the created LUT in a storage device such as a ROM included in the control unit 111 of FIG. When performing the processing of S1505, the output data can be acquired by referring to this stored LUT. That is, the LUT is a program for performing processing equivalent to the processing steps described above, and performs processing by operating in cooperation with the CPU or GPU.
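As one way to picture the rule-based alternative, the lookup table can be keyed on coarsely quantized features of the designated pair. The features used below (a distance bin relative to face size, and a flag for whether the two persons face each other), the bin boundaries, and every table entry are hypothetical; the specification says only that the input/output relationship is prepared in advance as an LUT.

```python
import math


def distance_bin(main_xy, other_xy, face_size):
    """Quantize the face-to-face distance, measured in face sizes."""
    d = math.hypot(other_xy[0] - main_xy[0], other_xy[1] - main_xy[1]) / face_size
    if d < 3:
        return "near"
    if d < 6:
        return "mid"
    return "far"


# Precomputed table: (distance bin, facing each other) -> conversation predicted.
CONVERSATION_LUT = {
    ("near", True): True,
    ("near", False): False,
    ("mid", True): True,
    ("mid", False): False,
    ("far", True): False,
    ("far", False): False,
}


def lookup_conversation(main_xy, other_xy, face_size, facing):
    """Replace the model call in S1505 with a table lookup."""
    return CONVERSATION_LUT[(distance_bin(main_xy, other_xy, face_size), facing)]


close_pair = lookup_conversation((320, 240), (440, 240), face_size=60, facing=True)
distant_pair = lookup_conversation((320, 240), (800, 240), face_size=60, facing=True)
```

A table like this is cheap to evaluate and easy to store in ROM, at the cost of ignoring any visual cue that was not chosen as a key.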

Through the processing of FIGS. 14 and 15, whether the main target will start a conversation with a person in the image can be estimated before the conversation starts (or immediately after it starts). This makes it possible to apply recording settings and shooting settings suited to a conversation scene in real time during shooting. Applying such settings in real time allows the conversation scene to be recorded with higher quality than when the audio and captured images are processed afterward. In addition, when the recorded audio and captured images are monitored in real time during shooting, it is possible to provide conversational audio that is easy to hear and video in which the conversing persons are easy to see.

[Other embodiments]

The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of that system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

発明は本実施形態に制限されるものではなく、発明の精神及び範囲から離脱することなく、様々な変更及び変形が可能である。従って、発明の範囲を公にするために請求項を添付する。 The invention is not limited to this embodiment, and various modifications and variations are possible without departing from the spirit and scope of the invention. Accordingly, the claims are appended to make public the scope of the invention.

100... imaging device (hardware), 102... image processing unit, 104... audio processing unit, 222... person detection unit, 223... person motion detection unit, 224... person utterance detection unit, 225... main target selection unit, 226... conversation grouping unit

Claims

An electronic device comprising:
person detection means for detecting persons from a captured image;
main target selection means for selecting a main target person from among the persons detected by the person detection means;
person motion detection means for detecting motions of the persons detected by the person detection means;
person motion counting means for counting the number of motions detected by the person motion detection means; and
grouping means for grouping a predetermined person as a person conversing with the main target when the number of motions of the predetermined person, detected within a predetermined first time after an utterance of the main target person is detected, reaches a predetermined number within the predetermined first time.

The electronic device according to claim 1, wherein, when the number of motions of another person, other than the main target and the predetermined person, detected within the predetermined first time after an utterance of the main target person or of the predetermined person grouped with the main target person is detected, reaches the predetermined number within the predetermined first time, the grouping means groups the other person as a person conversing with the main target.

The electronic device according to claim 1, wherein, when the number of motions of another person who is not conversing with the main target person, detected within the predetermined first time after an utterance of the main target person is detected, reaches the predetermined number within the predetermined first time, the grouping means groups the other person as a person conversing with the main target.

The electronic device according to claim 2 or 3, wherein the grouping means maintains the grouping of the other person when a motion of the other person, other than the main target and the predetermined person, is detected within the predetermined first time after an utterance of the main target or the predetermined person is detected, and
cancels the grouping of the other person when no motion of the other person is detected within a predetermined second time, longer than the first time, after an utterance of the main target or the predetermined person is detected.

The electronic device according to any one of claims 1 to 4, wherein the person motion detection means detects, as a motion of a person detected by the person detection means, at least one of: speaking; nodding; moving the face or neck up and down; shaking the face or neck from side to side; directing the line of sight or the face toward a person who is speaking; and expressing emotion.

The electronic device according to any one of claims 1 to 5, further comprising audio processing means for performing audio processing on voices uttered by the grouped persons, based on information on the persons grouped by the grouping means.

The electronic device according to claim 6, wherein the audio processing means performs at least one of: processing that amplifies the gain of the voices of the grouped persons; processing that attenuates the gain of the voices of persons other than the grouped persons; and processing that amplifies the gain of the voice band of the grouped persons.

An electronic device comprising:
imaging means;
person detection means for detecting persons from an image generated by the imaging means;
person designation means for designating, from the persons detected by the person detection means, a main target person and a person other than the main target; and
estimation means for estimating whether the persons designated by the person designation means will start a conversation, based on a trained model for estimating whether designated persons will start a conversation, using as input data an image that includes the designated persons and information on the designated persons.

The electronic device according to claim 8, further comprising: audio input means for inputting audio while video data is being shot; and control means for controlling the audio input means,
wherein the control means changes any of the frequency characteristics, directivity, and gain of the audio input means according to the estimation result of the estimation means.

The electronic device according to claim 9, further comprising audio processing means for processing the audio input by the audio input means,
wherein the control means controls the audio processing means so as to adjust the gain of the audio of a scene in which a person is speaking, according to the estimation result of the estimation means.

The electronic device according to claim 8, further comprising: a lens unit for bringing an optical image of a subject into the imaging means; and lens control means for controlling the lens unit,
wherein the lens control means changes the focal length or the in-focus position of the lens unit according to the estimation result of the estimation means.

An information processing device that generates a trained model for estimating whether a specific person in an image and another person will start a conversation, comprising:
storage means for storing video data generated by imaging means;
person detection means for detecting persons from the video data stored in the storage means;
selection means for selecting a specific person from among the persons detected by the person detection means;
conversation scene determination means for identifying, from the video data stored in the storage means, a scene in which the specific person is conversing with another person; and
learning data generation means for using an image immediately before or after the conversation scene identified by the conversation scene determination means as learning input data for a scene in which a conversation is predicted to start, and for using an image of a scene separated by a predetermined time from the conversation scene identified by the conversation scene determination means as learning input data for a scene in which a conversation is predicted not to start.

A method of controlling an electronic device, comprising:
a person detection step of detecting a person from the captured image;
a main target selection step of selecting a main target person from among the persons detected by the person detection step;
a person motion detection step of detecting the motion of the person detected by the person detection step;
a person motion counting step of counting the number of motions detected by the person motion detection step; and
a grouping step of grouping a predetermined person as a person conversing with the main target when the number of motions of the predetermined person, detected within a predetermined first time after an utterance of the main target person is detected, reaches a predetermined number within the predetermined first time.
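
The counting and grouping steps above can be sketched as follows. This is a minimal illustration under stated assumptions: motions arrive as timestamped events per person, and the window boundaries are inclusive. The names `MotionEvent`, `group_conversation_partners`, `first_time` and `required_count` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class MotionEvent:
    person_id: str    # identifier of a detected person other than the main target
    timestamp: float  # time at which the motion was detected, in seconds


def group_conversation_partners(
    utterance_time: float,
    motions: Iterable[MotionEvent],
    first_time: float,
    required_count: int,
) -> Set[str]:
    """Return the persons grouped as conversing with the main target."""
    window_end = utterance_time + first_time
    counts: Dict[str, int] = {}
    for event in motions:
        # Count only motions inside the window that starts at the utterance.
        if utterance_time <= event.timestamp <= window_end:
            counts[event.person_id] = counts.get(event.person_id, 0) + 1
    # A person is grouped once the count reaches the predetermined number.
    return {pid for pid, n in counts.items() if n >= required_count}
```

For example, with an utterance at time 0, a 5-second window and a required count of 2, a person who moves at 1.0 s and 2.0 s is grouped, while a person who moves only once at 1.5 s is not.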

A method of controlling an imaging device, comprising:
a person detection step of detecting a person from the image generated by the imaging means;
a person designation step of designating a main target person and a person other than the main target from the persons detected by the person detection step;
an estimation step of estimating whether the persons designated by the person designation step will start a conversation, based on a trained model for estimating whether the designated persons will start a conversation, using as input data an image including the designated persons and information on the designated persons.

A method for generating a trained model for estimating whether a specific person in an image and another person will start a conversation, comprising:
a person detection step of detecting a person from the recorded video data;
a selection step of selecting a specific person from among the persons detected by the person detection step;
a conversation scene determination step of determining a scene in which the specific person is having a conversation with another person from the moving image data;
a learning data generation step of using an image immediately before or after the conversation scene extracted by the conversation scene determination step as input data for learning a scene in which a conversation is predicted to start,
and using an image of a scene separated by a predetermined time from the conversation scene extracted by the conversation scene determination step as input data for learning a scene in which a conversation is predicted not to start.

A computer-readable program for causing a computer to execute the method according to any one of claims 13 to 15.