JP5029986B2 - Information processing apparatus and program

Info

Publication number: JP5029986B2
Application number: JP2007122883A
Authority: JP
Inventors: グンティダーロットウィブンチャイ; 真人戸上; 敦小池; 郁也荒井
Original assignee: NEC Casio Mobile Communications Ltd
Current assignee: NEC Casio Mobile Communications Ltd
Priority date: 2007-05-07
Filing date: 2007-05-07
Publication date: 2012-09-19
Anticipated expiration: 2027-05-07
Also published as: JP2008278433A

Description

The present invention relates to an information processing apparatus that processes sound and images, and to a program that causes a computer to function as such an information processing apparatus.

Sound source separation techniques exist that can separate and extract only a desired sound (hereinafter, the “target sound”) from sounds acquired with a plurality of microphones. One such technique sets in advance a specific area that includes the direction from which the target sound arrives, separates and extracts only the target sound arriving from that area, and removes sound arriving from outside it.

For example, Japanese Patent Laid-Open No. 2001-84713 discloses a video-camera-integrated sound source separation and sound collection microphone system that separates the sound generated by an object at the in-focus position from ambient noise and extracts it.

There is also a sound source localization technique that can identify the position (direction) of a sound source generating a target sound. For example, Non-Patent Document 1 discloses a robot that uses a plurality of microphones to determine the position of a sound source, identifies the face of the person at that position, and separates and extracts the target sound generated by that person.
Patent Document 1: JP 2001-84713 A
Non-Patent Document 1: Masato Togami, Akio Amano, Hiroshi Shinjo, Ryota Kamoshida, “Hearing Function of Human Symbiotic Robot EMIEW”, Japanese Society for Artificial Intelligence, pp. 59-64, 2005/10/14

The video-camera-integrated sound source separation and sound collection microphone system disclosed in Patent Document 1 cannot separate and extract the target sound when the sound source generating it moves away from the focus position of the video camera.

Likewise, the robot disclosed in Non-Patent Document 1 cannot separate and extract the target sound when the person generating it moves away from the position determined using the plurality of microphone elements.
That is, with the conventional techniques, once a sound source moves away from the position that was identified, the sound from that source can no longer be extracted.

The present invention has been made in view of the above problems, and an object thereof is to make it possible to separate and extract the sound from a sound source even when the sound source has left the area that specifies the direction of arrival of the sound to be extracted.
Another object of the present invention is to make it possible to separate and extract the sound from a sound source even when the sound source moves.

In order to solve the above problems, an information processing apparatus of the present invention comprises:
audio input means for inputting sound;
imaging means for capturing an image;
display means for displaying data;
display control means for causing the display means to display the captured image captured by the imaging means with a specific area superimposed on it, the specific area specifying the direction of arrival of the sound to be separated and extracted;
sound source localization means for identifying the direction of a sound source;
sound source presence/absence detection means for detecting whether the direction of the sound source identified by the sound source localization means corresponds to the specific area displayed by the display means;
area position changing means for changing the position of the specific area so that it coincides with the direction of the sound source when the sound source presence/absence detection means detects that the direction of the sound source does not correspond to the specific area;
area display control means for causing the display means to display the captured image captured by the imaging means with the specific area superimposed on it at the position changed by the area position changing means; and
sound source separation means for separating and extracting, from the sound input by the audio input means, the sound arriving from the direction specified by the specific area that the area display control means has caused the display means to display.
The display control means comprises:
in-image person presence/absence determination means for determining whether a person is included in the captured image;
person count determination means for determining, when the in-image person presence/absence determination means determines that a person is included in the captured image, whether the number of persons included in the captured image is plural;
initial area setting means for setting, when the person count determination means determines that the number is not plural, the position and size of the entire area of the captured image of the imaging means as the position and size of the specific area; and
means for causing the display means to display the captured image captured by the imaging means with the specific area set by the initial area setting means superimposed on it.

The present invention further provides a program for causing a computer to execute the main functions of the present invention.

According to the information processing apparatus of the present invention, when it is detected that no sound source generating the target sound is present in the currently set specific area, the position of the specific area is changed, and the sound from a sound source located in the changed specific area is separated and extracted. Thus, for example, even if a sound source leaves the current specific area, the position of the specific area can be changed so as to include that sound source, and the sound from it can be separated and extracted.

Hereinafter, an information processing apparatus according to an embodiment of the present invention will be described with reference to FIGS. 1 to 19. The following describes an example in which the information processing apparatus is applied to a mobile terminal 1 having a communication function.

The mobile terminal 1 is a mobile phone device with a videophone (TV phone) function. It is, for example, a folding type as shown in FIG. 1 and includes a keyboard 2, a camera 3, a display panel 4, a plurality of microphones 6, a notification LED 8, and the like.
The keyboard 2 is operated by the user to enter various data and instructions, for example, to start and end a videophone call or to change the size and position of the target sound range that specifies the voice to be extracted.

The camera 3 is a CCD (Charge Coupled Devices) camera, a CMOS camera, or the like, and captures images (still or moving), for example, images for a videophone call. The camera 3 also has a zoom-in/zoom-out function and can change its imaging range.
The display panel 4 comprises an LCD (liquid crystal display) panel, a driver circuit, and the like, and displays arbitrary images, for example, the image of the videophone speaker captured by the camera 3 and the image of the target sound range.

The call microphone 5 inputs call voice. Like the microphones 6, the call microphone 5 may also be used to identify the direction in which a sound source is located.
A plurality of microphones 6 are arranged, and each outputs an audio signal corresponding to the level of the sound it collects. Any number and arrangement of microphones 6 may be used, as long as the direction of a sound source can be identified from the phase differences of the sounds input to the microphones 6.
The speaker 7 outputs received voice and the like.
The notification LED 8 notifies the user of various information by lighting, blinking, and the like.

As shown in FIG. 2, the mobile terminal 1 has a circuit configuration including a control unit 11, a wireless communication unit 12, a storage unit 13, an operation unit 14, an imaging unit 15, a display unit 16, an audio input unit 17, an audio output unit 18, a notification unit 19, a bus 20, and the like.

The control unit 11 comprises a CPU (Central Processing Unit) and the like, and controls the entire mobile terminal 1 according to an operation program stored in the storage unit 13. The control unit 11 includes a sound source localization separation unit 111, an image processing unit 112, a detection unit 113, a notification control unit 114, a target sound range manual change unit 115, and a target sound range automatic change unit 116.

The sound source localization separation unit 111 identifies the direction in which a sound source is located from the phase differences of the sounds input to the plurality of microphones 6 of the audio input unit 17. It also separates and extracts only the sound arriving from the preset target sound range. Furthermore, it identifies each sound source in the sound (digital audio signal) input by the audio input unit 17, determines the sound level for each source, and identifies the sound with the highest level.
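
As a rough illustration of how a direction can be recovered from differences between microphone signals, the following Python sketch estimates the delay between two microphone signals by cross-correlation and converts it to a far-field arrival angle. It is not the implementation of the sound source localization separation unit 111: the two-microphone geometry, the speed-of-sound constant, and the function names `estimate_delay_samples` and `direction_from_delay` are all assumptions made for this example.

```python
import math

# Assumed constant: approximate speed of sound in air at room temperature.
SPEED_OF_SOUND_M_S = 343.0


def estimate_delay_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) of sig_b relative to sig_a that
    maximizes their cross-correlation. A positive lag means the sound
    reached microphone A first."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def direction_from_delay(delay_s, mic_spacing_m):
    """Far-field arrival angle in degrees for a two-microphone pair.
    0 degrees is perpendicular to the line joining the microphones."""
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp so noise cannot break asin
    return math.degrees(math.asin(ratio))
```

To use it, divide the lag returned by `estimate_delay_samples` by the sampling rate to obtain `delay_s`. A real device would more likely work per frequency band on phase differences, as the text describes; the time-domain version is used here only because it is short.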

The image processing unit 112 superimposes the target sound range on the captured image at an arbitrary position and size. It also identifies a person included in the captured image and the area that person occupies in the captured image. Furthermore, it generates an image that associates each sound source identified by the sound source localization separation unit 111 with the level of the sound arriving from that source.

The detection unit 113 acquires from the sound source localization separation unit 111 the level of the sound arriving from within the target sound range. If the level is equal to or higher than a predetermined level, it determines that the target sound source is within the target sound range; if the level is lower, it determines that it is not. The detection unit 113 also acquires the feature points determined by the image processing unit 112 and detects whether the target sound source is present in the target sound range. The notification control unit 114 controls the notification unit 19 or the audio output unit 18 to notify the user that no target sound source is present in the target sound range.

When the target sound range change mode information 131 (described later) is set to “manual”, the target sound range manual change unit 115 changes the position and size of the target sound range according to instructions issued from the operation unit 14 in response to user operations.

When the target sound range change mode information 131 (described later) is set to “automatic”, the target sound range automatic change unit 116 changes the target sound range so that it includes the target sound source, in response to the detection unit 113 detecting that no target sound source is present in the target sound range. If the captured image does not contain the target sound source at that time, the unit causes the imaging unit 15 to change its imaging range (for example, the angle of view) so that the target sound source is included.
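
The automatic change can be pictured with the following Python sketch, which shifts a rectangular range horizontally so that it is centered on the image column where the sound source was localized. The function name `recenter_range`, the tuple layout `(x1, y1, x2, y2)`, and the choice to keep the size and vertical position of the range unchanged are assumptions for illustration, not details taken from the patent.

```python
def recenter_range(rng, source_x, image_width):
    """Move a range (x1, y1, x2, y2) so it is centered on source_x.

    Returns None when source_x lies outside the captured image; in that
    case the caller would have to change the imaging range (for example,
    the angle of view) before a range can be placed."""
    if not 0 <= source_x < image_width:
        return None
    x1, y1, x2, y2 = rng
    width = x2 - x1
    new_x1 = source_x - width // 2
    # Keep the whole range inside the image.
    new_x1 = max(0, min(new_x1, image_width - 1 - width))
    return (new_x1, y1, new_x1 + width, y2)
```

For a 320-pixel-wide image, `recenter_range((10, 20, 50, 80), 200, 320)` gives `(180, 20, 220, 80)`, and a source column of 400 gives `None`.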

The wireless communication unit 12 transmits call voice, captured images, and the like to the other party via a base station. It also receives audio signals and image data from the other party via the base station.

The storage unit 13 stores programs for the control unit 11, audio data, video data, and the like. For example, it stores a control program that causes the control unit 11 to execute the processing described later with reference to FIGS. 8, 9, and 11 to 16, a videophone application program, and the like. The storage unit 13 may be either built-in memory or external memory.

The storage unit 13 also stores target sound range change mode information 131, target sound source presence/absence information 132, notification information 133, target sound range information 134, and target sound angle information 135.

As shown in FIG. 3, the target sound range change mode information 131 indicates whether the target sound range is changed manually or automatically.
As shown in FIG. 4, the target sound source presence/absence information 132 indicates whether the target sound source is present in the target sound range while a videophone call or the like is in progress.

As shown in FIG. 5, the notification information 133 includes the items notification mode, notification method, and notification operation.
“Notification mode” indicates whether to notify the user that the target sound source is not present in the target sound range. “Notification method” indicates the method used to notify the user (one of light-up, vibration, or audio output). “Notification operation” indicates whether a user notification is currently in progress (running or stopped).
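
The three pieces of stored state described so far (information 131, 132, and 133) could be modeled as small records, as in the Python sketch below. The class and field names are hypothetical. The defaults for presence (“present”), notification method (“light-up”), and notification mode (on) follow the initial values the text assigns in step S3; the defaults for the change mode and for the notification operation are assumptions, since the text leaves the former to the user and does not state the latter.

```python
from dataclasses import dataclass, field
from enum import Enum


class ChangeMode(Enum):
    """Target sound range change mode information 131."""
    MANUAL = "manual"
    AUTOMATIC = "automatic"


class NotifyMethod(Enum):
    """Notification method item of notification information 133."""
    LIGHT_UP = "light-up"
    VIBRATION = "vibration"
    AUDIO_OUTPUT = "audio output"


@dataclass
class NotificationInfo:
    mode_on: bool = True                          # notification mode
    method: NotifyMethod = NotifyMethod.LIGHT_UP  # notification method
    active: bool = False                          # notification operation (assumed default)


@dataclass
class TerminalState:
    change_mode: ChangeMode = ChangeMode.MANUAL   # assumed default; user-selected in the text
    target_source_present: bool = True            # presence/absence information 132
    notification: NotificationInfo = field(default_factory=NotificationInfo)
```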

As shown in FIG. 6, the target sound range information 134 defines a rectangular target sound range set on the captured image and consists of the coordinates (x, y) of its four vertices. In the example of FIG. 17(a), the target sound range is the area OS1 given by x1 ≤ x ≤ x2 and y1 ≤ y ≤ y2 on the captured image PT1, and the target sound range information 134 holds, as shown in FIG. 6, the coordinates of its four vertices: (x1, y2), (x1, y1), (x2, y2), and (x2, y1).
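
A minimal Python sketch of such a rectangle record follows. The class name `TargetSoundRange` and its methods are hypothetical; only the inequality x1 ≤ x ≤ x2, y1 ≤ y ≤ y2 and the order of the four vertices come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetSoundRange:
    """Axis-aligned rectangle on the captured image."""
    x1: int
    y1: int
    x2: int
    y2: int

    def __post_init__(self):
        if self.x1 > self.x2 or self.y1 > self.y2:
            raise ValueError("expected x1 <= x2 and y1 <= y2")

    def vertices(self):
        """Four vertices in the order listed in the text."""
        return [(self.x1, self.y2), (self.x1, self.y1),
                (self.x2, self.y2), (self.x2, self.y1)]

    def contains(self, x, y):
        """True if (x, y) satisfies x1 <= x <= x2 and y1 <= y <= y2."""
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2
```

Storing only two corners and deriving the four vertices on demand keeps the record from ever describing a non-rectangular shape.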

The target sound angle information 135 indicates the angle (direction) of the target sound source as seen from the mobile terminal 1. For example, as shown in FIG. 7, it gives the angles of the right and left edges of the target sound range relative to a predetermined reference (0 degree) direction.
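
The patent describes its own method of deriving these angles later in the text, and that method is not reproduced here. Purely as an illustration of the idea, the Python sketch below maps the left and right edges of a range to angles under a simple pinhole-camera assumption, taking the optical axis as the 0 degree reference. The function names and the `horizontal_fov_deg` parameter are hypothetical.

```python
import math


def pixel_to_angle_deg(x, image_width, horizontal_fov_deg):
    """Horizontal angle of image column x relative to the optical axis,
    assuming an ideal pinhole camera with the given field of view."""
    half_width = image_width / 2.0
    focal_px = half_width / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    return math.degrees(math.atan((x - half_width) / focal_px))


def range_to_angles(x1, x2, image_width, horizontal_fov_deg):
    """(left_angle, right_angle) in degrees for a range spanning x1..x2."""
    return (pixel_to_angle_deg(x1, image_width, horizontal_fov_deg),
            pixel_to_angle_deg(x2, image_width, horizontal_fov_deg))
```

Under these assumptions, a range covering the full width of a camera with a 60 degree field of view maps to roughly -30 and +30 degrees.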

The operation unit 14 shown in FIG. 2 includes the keyboard 2 and the like, accepts data and instructions, and passes them to the control unit 11.

The imaging unit 15 includes the camera 3, captures images (still or moving), and sends them to the control unit 11.

The display unit 16 comprises the display panel 4, a driver circuit, and the like, and displays images on the display panel 4 under the control of the control unit 11.

The audio input unit 17 includes the microphones 5 and 6 and supplies the audio signals input by each of the microphones 5 and 6 to the control unit 11.
The audio output unit 18 outputs sound from the speaker 7 under the control of the control unit 11.

The notification unit 19 comprises the notification LED 8, a vibration generating mechanism, and the like. Under the control of the control unit 11, it notifies the user of various information by lighting the notification LED 8 or by operating the vibration generating mechanism.
The bus 20 carries data between the units.

Next, the videophone operation of the mobile terminal 1 configured as above will be described with reference to FIGS. 8, 9, and 11 to 16. The communication operation itself of this mobile terminal is the same as that of an ordinary mobile phone, so the following describes the videophone operation that is characteristic of this embodiment.

When the user operates the operation unit 14 to instruct the start of a videophone call, the control unit 11 responds by starting the videophone application processing shown in FIG. 8. The control unit 11 first activates the imaging unit 15, acquires the image captured by the camera 3, and supplies it to the display unit 16 so that display on the display panel 4 begins (step S1).
The control unit 11 then executes the target sound range determination process (step S2).

The videophone function of the mobile terminal 1 includes a function that, when a plurality of persons (potential speakers) are in front of the mobile terminal 1, separates and extracts the voice (target sound) of a specific speaker (sound source) and transmits it to the other party. The target sound range determination process (step S2) sets, on the display panel 4, the target sound range, which is an image area for specifying the target sound (speaker, sound source) to be separated.

As shown in FIG. 9, in the target sound range determination process, the detection unit 113 requests the image processing unit 112 to determine whether any person (potential speaker) is in the captured image and, if so, whether there is more than one (step S21). The image processing unit 112 analyzes the captured image and, for example by pattern matching, determines the presence and number of image regions resembling a human face. It thereby determines whether a person is in the captured image and, if so, whether there is more than one, and notifies the detection unit 113 of the result.
Based on the notification from the image processing unit 112, the detection unit 113 determines whether the number of persons in the captured image is plural (two or more) (step S22).

If it is plural (step S22; Yes), the image processing unit 112 identifies the person with the largest face image among them. In a videophone call, the main speaker tends to be positioned in front of the camera 3 and to have the largest face image. To make the voice of this person the target sound, the image processing unit 112 therefore identifies the area this person occupies in the captured image and superimposes the target sound range on that area.
The control unit 11 causes the display panel 4 to display, via the display unit 16, the captured image with the target sound range superimposed (step S24).

As a concrete example, when the captured image is the captured image PT1 shown in FIG. 10(a), it is determined that images P1, P2, and P3 of a plurality of persons exist in PT1. The person P1 with the largest face image is identified, along with the position occupied by that face image, and the target sound range OS1 is displayed superimposed on the face image of the person P1, as shown in FIG. 10(b).

At this stage, the user can change (edit) the size and position of the target sound range OS1 by operating keys on the keyboard 2.
For example, the user can instruct that the target sound range OS1 on the person P1 be widened as shown in FIG. 10(c). The image processing unit 112 changes the size of the target sound range OS1 according to the instruction and displays it on the display panel 4 via the display unit 16. As another example, to set the voice of the person P2 as the target sound and transmit it selectively to the other party, the user operates the operation unit 14 to move the target sound range OS1 onto the area of the person P2.

When the user determines the target sound range, the display unit 16 may display, as shown in FIG. 10(b), a sound level display image VD1 that associates the position (direction) of each sound source (persons P1 to P3, etc.) on the captured image PT1 with the level of the sound arriving from the direction corresponding to that position.

In this case, the direction of arrival and the level of each sound are identified by the sound source localization separation unit 111. The image processing unit 112 determines a position on the captured image PT1 from the direction identified by the sound source localization separation unit 111 and generates the sound level display image VD1 by associating that position with the sound level. When step S24 ends, the process proceeds to step S25.

On the other hand, if the number of persons is not plural (one or fewer) (step S22; No), there is at most one videophone speaker, so there is no need to set the target sound range so as to distinguish this speaker's voice from the voices of other persons. The image processing unit 112 therefore sets the target sound range to the entire area of the captured image so that the speaker is unlikely to leave the target sound range, and the control unit 11 causes the display panel 4 to display, via the display unit 16, the captured image with the target sound range superimposed (step S23).

For example, when the captured image is the captured image PT2 shown in FIG. 10(d), there is one person in PT2, so the image processing unit 112 sets the size of the target sound range OS2 to the entire area of PT2 and displays it on the display panel 4 via the display unit 16, as shown in FIG. 10(e). In step S23 as well, a sound level display image VD2 substantially identical to the sound level display image VD1 may be generated and displayed on the display panel 4 via the display unit 16. When step S23 ends, the process proceeds to step S25.

When the user operates the operation unit 14 to instruct that the target sound range be fixed, the control unit 11 generates target sound range information 134 indicating the target sound range set at that time and stores it in the storage unit 13 as shown in FIG. 6 (step S25). This becomes the initial value of the target sound range.
The control unit 11 then obtains the target sound angle from the set target sound range and stores target sound angle information 135 indicating that angle in the storage unit 13 as shown in FIG. 7. The method of obtaining the target sound angle from the target sound range is described later. This completes the target sound range determination process (FIG. 8, step S2), and the process proceeds to step S3 in FIG. 8.
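
The branch at step S22 can be condensed into a few lines of Python. This is a sketch under simplifying assumptions: face regions are taken to be already detected and given as hypothetical `(x1, y1, x2, y2)` boxes, the largest face box itself is used as the initial range (the text uses the area the person occupies), and the whole image is expressed in inclusive pixel coordinates. The function name `initial_target_range` is not from the patent.

```python
def initial_target_range(face_boxes, image_width, image_height):
    """Choose the initial target sound range.

    Two or more faces: return the box of the largest face (step S24).
    One face or none: return the entire captured image (step S23)."""
    if len(face_boxes) >= 2:
        def area(box):
            x1, y1, x2, y2 = box
            return (x2 - x1) * (y2 - y1)
        return max(face_boxes, key=area)
    return (0, 0, image_width - 1, image_height - 1)
```

The user edits described for FIG. 10(c) would then be applied to the returned rectangle before it is stored in step S25.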

Next, the control unit 11 performs initial setting of the target sound source detection operation and the notification operation (step S3). Specifically, the control unit 11 sets the initial value of the target sound source presence/absence information 132 shown in FIG. 4 to “present” and the initial value of the notification method in the notification information 133 shown in FIG. 5 to “light-up”.

The control unit 11 also sets the operation mode (“manual” or “automatic”) selected through the operation unit 14 in response to a user operation in the target sound range change mode information 131 shown in FIG. 3. Regardless of the operation mode setting, the notification control unit 114 sets the initial value of the notification mode in the notification information 133 shown in FIG. 5 to on. This completes the initial setting (step S3), and the process proceeds to step S4.

In step S4, the sound source localization separation unit 111 separates and extracts, from the sound collected by the microphones 6, only the sound arriving from the target sound angle indicated by the target sound angle information 135, treating it as the target sound, that is, the voice of the videophone speaker. The control unit 11 transmits the extracted target sound data (the voice of the videophone speaker) together with the image captured by the camera 3 to the other party of the videophone call via the wireless communication unit 12.

The control unit 11 also receives audio data and image data from the other party of the videophone call via the wireless communication unit 12, provides the audio data to the audio output unit 18 for output from the speaker 7, and displays the image data on the display panel 4 via the display unit 16.

The control unit 11 determines whether an instruction to end the videophone call has been given (step S5). If it determines that the end has been instructed (step S5; Yes), the control unit 11 causes the camera 3 to end the imaging operation via the imaging unit 15 and the display panel 4 to end the display operation via the display unit 16 (step S7), and ends the videophone call. When movie shooting is being ended instead, the image data captured by the imaging unit 15 and the audio data separated and extracted by the sound source localization separation unit 111 since the start are stored in the storage unit 13.
If it determines that no end instruction has been given (step S5; No), the control unit 11 executes the target sound range change process (step S6) corresponding to the current operation mode (“manual” or “automatic”). The process of step S6 is executed periodically during the videophone call.

図１１に示すように、目的音範囲の変更処理では、先ず、テレビ電話の話者が設定されている目的音範囲から移動したか否かを判別するために、検出処理が実行される（ステップＳ６１）。検出処理において、検出部１１３は、図１２に示すように、現在設定されている目的音範囲で特定される音源から到来する音の音声レベルの判別を音源定位分離部１１１に要求する。音源定位分離部１１１は、複数のマイク６が入力する音声信号の相関と位相差から、音が到来する方向とその音の音声レベルを判別し、検出部１１３に通知する。 As shown in FIG. 11, in the target sound range changing process, a detection process is first executed to determine whether the videophone speaker has moved out of the set target sound range (step S61). In the detection process, as shown in FIG. 12, the detection unit 113 requests the sound source localization separation unit 111 to determine the sound level of the sound arriving from the sound source specified by the currently set target sound range. The sound source localization separation unit 111 determines the direction from which sound arrives and its sound level from the correlation and phase difference of the audio signals input by the plurality of microphones 6, and notifies the detection unit 113 of the result.

検出部１１３は、通知に基づいて目的音範囲に対応する方向からの音、即ち、分離対象である目的音が到来しているか否かを判別する（ステップＳ６１１）。検出部１１３は、目的音範囲に対応する方向からの音の音声レベルが所定レベルより大きければ目的音が到来していると判別し、小さければ到来していないと判別する。目的音が到来していると判別すると（ステップＳ６１１；Ｙｅｓ）、検出部１１３は目的音源有無情報１３２を「あり」に設定し（ステップＳ６１５）、処理は図１１にリターンし、ステップＳ６２に進む。 Based on the notification, the detection unit 113 determines whether or not the sound from the direction corresponding to the target sound range, that is, the target sound to be separated has arrived (step S611). The detection unit 113 determines that the target sound has arrived if the sound level of the sound from the direction corresponding to the target sound range is higher than a predetermined level, and determines that it has not arrived if the sound level is low. If it is determined that the target sound has arrived (step S611; Yes), the detection unit 113 sets the target sound source presence / absence information 132 to “present” (step S615), and the process returns to FIG. 11 and proceeds to step S62. .

検出部１１３は、目的音が到来していないと判別すると（ステップＳ６１１；Ｎｏ）、話者、即ち、目的音源が目的音範囲に存在しない（移動した）のか、又は、目的音源は目的音範囲に存在するが音を発生していないのか、を撮像画像上の目的音範囲に目的音源の画像があるか否かに基づいて判別する。そのため、処理はステップＳ６１２に進む。 When the detection unit 113 determines that the target sound has not arrived (step S611; No), it determines whether the speaker, that is, the target sound source, is no longer in the target sound range (has moved) or is still in the target sound range but is not producing sound, based on whether an image of the target sound source appears in the target sound range on the captured image. The process therefore proceeds to step S612.

本実施形態では、音源は、人物であり、目的音範囲に人物の顔の画像があるか否かで、音源が存在するか否かを判別する。
まず、検出部１１３は、目的音範囲内の顔の有無を判別するよう、画像処理部１１２に要求する（ステップＳ６１２）。このとき、検出部１１３は、画像処理部１１２に判別する特徴点（顔（目の動き））を指示し、画像処理部１１２からの特徴点判別結果（目的音範囲に顔（目の動き）があるか否か）の通知を待つ。 In the present embodiment, the sound source is a person, and whether or not a sound source exists is determined based on whether or not there is an image of a person's face in the target sound range.
First, the detection unit 113 requests the image processing unit 112 to determine the presence or absence of a face within the target sound range (step S612). At this time, the detection unit 113 instructs the image processing unit 112 which feature point to look for (a face, identified by eye movement), and waits for the image processing unit 112 to report the result, that is, whether a face (eye movement) is present in the target sound range.

ここで、目の動きの有無を判別するのは、撮像画像に人物写真が含まれている場合、画像処理部１１２が、目的音源の人物ではなく人物写真を検出するおそれがあるからである。画像処理部１１２は、特徴点として検出された目の候補に動きがあるか否かを検出し、動きがあることを検出した場合に目を検出したものと判別する。
なお、画像処理部１１２は、例えば、二値化した撮像画像上で黒の画素群をラベリングして、右目及び左目を構成する候補となる黒の画素領域を特定し、特定した画素領域の重心の移動する態様から人間の目の動きがあると判別する。 Here, the presence / absence of eye movement is determined because there is a possibility that the image processing unit 112 may detect a person photograph instead of the person of the target sound source when the photographed image includes a person photograph. The image processing unit 112 detects whether or not the eye candidate detected as the feature point has a motion, and determines that the eye has been detected when the motion is detected.
Note that the image processing unit 112, for example, labels groups of black pixels on a binarized captured image, identifies the black pixel regions that are candidates for the right eye and the left eye, and determines from the way the centroid of each identified region moves whether there is human eye movement.
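The centroid-based motion check described above can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the function names `centroid` and `has_motion`, the representation of a labeled region as a list of (x, y) pixel coordinates, and the fixed displacement threshold `min_shift` are all assumptions. The source only says that movement is judged from the way the centroid moves, without giving a criterion.

```python
from math import hypot


def centroid(pixels):
    """Centroid (x, y) of a non-empty list of (x, y) pixel coordinates."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)


def has_motion(regions_per_frame, min_shift=1.0):
    """True if the region centroid shifts by at least min_shift pixels
    between any two consecutive frames (hypothetical criterion)."""
    points = [centroid(region) for region in regions_per_frame]
    return any(
        hypot(b[0] - a[0], b[1] - a[1]) >= min_shift
        for a, b in zip(points, points[1:])
    )
```

A photograph of a person would yield an eye candidate whose centroid stays fixed between frames, so `has_motion` would return False for it under these assumptions.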

人物の顔（目の動き）があれば（ステップＳ６１２；Ｙｅｓ）、検出部１１３は、目的音源有無情報１３２を「あり」に設定し（ステップＳ６１５）、処理は図１１にリターンし、ステップＳ６２に進む。顔（目の動き）がなければ（ステップＳ６１２；Ｎｏ）、検出部１１３は、画像処理部１１２に目的音範囲内の体（体の動き）の有無の判別を要求する（ステップＳ６１３）。 If there is a human face (eye movement) (step S612; Yes), the detection unit 113 sets the target sound source presence / absence information 132 to “present” (step S615), and the process returns to FIG. 11 and proceeds to step S62. If there is no face (eye movement) (step S612; No), the detection unit 113 requests the image processing unit 112 to determine whether there is a body (body movement) within the target sound range (step S613).

ここで、体の動きを判別するのは、例えば、撮像部１５が撮影している場面が、図１０（ｅ）に示す、人物Ｐ１の背景にボードＢＤが配置された場面で、人物Ｐ１がこのボードＢＤに向かって議事録を書いている状態のように、人物の顔（目）が撮像部１５を向いておらず、目を検出できない場合などもあり得るためである。判別の要求後、検出部１１３は、要求に応答して画像処理部１１２から判別結果が通知されるまで待機する。
画像処理部１１２は、例えば、ソーベルフィルタ処理を施した撮像画像から人物の体の輪郭の候補となるエッジラインを抽出し、そのエッジラインの重心位置を求め、求めた重心位置の移動する態様から人物の体の動きがあると判別する。 Here, the movement of the body is determined, for example, when the scene captured by the imaging unit 15 is a scene in which the board BD is arranged in the background of the person P1 shown in FIG. This is because there may be a case where the person's face (eyes) is not facing the imaging unit 15 and the eyes cannot be detected as in the state of writing the minutes toward the board BD. After the request for determination, the detection unit 113 waits until the determination result is notified from the image processing unit 112 in response to the request.
The image processing unit 112, for example, extracts edge lines that are candidates for the contour of a human body from a captured image that has been subjected to Sobel filter processing, obtains the centroid position of those edge lines, and determines from the way the centroid position moves whether there is movement of a person's body.

画像処理部１１２にて体（体の動き）があると通知すると（ステップＳ６１３；Ｙｅｓ）、検出部１１３は、目的音源有無情報１３２を「あり」に設定し（ステップＳ６１５）、ないと通知すると（ステップＳ６１３；Ｎｏ）、「なし」に設定する（ステップＳ６１４）。以上で検出処理（図１１，ステップＳ６１）が終了して、処理はステップＳ６２に進む。 When the image processing unit 112 reports that there is a body (body movement) (step S613; Yes), the detection unit 113 sets the target sound source presence / absence information 132 to “present” (step S615); when it reports that there is none (step S613; No), the detection unit 113 sets the information to “absent” (step S614). This ends the detection process (FIG. 11, step S61), and the process proceeds to step S62.
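The three-stage check of steps S611 to S615 can be summarized in a short sketch. The function name `detect_target_source` and its parameters are hypothetical; the two image checks are passed in as callables so that, as in the flow above, the image processing unit is consulted only when the sound-level test fails.

```python
from typing import Callable


def detect_target_source(
    level_in_range: float,
    level_threshold: float,
    eyes_moving: Callable[[], bool],
    body_moving: Callable[[], bool],
) -> bool:
    """Return True ("present") or False ("absent") for the target sound source."""
    # Step S611: is sound arriving from the target sound range above the threshold?
    if level_in_range > level_threshold:
        return True  # step S615
    # Step S612: is a face (eye movement) visible inside the target sound range?
    if eyes_moving():
        return True  # step S615
    # Step S613: is a body (body movement) visible inside the target sound range?
    if body_moving():
        return True  # step S615
    return False  # step S614
```

The order matters: a silent speaker who is still in frame is reported as present, which is the case the image checks exist to cover.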

通知制御部１１４は、目的音源有無情報１３２を参照して、検出部１１３が現在設定されている目的音範囲に目的音源があることを検出したか否かを判別する（ステップＳ６２）。目的音源が目的音範囲になければ（ステップＳ６２；Ｎｏ）、通知制御部１１４は、その旨をユーザに通知するか否かを判別するため、図５に示す通知情報１３３に含まれる通知モードの設定が「オン」であるか否かを判別する（ステップＳ６３）。オンでなければ（ステップＳ６３；Ｎｏ）、処理はステップＳ６５に進む。オンならば（ステップＳ６３；Ｙｅｓ）、通知制御部１１４は、図５に示す通知情報１３３に設定されている通知方法に従ってユーザ通知を行うために、通知処理を実行する（ステップＳ６４）。 The notification control unit 114 refers to the target sound source presence / absence information 132 and determines whether or not the detection unit 113 has detected that the target sound source is within the currently set target sound range (step S62). If the target sound source is not in the target sound range (step S62; No), the notification control unit 114 determines whether or not to notify the user to that effect in the notification mode included in the notification information 133 shown in FIG. It is determined whether or not the setting is “ON” (step S63). If it is not on (step S63; No), the process proceeds to step S65. If it is ON (step S63; Yes), the notification control unit 114 executes notification processing in order to perform user notification according to the notification method set in the notification information 133 shown in FIG. 5 (step S64).

図１３に示すように、通知処理では、通知制御部１１４は、図５に示す通知情報１３３の通知動作の設定が「通知中」であるか否かを判別する（ステップＳ６４１）。通知中なら（ステップＳ６４１；Ｙｅｓ）、改めてユーザ通知をする必要が無いため、制御部１１は通知処理を終了し、処理は図１１にリターンし、ステップＳ６５に進む。 As illustrated in FIG. 13, in the notification process, the notification control unit 114 determines whether the notification operation setting of the notification information 133 illustrated in FIG. 5 is “notifying” (step S641). If notification is in progress (step S641; Yes), there is no need to notify the user again, so the control unit 11 ends the notification process, the process returns to FIG. 11, and the process proceeds to step S65.

通知中でなければ（ステップＳ６４１；Ｎｏ）、通知制御部１１４は、通知動作を「通知中」に設定して（ステップＳ６４２）、図５に示す通知情報１３３の通知方法が「ライトアップ」であるか否かを判別する（ステップＳ６４３）。ライトアップなら（ステップＳ６４３；Ｙｅｓ）、通知制御部１１４は、通知部１９にＬＥＤを点灯させ（ステップＳ６４４）、なければ（ステップＳ６４３；Ｎｏ）、通知方法が「バイブレーション」であるか否かを判別する（ステップＳ６４５）。 If notification is not in progress (step S641; No), the notification control unit 114 sets the notification operation to “notifying” (step S642) and determines whether the notification method of the notification information 133 shown in FIG. 5 is “light-up” (step S643). If it is light-up (step S643; Yes), the notification control unit 114 causes the notification unit 19 to turn on the LED (step S644); if not (step S643; No), it determines whether the notification method is “vibration” (step S645).

バイブレーションなら（ステップＳ６４５；Ｙｅｓ）、通知制御部１１４は、通知部１９にバイブレーション動作を実行させ（ステップＳ６４６）、なければ（ステップＳ６４５；Ｎｏ）、通知方法が「音声出力」であるか否かを判別する（ステップＳ６４７）。音声出力ならば（ステップＳ６４７；Ｙｅｓ）、通知制御部１１４は、音声出力部１８にメロディ音を出力させる（ステップＳ６４８）。以上で通知処理（図１１，ステップＳ６４）が終了して、処理は図１１のステップＳ６５に進む。 If it is vibration (step S645; Yes), the notification control unit 114 causes the notification unit 19 to perform a vibration operation (step S646); if not (step S645; No), it determines whether the notification method is “voice output” (step S647). If it is voice output (step S647; Yes), the notification control unit 114 causes the audio output unit 18 to output a melody sound (step S648). This ends the notification process (FIG. 11, step S64), and the process proceeds to step S65 in FIG. 11.
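The notification process of steps S641 to S648 amounts to a guarded dispatch on the configured notification method. The sketch below assumes a plain dictionary for the notification information 133 and a dictionary of callables standing in for the notification unit 19 and audio output unit 18; the keys `"state"` and `"method"` and the function name `start_notification` are illustrative, not taken from the source.

```python
def start_notification(info, actions):
    """Start a user notification unless one is already running.

    info:    dict with "state" ("notifying" or "stopped") and "method"
    actions: dict mapping a method name to a zero-argument callable
    Returns True if a new notification was started.
    """
    # Step S641: a notification is already in progress, nothing to do.
    if info["state"] == "notifying":
        return False
    # Step S642: mark the notification as running.
    info["state"] = "notifying"
    # Steps S643 / S645 / S647: choose the action for the configured method.
    action = actions.get(info["method"])
    if action is not None:
        action()  # steps S644 / S646 / S648
    return True
```

A table lookup replaces the chain of three comparisons in the flowchart; the behavior is the same as long as exactly one method is configured.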

制御部１１は、目的音範囲変更モード情報１３１を参照して、動作モードが「手動」に設定されているか否かを判別する（ステップＳ６５）。動作モードが「手動」であれば（ステップＳ６５；Ｙｅｓ）、目的音範囲内に目的音源が存在しないことをユーザに通知して、ユーザにより指示された目的音範囲を目的音範囲として新たに設定するために、目的音範囲手動変更部１１５は目的音範囲手動変更処理を実行する（ステップＳ６６）。 The control unit 11 refers to the target sound range change mode information 131 and determines whether or not the operation mode is set to “manual” (step S65). If the operation mode is “manual” (step S65; Yes), the user is notified that the target sound source does not exist within the target sound range, and the target sound range designated by the user is newly set as the target sound range. In order to do this, the target sound range manual changing unit 115 executes a target sound range manual changing process (step S66).

図１４に示すように、目的音範囲手動変更処理では、目的音範囲手動変更部１１５は、ユーザが操作部１４を操作して、現在設定されている目的音範囲を変更するよう指示したか否かを判別する（ステップＳ６６１）。目的音範囲手動変更部１１５は、変更を指示していないと判別すると（ステップＳ６６１；Ｎｏ）、処理は図８にリターンし、ステップＳ４に戻る。 As illustrated in FIG. 14, in the target sound range manual change process, the target sound range manual change unit 115 determines whether the user has operated the operation unit 14 to instruct a change of the currently set target sound range (step S661). When the target sound range manual changing unit 115 determines that no change has been instructed (step S661; No), the process returns to FIG. 8 and goes back to step S4.

ユーザが、操作部１４を操作して現在設定されている目的音範囲の位置及びサイズの変更を指示すると（ステップＳ６６１；Ｙｅｓ）、目的音範囲手動変更部１１５は、その指示に従って、変更された目的音範囲を目的音範囲情報１３４に設定する（ステップＳ６６２）。 When the user operates the operation unit 14 to instruct a change in the position and size of the currently set target sound range (step S661; Yes), the target sound range manual change unit 115 sets the changed target sound range in the target sound range information 134 according to the instruction (step S662).

続いて、目的音範囲手動変更部１１５は、後述する数式１乃至４に基づいて、その目的音範囲から目的音角度を求め（ステップＳ６６３）、求めた目的音角度を示す目的音角度情報１３５を生成し、記憶部１３に記憶させる。 Subsequently, the target sound range manual changing unit 115 obtains the target sound angle from the target sound range based on Equations 1 to 4 described later (step S663), generates target sound angle information 135 indicating the obtained target sound angle, and stores it in the storage unit 13.

なお、ステップＳ６６３において、目的音範囲から目的音角度を求める方法について説明する。具体例として、目的音範囲から目的音角度を求めるときの撮像画像を、図１７（ａ）に示す撮像画像ＰＴ１、即ち、図１７（ｂ）に示す撮像部１５の画角θ３の中心を所定の基準（０度）の９０度方向に向けて撮像した画像、を例にとり説明する。 A method for obtaining the target sound angle from the target sound range in step S663 will be described. As a specific example, the captured image when the target sound angle is obtained from the target sound range is set to the captured image PT1 shown in FIG. 17A, that is, the center of the angle of view θ3 of the imaging unit 15 shown in FIG. An image taken in the 90 degree direction of the reference (0 degree) will be described as an example.

水平方向の幅ｘ１，ｘ２及び撮像画像ＰＴ１の横幅ｘ３は、各幅の一方の端を撮像画像ＰＴ１上のｘ座標の原点（０）に対応させたときに他端が対応するｘ座標値から求まる。また、撮像部１５の画角θ３は、撮像画像ＰＴ１の横幅ｘ３に対応する角度であり、撮像部１５のズーム機能により予め定められた値である。これらを以下の数式１、２に代入し、図１７（ｂ）に示すθ１、θ２を求められる。
（数１） θ１＝（ｘ１／ｘ３）×θ３
（数２） θ２＝（ｘ２／ｘ３）×θ３ The horizontal widths x1 and x2 and the horizontal width x3 of the captured image PT1 are each obtained from the x coordinate value of one end of the width when its other end is placed at the origin (0) of the x coordinate on the captured image PT1. The angle of view θ3 of the imaging unit 15 is the angle corresponding to the horizontal width x3 of the captured image PT1, and is a value determined in advance by the zoom function of the imaging unit 15. Substituting these into Equations 1 and 2 below gives θ1 and θ2 shown in FIG. 17B.
(Equation 1) θ1 = (x1 / x3) × θ3
(Equation 2) θ2 = (x2 / x3) × θ3

また、図１７（ｂ）に示す目的音角度ｍ度、ｎ度はそれぞれ以下の数式３、４で表すことができる。ここで、数式１から求めたθ１と画角θ３とを数式３に代入して、目的音範囲の左端に対応する目的音角度ｍ度が求まる。また、数式２から求めたθ２と画角θ３とを数式４に代入して、目的音範囲の右端に対応する目的音角度ｎ度が求まる。
（数３）ｍ＝９０＋（θ３／２）−θ１
（数４）ｎ＝９０＋（θ３／２）−θ２ Moreover, the target sound angles m degrees and n degrees shown in FIG. 17B can be expressed by the following formulas 3 and 4, respectively. Here, the target sound angle m degrees corresponding to the left end of the target sound range is obtained by substituting θ1 and the angle of view θ3 obtained from Expression 1 into Expression 3. Further, by substituting θ2 and the angle of view θ3 obtained from Equation 2 into Equation 4, the target sound angle n degrees corresponding to the right end of the target sound range is obtained.
(Equation 3) m = 90 + (θ3 / 2) −θ1
(Equation 4) n = 90 + (θ3 / 2) −θ2
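Equations 1 to 4 can be combined into a single conversion from a target sound range on the image to the pair of target sound angles. The sketch below is a direct transcription of the four equations; the function name `range_to_angles` and the parameter `center_deg` (the 90-degree direction in which the center of the angle of view points in the FIG. 17 example) are assumptions made for illustration.

```python
def range_to_angles(x1, x2, x3, theta3, center_deg=90.0):
    """Convert a target sound range on the image to target sound angles.

    x1, x2: horizontal widths from the image's x origin to the two ends
            of the target sound range (pixels)
    x3:     horizontal width of the captured image (pixels)
    theta3: angle of view of the imaging unit (degrees)
    Returns (m, n) in degrees.
    """
    theta1 = (x1 / x3) * theta3            # Equation 1
    theta2 = (x2 / x3) * theta3            # Equation 2
    m = center_deg + theta3 / 2 - theta1   # Equation 3
    n = center_deg + theta3 / 2 - theta2   # Equation 4
    return m, n
```

For example, with a 640-pixel-wide image, a 60-degree angle of view, x1 = 160 and x2 = 480, the equations give θ1 = 15 and θ2 = 45, so m = 105 degrees and n = 75 degrees: a range centered on the 90-degree direction and 30 degrees wide.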

なお、数式１乃至４は、目的音範囲自動変更処理（図１１，ステップＳ６７）において目的音角度を求める場合でも同様に適用可能である。以上で目的音範囲手動変更処理（図１１，ステップＳ６６）が終了して、処理は図８にリターンし、ステップＳ４に戻る。 Equations 1 to 4 can be similarly applied even when the target sound angle is obtained in the target sound range automatic changing process (FIG. 11, step S67). Thus, the target sound range manual change processing (FIG. 11, step S66) is completed, the processing returns to FIG. 8, and returns to step S4.

一方、ステップＳ６５で判別した現在の動作モードが「手動」でなければ（図１１，ステップＳ６５；Ｎｏ）、目的音範囲自動変更部１１６は、検出部１１３が目的音範囲内に目的音源が存在しないことを検出したことに応答し、目的音源を含むよう目的音範囲を自動的に変更するために、目的音範囲自動変更処理を実行する（ステップＳ６７）。 On the other hand, when the current operation mode determined in step S65 is not “manual” (FIG. 11, step S65; No), the target sound range automatic changing unit 116 responds to the detection unit 113 having detected that no target sound source exists within the target sound range by executing a target sound range automatic change process, which automatically changes the target sound range so that it includes the target sound source (step S67).

テレビ電話では、携帯端末１の周囲からマイク６に到来する音の内、主な話者の声が最も大きくなる傾向がある。そのため、図１５に示すように、目的音範囲自動変更処理では、先ず、目的音範囲自動変更部１１６は、複数のマイク６が収集した音の内で音声レベルが最も大きな音の到来する方向を探索する（ステップＳ６７１）。 In videophones, the voices of the main speakers tend to be loudest among the sounds arriving at the microphone 6 from around the mobile terminal 1. Therefore, as shown in FIG. 15, in the target sound range automatic change processing, first, the target sound range automatic change unit 116 determines the direction of arrival of the sound having the highest sound level among the sounds collected by the plurality of microphones 6. Search is performed (step S671).

このとき、目的音範囲自動変更部１１６は、その方向の判別及びその方向から到来する音の音声レベルの判別を、音源定位分離部１１１に要求する。音源定位分離部１１１は、要求に応じて、判別した方向と音声レベルとを目的音範囲自動変更部１１６に通知する。
具体例で説明すると、図１８（ａ）において、人物Ｐ１がマイク６にて収集した音の内で音声レベルが最も大きな音を発生している場合、音源定位分離部１１１は、判別結果として、図１８（ｂ）に示す角度ｐ度を示す情報と人物Ｐ１の発生する音の音声レベルとを、目的音範囲自動変更部１１６に通知する。 At this time, the target sound range automatic changing unit 116 requests the sound source localization separating unit 111 to determine the direction and the sound level of the sound coming from the direction. The sound source localization separation unit 111 notifies the target sound range automatic change unit 116 of the determined direction and sound level in response to a request.
To give a specific example, when the person P1 in FIG. 18A is producing the loudest of the sounds collected by the microphone 6, the sound source localization separation unit 111 notifies the target sound range automatic changing unit 116 of the determination result: information indicating the angle of p degrees shown in FIG. 18B and the sound level of the sound produced by the person P1.

目的音範囲自動変更部１１６は、音源定位分離部１１１から通知された音声レベルが所定レベル以上か否かに基づいて、目的音源の有無を判別する（ステップＳ６７２）。この所定レベルは、マイク６が収集した音声レベルが最も大きな音がテレビ電話の話者の声か否かを判別するための基準であり、例えば、人間が会話するときの平均的な音声レベルに設定したものである。 The target sound range automatic changing unit 116 determines the presence or absence of the target sound source based on whether or not the sound level notified from the sound source localization separating unit 111 is equal to or higher than a predetermined level (step S672). This predetermined level is a reference for determining whether or not the sound with the highest sound level collected by the microphone 6 is the voice of the speaker of the videophone. For example, the predetermined level is an average sound level when a person talks. It is set.

所定レベルよりも小さければ、目的音範囲自動変更部１１６は、目的音源がないと判別し（ステップＳ６７２；Ｎｏ）、所定レベル以上ならあると判別する（ステップＳ６７２；Ｙｅｓ）。目的音源がないと判別すると、処理は図８にリターンし、ステップＳ４に進む。あると判別すると、目的音範囲自動変更部１１６は、音源定位分離部１１１から取得した音声レベルが最も大きな音の到来する角度（方向）に基づいて、目的音角度を新たに決定する（ステップＳ６７３）。 If the sound level is lower than the predetermined level, the target sound range automatic changing unit 116 determines that there is no target sound source (step S672; No); if it is at or above the predetermined level, the unit determines that there is one (step S672; Yes). If it is determined that there is no target sound source, the process returns to FIG. 8 and proceeds to step S4. If it is determined that there is one, the target sound range automatic changing unit 116 newly determines the target sound angle based on the angle (direction) of arrival of the loudest sound, as acquired from the sound source localization separating unit 111 (step S673).

具体例で説明すると、目的音範囲自動変更部１１６は、音源定位分離部１１１が通知した方向（図１８（ｂ）及び図１９（ｂ）に示すｐ度）から円周方向にそれぞれ所定角度Ａｐ度ずらした（ｐ＋Ａｐ）度と（ｐ−Ａｐ）度とを新たな目的音角度に定め（図１９（ｂ））、その目的音角度を示す目的音角度情報１３５を生成し、記憶部１３に記憶させる。 More specifically, the target sound range automatic changing unit 116 takes the direction notified by the sound source localization separation unit 111 (p degrees shown in FIGS. 18B and 19B), shifts it by a predetermined angle of Ap degrees in each circumferential direction, and sets the resulting (p + Ap) degrees and (p − Ap) degrees as the new target sound angles (FIG. 19B). It then generates target sound angle information 135 indicating these target sound angles and stores it in the storage unit 13.
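Steps S671 to S673 can be sketched as one function: pick the loudest arriving sound, compare it with the predetermined level, and widen its direction by Ap degrees on each side. The function name `auto_target_angles` and the representation of the localization result as a list of (angle, level) pairs are assumptions; the source does not specify how the sound source localization separation unit 111 reports its results.

```python
def auto_target_angles(sources, min_level, ap_deg):
    """Choose new target sound angles from localized sounds.

    sources:   list of (angle_deg, level) pairs, one per localized sound
    min_level: predetermined level a sound must reach to count as speech
    ap_deg:    predetermined angle Ap added on each side of the direction
    Returns (p + Ap, p - Ap) in degrees, or None if no sound qualifies.
    """
    if not sources:
        return None
    # Step S671: find the direction of the loudest sound.
    p, level = max(sources, key=lambda source: source[1])
    # Step S672: below the predetermined level means no target sound source.
    if level < min_level:
        return None
    # Step S673: new target sound angles around the direction p.
    return (p + ap_deg, p - ap_deg)
```

Returning None corresponds to the "No" branch of step S672, where the current target sound range is left unchanged.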

次に、目的音範囲自動変更部１１６は、画像処理部１１２に新たに定めた目的音角度に対応するよう撮像画像上の目的音範囲の位置を変更させ（ステップＳ６７４）、更に、撮像画像と画像処理部１１２が位置を変更した目的音範囲とを重ねた画像を表示部１６に表示させる。 Next, the target sound range automatic changing unit 116 causes the image processing unit 112 to change the position of the target sound range on the captured image so as to correspond to the newly determined target sound angle (step S674). The image processing unit 112 causes the display unit 16 to display an image superimposed with the target sound range whose position has been changed.

ここで、新たに定めた目的音角度に対応する撮像画像上の目的音範囲の位置を求める方法について説明する。具体例で説明すると、目的音範囲自動変更部１１６は、以下に示す数式５に従って、図１９（ｂ）に示す新たに定めた目的音角度（ｐ＋Ａｐ）度から、図１９（ａ）に示す撮像画像ＰＴ１の幅ｘ４に対応する図１９（ｂ）に示す角度θ４を定める。続いて、目的音範囲自動変更部１１６は、撮像部１５の画角がθ４度となるよう撮像部１５に変更させる。
（数５） θ４＝（ｐ＋Ａｐ−９０）×２ Here, a method for obtaining the position of the target sound range on the captured image corresponding to the newly determined target sound angle will be described. Specifically, the target sound range automatic changing unit 116 performs imaging shown in FIG. 19A from the newly determined target sound angle (p + Ap) degree shown in FIG. An angle θ4 shown in FIG. 19B corresponding to the width x4 of the image PT1 is determined. Subsequently, the target sound range automatic changing unit 116 causes the imaging unit 15 to change the angle of view of the imaging unit 15 to θ4 degrees.
(Equation 5) θ4 = (p + Ap−90) × 2

また、目的音範囲の横幅ｘ２に対応する角度θ２は、以下の数式６に新たに定めた目的音角度（ｐ＋Ａｐ）度と（ｐ−Ａｐ）度とを代入して、求められる。目的音範囲ＯＳ１の水平位置ｘ２は、数式５から求まるθ４と数式６から求まるθ２と撮像画像ＰＴ１の横幅ｘ４とを、以下の数式７に代入して求められる。これにより、変更後の目的音角度に対応する目的音範囲を表示部１６に表示することができる。
（数６） θ２＝（ｐ＋Ａｐ）−（ｐ−Ａｐ）＝２Ａｐ
（数７）ｘ２＝（θ２／θ４）×ｘ４ Also, the angle θ2 corresponding to the horizontal width x2 of the target sound range is obtained by substituting newly defined target sound angles (p + Ap) degrees and (p−Ap) degrees into the following Equation 6. The horizontal position x2 of the target sound range OS1 is obtained by substituting θ4 obtained from Equation 5, θ2 obtained from Equation 6, and the lateral width x4 of the captured image PT1 into Equation 7 below. Thereby, the target sound range corresponding to the target sound angle after the change can be displayed on the display unit 16.
(Equation 6) θ2 = (p + Ap) − (p−Ap) = 2Ap
(Equation 7) x2 = (θ2 / θ4) × x4
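Equations 5 to 7 can be transcribed in the same way. The sketch below assumes, as in the FIG. 19 example, that the camera points in the 90-degree direction and that the new target sound angle (p + Ap) lies above 90 degrees, since Equation 5 otherwise yields a non-positive angle of view. The function name `angles_to_range` and the error raised for that case are illustrative additions.

```python
def angles_to_range(p, ap, x4, center_deg=90.0):
    """Derive the angle of view and on-image width for new target angles.

    p:  direction of the loudest sound (degrees)
    ap: predetermined angle Ap (degrees)
    x4: horizontal width of the captured image (pixels)
    Returns (theta4, theta2, x2).
    """
    theta4 = (p + ap - center_deg) * 2   # Equation 5: new angle of view
    if theta4 <= 0:
        # Outside the case covered by Equation 5 (assumed guard).
        raise ValueError("p + Ap must exceed the camera direction")
    theta2 = (p + ap) - (p - ap)         # Equation 6: equals 2 * Ap
    x2 = (theta2 / theta4) * x4          # Equation 7
    return theta4, theta2, x2
```

With p = 110, Ap = 10 and a 640-pixel-wide image, this gives θ4 = 60 degrees, θ2 = 20 degrees and x2 of roughly 213 pixels.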

なお、数式７は、画像処理部１１２が音声レベル表示画像を生成する際、撮像画像上の音源の位置（方向）を特定するためにも使用される。具体例で説明すると、画像処理部１１２が図１０（ｂ）に示す音声レベル表示画像ＶＤ１を生成する場合、撮像部１５のズーム機能から特定した撮像部１５の画角θ４と、図１０（ａ）に示す撮像画像ＰＴ１の横幅ｘ４と、数式６から求めるθ２と、を数式７に代入して、撮像画像において音源の方向に対応するｘ座標値ｘ２を求めればよい。以上で目的音範囲自動変更処理（図１１，ステップＳ６７）が終了して、処理は図８にリターンし、ステップＳ４に戻る。 Equation 7 is also used to specify the position (direction) of the sound source on the captured image when the image processing unit 112 generates the sound level display image. More specifically, when the image processing unit 112 generates the sound level display image VD1 shown in FIG. 10B, the angle of view θ4 of the image pickup unit 15 specified from the zoom function of the image pickup unit 15 and FIG. Substituting the horizontal width x4 of the captured image PT1 and θ2 obtained from Equation 6 into Equation 7 to obtain the x coordinate value x2 corresponding to the direction of the sound source in the captured image. Thus, the target sound range automatic changing process (FIG. 11, step S67) is completed, the process returns to FIG. 8, and returns to step S4.

一方、目的音範囲に目的音源があると判別すると（図１１，ステップＳ６２；Ｙｅｓ）、通知制御部１１４は、通知モードがオンか否かを判別する（ステップＳ６８）。オンでなければ（ステップＳ６８；Ｎｏ）、処理は図８のステップＳ４に戻る。オンならば（ステップＳ６８；Ｙｅｓ）、通知制御部１１４は、ユーザ通知が既に実行されているか否かを判別するため、通知情報１３３の通知動作が「通知中」を示しているか否かを判別する（ステップＳ６９）。通知中でなければ（ステップＳ６９；Ｎｏ）、処理は図８のステップＳ４に戻る。 On the other hand, when it is determined that the target sound source is in the target sound range (FIG. 11, step S62; Yes), the notification control unit 114 determines whether the notification mode is on (step S68). If it is not on (step S68; No), the process returns to step S4 in FIG. 8. If it is on (step S68; Yes), the notification control unit 114 determines whether the notification operation of the notification information 133 indicates “notifying”, in order to find out whether a user notification is already being executed (step S69). If notification is not in progress (step S69; No), the process returns to step S4 in FIG. 8.

通知中なら（ステップＳ６９；Ｙｅｓ）、通知制御部１１４は、通知解除処理（ステップＳ６１０）を実行する。図１６に示すように、通知解除処理では、通知制御部１１４は、通知情報１３３の通知動作を「停止中」に設定して（ステップＳ６１０１）、通知方法が「ライトアップ」であるか否かを判別する（ステップＳ６１０２）。通知制御部１１４は、ライトアップなら（ステップＳ６１０２；Ｙｅｓ）、通知部１９に通知用ＬＥＤを消灯させ（ステップＳ６１０３）、なければ（ステップＳ６１０２；Ｎｏ）、通知方法がバイブレーションか否かを判別する（ステップＳ６１０４）。 If notification is in progress (step S69; Yes), the notification control unit 114 executes a notification cancellation process (step S610). As shown in FIG. 16, in the notification cancellation process, the notification control unit 114 sets the notification operation of the notification information 133 to “stopped” (step S6101) and determines whether the notification method is “light-up” (step S6102). If it is light-up (step S6102; Yes), the notification control unit 114 causes the notification unit 19 to turn off the notification LED (step S6103); if not (step S6102; No), it determines whether the notification method is vibration (step S6104).

バイブレーションなら（ステップＳ６１０４；Ｙｅｓ）、通知制御部１１４は、通知部１９にバイブレーション動作を解除させ（ステップＳ６１０５）、なければ（ステップＳ６１０４；Ｎｏ）、通知方法が音声出力か否かを判別する（ステップＳ６１０６）。音声出力なら（ステップＳ６１０６；Ｙｅｓ）、通知制御部１１４は、音声出力部１８にメロディ音の出力を解除させる（ステップＳ６１０７）。以上で通知解除処理（図１１，ステップＳ６１０）が終了する。 If it is vibration (step S6104; Yes), the notification control unit 114 causes the notification unit 19 to stop the vibration operation (step S6105); if not (step S6104; No), it determines whether the notification method is audio output (step S6106). If it is audio output (step S6106; Yes), the notification control unit 114 causes the audio output unit 18 to stop the output of the melody sound (step S6107). This completes the notification cancellation process (FIG. 11, step S610).

以上により目的音範囲の変更処理（ステップＳ６）が終了し、処理は図８にリターンしてステップＳ４に進み、音源定位分離部１１１は現在設定されている目的音範囲から到来する音を分離抽出する。制御部１１は、抽出した目的音のデータをカメラ３の撮像画像と共に無線通信部１２を介して、テレビ電話の相手に送信する。
以後、テレビ電話アプリケーションが終了されるまで、携帯端末１はステップＳ４〜Ｓ６の処理を繰り返し実行する。 Thus, the target sound range changing process (step S6) is completed, and the process returns to FIG. 8 and proceeds to step S4, where the sound source localization separation unit 111 separates and extracts the sound arriving from the currently set target sound range. The control unit 11 transmits the extracted target sound data together with the captured image of the camera 3 to the other party of the videophone via the wireless communication unit 12.
Thereafter, the portable terminal 1 repeatedly executes the processes of steps S4 to S6 until the videophone application is terminated.
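The branching inside step S6 (steps S62 to S610 of FIG. 11) can be condensed into a small decision function, which makes it easier to check which sub-process runs in each situation. This is only a control-flow sketch: the function name `plan_range_change` and the returned labels are hypothetical, and the sub-processes themselves are not implemented here.

```python
def plan_range_change(source_present, notify_on, notifying, mode):
    """Return the sub-processes that one pass of step S6 would run.

    source_present: result of the detection process (step S61)
    notify_on:      notification mode setting is "on"
    notifying:      a user notification is currently in progress
    mode:           "manual" or "automatic"
    """
    steps = []
    if not source_present:                       # step S62; No
        if notify_on:                            # step S63; Yes
            steps.append("notify")               # step S64
        if mode == "manual":                     # step S65; Yes
            steps.append("manual_change")        # step S66
        else:                                    # step S65; No
            steps.append("auto_change")          # step S67
    elif notify_on and notifying:                # steps S68 and S69; Yes
        steps.append("cancel_notification")      # step S610
    return steps
```

An empty list means the pass returns straight to step S4 without changing the range or the notification state.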

以上説明したように、本実施形態によれば、音源定位分離部１１１により特定した目的音の到来方向が、分離抽出対象とする音の到来方向を特定する目的音範囲から外れた場合に、目的音の到来方向に対応するよう目的音範囲の位置を変更し、音源定位分離部１１１は、位置を変更した後の目的音範囲から到来する音を分離抽出する。
これにより、現在設定されている目的音範囲から目的音の到来方向から外れた場合でも、目的音範囲の位置を変更し、変更後の目的音範囲から目的音を分離抽出することができる。 As described above, according to the present embodiment, when the arrival direction of the target sound specified by the sound source localization separation unit 111 is outside the target sound range that specifies the arrival direction of the sound to be separated and extracted, The position of the target sound range is changed to correspond to the direction of arrival of the sound, and the sound source localization separation unit 111 separates and extracts the incoming sound from the target sound range after the position is changed.
As a result, even when the arrival direction of the target sound moves outside the currently set target sound range, the position of the target sound range can be changed, and the target sound can be separated and extracted from the changed target sound range.

本実施形態によれば、目的音範囲を変更するときの動作モードの設定が「自動」である場合、目的音範囲自動変更部１１６は、ユーザによる目的音範囲の変更操作を伴うことなく、音源定位分離部１１１が特定した目的音の到来方向に対応するよう目的音範囲の位置を変更する。音源定位分離部１１１は、変更後の目的音範囲から到来する音のみを分離抽出する。
これにより、目的音範囲に目的音源がない場合でも目的音の到来方向に対応するよう目的音範囲を自動的に変更して、変更した目的音範囲から目的音のみを分離抽出することができる。 According to the present embodiment, when the setting of the operation mode when changing the target sound range is “automatic”, the target sound range automatic changing unit 116 does not involve the user's operation of changing the target sound range, and the sound source The position of the target sound range is changed so as to correspond to the arrival direction of the target sound specified by the localization separation unit 111. The sound source localization separation unit 111 separates and extracts only sounds coming from the changed target sound range.
Thus, even when there is no target sound source in the target sound range, the target sound range can be automatically changed to correspond to the arrival direction of the target sound, and only the target sound can be separated and extracted from the changed target sound range.

また、本実施形態によれば、動作モードの設定が「自動」である場合、目的音範囲自動変更部１１６は、撮像画像に目的音源の画像が存在しなければ、音源定位分離部１１１が定位する目的音の到来方向（目的音を発生している音源の方向）が含まれるよう撮像部１５にカメラ３の画角を変更させる。
これにより、現在の目的音源の位置する方向が現在カメラ３の撮像している範囲内に対応しない場合でも、カメラ３の画角を変更し、目的音の到来方向に位置する目的音源が撮像画像に含まれるようにできる。 In addition, according to the present embodiment, when the operation mode is set to “automatic” and no image of the target sound source exists in the captured image, the target sound range automatic changing unit 116 causes the imaging unit 15 to change the angle of view of the camera 3 so that it includes the arrival direction of the target sound localized by the sound source localization separating unit 111 (the direction of the sound source generating the target sound).
Thus, even when the direction of the current target sound source lies outside the range currently captured by the camera 3, the angle of view of the camera 3 can be changed so that the target sound source located in the arrival direction of the target sound is included in the captured image.

更に、本実施形態によれば、動作モードの設定が「自動」である場合、目的音範囲自動変更部１１６は、目的音範囲に目的音源が存在しないときでも、マイク６が収集する音の音声レベルの内で最も大きなものが所定レベル以上であれば、その音声レベルを示す音が到来する方向を含むよう目的音範囲を変更し、音源定位分離部１１１は、変更した目的音範囲から到来する音のみを分離抽出する。
これにより、目的音範囲に目的音源がない場合でも、マイク６が収集する音の内で音声レベルが最も大きく所定レベル以上である音を目的音とみなして、その音が到来する方向に対応して目的音範囲を変更し、その音を分離抽出する。この場合、目的音の音声レベルがマイク６にて収集する他の音の音声レベルと比べて最大であれば、ユーザにとっての利便性が向上する。 Furthermore, according to the present embodiment, when the operation mode is set to “automatic”, even when no target sound source exists in the target sound range, the target sound range automatic changing unit 116 changes the target sound range to include the direction of arrival of the loudest sound collected by the microphone 6, provided that its sound level is at or above the predetermined level, and the sound source localization separation unit 111 separates and extracts only the sound arriving from the changed target sound range.
As a result, even when there is no target sound source in the target sound range, the loudest sound collected by the microphone 6, provided it is at or above the predetermined level, is regarded as the target sound; the target sound range is changed to match the direction from which that sound arrives, and the sound is separated and extracted. In this case, convenience for the user improves as long as the target sound is louder than the other sounds collected by the microphone 6.

本実施形態によれば、通知モードが「オン」に設定されている場合、通知制御部１１４は、検出部１１３が目的音範囲に目的音源がないことを検出したときに、その旨をユーザに通知する。
これにより、ユーザは、目的音範囲から目的音源が外れたことを把握できる。 According to the present embodiment, when the notification mode is set to “ON”, the notification control unit 114 notifies the user when the detection unit 113 detects that there is no target sound source in the target sound range.
Thereby, the user can grasp that the target sound source is out of the target sound range.

The notification method can be changed by the user; in the present embodiment, it can be selected from LED blinking, vibration, and audio output.
The user can thereby select a notification method suited to the current situation. For example, during a videophone conference, a method other than audio output can be selected out of consideration for the people nearby.

According to the present embodiment, when the operation mode for changing the target sound range is set to "manual", the target sound range manual changing unit 115 generates target sound range information 134 indicating the target sound range whose position or size the user has changed by operating the operation unit 14, and stores it in the storage unit 13.
Thus, for example, when the target sound source has left the target sound range, the user can change the position and size of the target sound range so that it includes the target sound source, and the target sound is then separated and extracted from the changed target sound range.

According to the present embodiment, the user can freely change (edit) the position and size of the target sound range by operating the operation unit 14; the image processing unit 112 generates an image of the target sound range according to the position and size set by the user, and the sound source localization separation unit 111 separates and extracts the sound arriving from the direction corresponding to that target sound range.
The user can thereby change the target sound range and adjust, according to preference, the target sound separated and extracted from the changed range.
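How a range drawn on the captured image might be turned into arrival angles can be sketched as below. The linear interpolation between the angles at the left and right edges of the image is an assumption made for illustration, and `pixel_to_angle` and `range_to_angles` are hypothetical names. The parameters loosely follow the symbols listed for the drawings (edge angles A1 and A2, target sound angles m and n).

```python
from typing import Tuple


def pixel_to_angle(x: float, image_width: float,
                   left_edge_deg: float, right_edge_deg: float) -> float:
    """Map a horizontal pixel position to an arrival angle.

    Assumption: the angle varies linearly from left_edge_deg at x = 0
    to right_edge_deg at x = image_width.
    """
    if image_width <= 0:
        raise ValueError("image_width must be positive")
    fraction = x / image_width
    return left_edge_deg + fraction * (right_edge_deg - left_edge_deg)


def range_to_angles(x_left: float, x_right: float, image_width: float,
                    left_edge_deg: float, right_edge_deg: float) -> Tuple[float, float]:
    """Return the angles (m, n) for the left and right ends of the drawn range."""
    m = pixel_to_angle(x_left, image_width, left_edge_deg, right_edge_deg)
    n = pixel_to_angle(x_right, image_width, left_edge_deg, right_edge_deg)
    return m, n
```

For example, on a 640-pixel-wide image whose edges correspond to -30 and +30 degrees, a range spanning pixels 160 to 480 maps to the angles -15 and +15 degrees under this linear assumption.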

At this time, the control unit 11 causes the display panel 4 to display, via the display unit 16, an image in which the target sound range is superimposed on the captured image.
The user can thereby visually confirm the target sound source within the target sound range. For example, the user can see at a glance that the target sound range is specifying the wrong target sound source, which improves operability.

Further, according to the present embodiment, when the user changes (edits) the size or position of the target sound range, the image processing unit 112 determines the position on the captured image PT1 that corresponds to each arrival direction, identified by the sound source localization separation unit 111, of the sound input through the plurality of microphones 6, associates that position with the sound level, and generates the sound level display image VD1. The control unit 11 causes the display panel 4 to display this image via the display unit 16.
The user can thereby see at a glance the direction on the captured image in which the sound level of the target sound peaks, and then change the target sound range.

According to the present embodiment, when the initial value of the target sound range is set and the image captured by the camera 3 contains a plurality of persons, the image processing unit 112 identifies the person with the largest face image among them, identifies the region that person occupies in the captured image, and superimposes the target sound range on that region.
Thus, for example, when the main speaker is positioned in front of the camera 3 during a videophone call, the target sound range is displayed so that the speaker's voice becomes the target sound, which makes it more convenient for the user to set the target sound range.

Further, according to the present embodiment, when the initial value of the target sound range is set and the image captured by the camera 3 contains one person or none, the image processing unit 112 superimposes the target sound range on the entire area of the captured image.
Thus, when there is no need to distinguish the voice of the person in the captured image from the voices of other persons, that person is less likely to leave the target sound range.
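The two rules for the initial target sound range (largest face when several persons are present, whole image otherwise) can be sketched as follows. The types `Rect` and `Person`, the function name, and the use of face bounding-box area as the measure of "largest face" are assumptions for illustration; the face and person detection itself is outside the sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in image pixels (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


@dataclass(frozen=True)
class Person:
    """A detected person: face box and the region the person occupies."""
    face: Rect
    region: Rect


def initial_target_range(image_width: int, image_height: int,
                         persons: Sequence[Person]) -> Rect:
    """Choose the initial target sound range on the captured image."""
    if len(persons) <= 1:
        # One person or none: cover the entire captured image.
        return Rect(0, 0, image_width, image_height)
    # Several persons: use the region of the one with the largest face.
    # Assumption: "largest" is measured by face bounding-box area.
    largest = max(persons, key=lambda p: p.face.area)
    return largest.region
```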

According to the present embodiment, the detection unit 113 acquires from the sound source localization separation unit 111 the level of the sound arriving from the target sound range set to its initial value, determines whether that level is equal to or higher than a predetermined level, and, if so, determines that a sound source generating the target sound is present in the target sound range.
This increases the likelihood that whatever is present in the initially set target sound range is the target sound source generating the target sound.

Further, according to the present embodiment, if the level of the sound arriving from the target sound range set to its initial value is lower than the predetermined level, the detection unit 113 determines whether an image of a person's face is present in the captured image within the target sound range and, if so, determines that the target sound source is present. Thus, when the target sound source is a speaker, it is possible to determine whether the speaker has moved out of the target sound range or is within it but not speaking.

According to the present embodiment, when determining whether a person's face image is present in the captured image within the target sound range, the image processing unit 112 makes the determination based on whether eye movement is present in the captured image within the target sound range.
Thus, even when the captured image contains both a speaker and a photograph of a person, the speaker's eye movement can be detected and the speaker distinguished from the photograph.

Furthermore, when the detection unit 113 determines that no face is present in the target sound range, it determines whether a person's body is present and, if so, determines that the target sound source is within the target sound range. Thus, even when the person's face is not turned toward the camera 3, it can be detected that the speaker who is the target sound source is within the target sound range.
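The order of checks performed by the detection unit (sound level first, then face with eye movement, then body) can be sketched as a short cascade. The function name and the use of callables for the two image checks are assumptions made for illustration; the image analysis itself is not shown.

```python
from typing import Callable


def target_source_present(level: float,
                          threshold: float,
                          face_with_eye_movement: Callable[[], bool],
                          body_present: Callable[[], bool]) -> bool:
    """Decide whether a target sound source is in the target sound range.

    The image checks are passed as callables so that each one runs only
    when the earlier checks have failed, mirroring the described order.
    """
    if level >= threshold:
        # Sound at or above the predetermined level arrives from the range.
        return True
    if face_with_eye_movement():
        # A silent but live face: the speaker is in range and not speaking.
        return True
    # The face may be turned away from the camera: fall back to the body.
    return body_present()
```

Passing a function that always returns False for the body check corresponds to the variation in which the body detection step is omitted.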

Modifications and applications of the present invention are described below.
The shape of the target sound range is not limited to a rectangle and may, for example, follow the contour of a person's image in the captured image. In this case, the image processing unit 112 may analyze the captured image, identify, for example by pattern matching, a region resembling the overall shape of a person's face and body, and set that region as the target sound range.

To make the target sound source less likely to leave the target sound range, the range should be made as large as possible within the limit at which the target sound source can still be distinguished from other sound sources. For this purpose, a message prompting the user to set the target sound range as large as possible may be displayed on the display unit 16.

The detection process (FIG. 11, step S61) is not limited to detecting the body (body movement) after detecting the face (eye movement). For example, when the captured image contains a plurality of persons and detecting a body would still make it difficult to identify the person who is the target sound source, the body (body movement) detection step may be omitted. Also, for example, when the face of the person who is the target sound source is not turned toward the imaging unit 15, the process may proceed to body (body movement) detection without determining from face (eye movement) detection that the target sound source is within the target sound range.
These combinations may be chosen freely according to the performance of the image processing unit 112 and the user's usage conditions.

The method of notifying the user that there is no target sound source in the target sound range is not limited to LED lighting, vibration, and audio output; for example, a message indicating that there is no target sound source may be displayed on the display panel 4.

In addition, the concept of the present invention is not limited to a dedicated computer system and is applicable to any portable electronic device equipped with an imaging unit and a plurality of audio input units, such as a mobile phone, a PDA, or an electronic camera. That is, a computer program for causing a computer to function and operate as the mobile terminal 1 may be created, distributed, or lent and installed on a computer, and the computer may then be used, transferred, or lent as the mobile terminal 1.

External view of the mobile terminal in the opened state, seen from the front.
Block diagram showing the configuration of the mobile terminal.
Diagram showing an example of the target sound range change mode information.
Diagram showing an example of the target sound source presence/absence information.
Diagram showing an example of the notification information.
Diagram showing an example of the target sound range information.
Diagram showing an example of the target sound angle information.
Flowchart showing the process by which the mobile terminal separates and extracts the target sound, from the start to the end of the videophone function.
Flowchart showing the target sound range determination process.
(a) Captured image containing a plurality of persons. (b) Image in which the target sound range is superimposed on the face of the person with the largest face image in the captured image, together with a sound level display image. (c) Image in which the target sound range is superimposed on the whole body of the person with the largest face image in the captured image. (d) Captured image containing only one person. (e) Image in which the target sound range covers the entire area of the captured image, together with a sound level display image.
Flowchart showing the target sound range change process according to the operation mode.
Flowchart showing the detection process.
Flowchart showing the notification process.
Flowchart showing the target sound range manual change process.
Flowchart showing the target sound range automatic change process.
Flowchart showing the notification cancellation process.
(a) Image in which the captured image and the target sound range are superimposed. (b) Diagram showing the angle of view of the imaging unit and the angles corresponding to the target sound range.
(a) Captured image and position of the target sound source before the target sound range is changed. (b) Angle of view of the imaging unit before the target sound range is changed.
(a) Captured image and position of the target sound source after the target sound range is changed. (b) Angle of view of the imaging unit after the target sound range is changed.

Explanation of symbols

1: mobile terminal; 11: control unit; 12: wireless communication unit; 13: storage unit; 14: operation unit; 15: imaging unit; 16: display unit; 17: audio input unit; 18: audio output unit; 19: notification unit; 20: bus
111: sound source localization separation unit; 112: image processing unit; 113: detection unit; 114: notification control unit; 115: target sound range manual changing unit; 116: target sound range automatic changing unit
131: target sound range change mode information; 132: target sound source presence/absence information; 133: notification information; 134: target sound range information; 135: target sound angle information
PT1, PT2: captured image; OS1, OS2: target sound range; P1, P2, P3, P4: person; VD1, VD2: sound level display image; BD: board
x1, x2: horizontal width on the captured image; x3, x4: width of the captured image; y1, y2: vertical height on the captured image; y3: height of the captured image
m: target sound angle corresponding to the left end of the target sound range; n: target sound angle corresponding to the right end of the target sound range; p: angle of the sound source; A1: angle at the left end of the captured image; A2: angle at the right end of the captured image; Ap: predetermined angle; θ1, θ2: angle; θ3, θ4: angle of view of the imaging unit

Claims

An information processing apparatus comprising:
voice input means for inputting sound;
imaging means for capturing an image;
display means for displaying data;
display control means for superimposing and displaying on the display means the captured image captured by the imaging means and a specific area for specifying the direction of arrival of the sound to be separated and extracted;
sound source localization means for specifying the direction of a sound source;
sound source presence / absence detecting means for detecting whether or not the direction of the sound source specified by the sound source localization means corresponds to the specific area displayed on the display means;
area position changing means for changing the position of the specific area so as to coincide with the direction of the sound source when the sound source presence / absence detecting means detects that the direction of the sound source does not correspond to the specific area;
area display control means for superimposing and displaying on the display means the captured image captured by the imaging means and the specific area whose position has been changed by the area position changing means; and
sound source separation means for separating and extracting, from the sound input by the voice input means, the sound arriving from the direction specified by the specific area displayed on the display means by the area display control means,
wherein the display control means includes:
in-image person presence / absence determining means for determining whether or not a person is included in the captured image of the imaging means;
number-of-persons determining means for determining whether or not a plurality of persons are included in the captured image when the in-image person presence / absence determining means determines that a person is included in the captured image;
initial area setting means for setting the position and size of the entire area of the captured image of the imaging means as the position and size of the specific area when the number-of-persons determining means determines that the number is not plural; and
means for superimposing and displaying on the display means the captured image captured by the imaging means and the specific area set by the initial area setting means.

The information processing apparatus according to claim 1, wherein the area position changing means includes:
direction search means for causing the sound source localization means to search for the direction from which sound arrives when the sound source presence / absence detecting means detects that the direction of the sound source does not correspond to the specific area;
image position specifying means for specifying the position, on the captured image captured by the imaging means, that corresponds to the direction searched by the direction search means; and
changing means for changing the position of the specific area to the position specified by the image position specifying means.

The information processing apparatus according to claim 2, wherein the area display control means includes:
sound source image presence / absence determining means for determining whether or not the sound source located in the direction searched by the direction search means is present in the captured image captured by the imaging means; and
means for, when the sound source image presence / absence determining means determines that the sound source is not present in the captured image, causing the imaging means to change its angle of view so that the captured image includes an image of the sound source, and superimposing and displaying on the display means the captured image captured by the imaging means with the changed angle of view and the specific area.

The information processing apparatus according to claim 3, wherein the area display control means includes:
sound source image size specifying means for specifying, when the sound source image presence / absence determining means determines that the sound source is present in the captured image, the size of the image of the sound source located in the direction searched by the direction search means in the captured image captured by the imaging means; and
specific area size changing means for changing the size of the specific area to the size specified by the sound source image size specifying means, and superimposing and displaying on the display means the captured image captured by the imaging means and the specific area of the changed size.

The information processing apparatus according to claim 4, wherein:
the direction search means includes maximum level direction search means for searching, when the sound source presence / absence detecting means detects that the direction of the sound source does not correspond to the specific area, for the direction of arrival of the sound having the highest level among the sounds input by the voice input means; and
the sound source image size specifying means determines whether or not the level of the sound arriving from the direction searched by the maximum level direction search means is equal to or higher than a predetermined level, and specifies the size when it determines that the level is equal to or higher than the predetermined level.

The information processing apparatus according to claim 1, wherein the sound source presence / absence detecting means includes:
incoming sound level determining means for determining whether or not the level of the sound arriving from the direction specified by the specific area is equal to or higher than a predetermined level;
feature point presence / absence determining means for determining, when the incoming sound level determining means determines that the level is not equal to or higher than the predetermined level, whether or not a feature point of a person is present in the image within the specific area of the captured image of the imaging means; and
means for determining that a sound source generating the sound to be separated and extracted is present in the direction corresponding to the specific area when the feature point presence / absence determining means determines that a feature point of a person is present in the image within the specific area.

The information processing apparatus according to claim 1, wherein the area position changing means includes:
notification means for giving notification when the sound source presence / absence detecting means detects that the direction of the sound source does not correspond to the specific area; and
changing means for changing at least one of the position and size of the specific area in response to a user operation.

The information processing apparatus according to any one of claims 1 to 7, wherein the display control means includes:
in-image person presence / absence determining means for determining whether or not a person is included in the captured image of the imaging means;
number-of-persons determining means for determining whether or not a plurality of persons are included in the captured image when the in-image person presence / absence determining means determines that a person is included in the captured image;
maximum-face person specifying means for specifying the person with the largest face image in the captured image of the imaging means when the number-of-persons determining means determines that the number is plural;
person area specifying means for specifying the image area of the person specified by the maximum-face person specifying means in the captured image of the imaging means;
second initial area setting means for presetting the specific area so that it includes at least part of the image area of the person specified by the person area specifying means; and
means for superimposing and displaying on the display means the captured image captured by the imaging means and the specific area set by the second initial area setting means.

The information processing apparatus according to any one of claims 1 to 8, further comprising:
direction-of-arrival image specifying means for specifying, in the captured image captured by the imaging means, the direction from which the sound input by the voice input means arrives;
image generation means for generating a direction-level correspondence image in which the direction specified by the direction-of-arrival image specifying means is associated with the level of the sound arriving from that direction;
correspondence image display control means for causing the display means to display the direction-level correspondence image generated by the image generation means, superimposed on the captured image captured by the imaging means and the specific area; and
level-visible area changing means for changing, in response to a user operation, at least one of the position and size of the specific area displayed on the display means while the correspondence image display control means is displaying the direction-level correspondence image on the display means.

The information processing apparatus according to any one of claims 1 to 9, further comprising:
communication means for communicating with another device; and
application execution means for executing an application that uses the communication means,
wherein the communication means transmits at least one of the sound data separated and extracted by the sound source separation means and the captured image data captured by the imaging means to the other device while the application execution means is executing the application.

The information processing apparatus according to any one of claims 1 to 10, further comprising:
storage means for storing data; and
means for storing in the storage means at least one of the sound separated and extracted by the sound source separation means and the captured image captured by the imaging means.

A program for causing a computer to function as:
display control means for superimposing and displaying on a display unit a captured image captured by imaging means and a specific area for specifying the direction of arrival of the sound to be separated and extracted;
sound source localization means for specifying the direction of a sound source;
sound source presence / absence detecting means for detecting whether or not the direction of the sound source specified by the sound source localization means corresponds to the specific area displayed on the display unit by the display control means;
area position changing means for changing the position of the specific area so as to coincide with the direction of the sound source when the sound source presence / absence detecting means detects that the direction of the sound source does not correspond to the specific area;
area display control means for superimposing and displaying on the display unit the captured image captured by the imaging means and the specific area whose position has been changed by the area position changing means; and
sound source separation means for separating and extracting, from input sound, the sound arriving from the direction specified by the specific area displayed on the display unit by the area display control means,
wherein the display control means includes:
in-image person presence / absence determining means for determining whether or not a person is included in the captured image of the imaging means;
number-of-persons determining means for determining whether or not a plurality of persons are included in the captured image when the in-image person presence / absence determining means determines that a person is included in the captured image;
initial area setting means for setting the position and size of the entire area of the captured image of the imaging means as the position and size of the specific area when the number-of-persons determining means determines that the number is not plural; and
means for superimposing and displaying on the display unit the captured image captured by the imaging means and the specific area set by the initial area setting means.